
[FS] Speed-up LocalFile.copy() by using Java-NIO's Files.copy() #1471

Merged

Conversation

@HannesWell (Member) commented Jul 21, 2024

Using Files.copy() can be significantly faster than reading the content from a source InputStream and writing it to a target OutputStream.

The only drawback of using Files.copy() is that continuous progress reporting during the copy operation is not possible. For small files (which are copied in a short time) this is not really relevant, and for large files the performance gain outweighs the loss of progress information.
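
For illustration, a schematic sketch of the two approaches being compared (this is not the actual LocalFile code; the class and method names are made up):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CopyComparison {
	// Before: manual buffer loop between a source InputStream and a target OutputStream
	static void copyViaStreams(Path source, Path target) throws IOException {
		try (InputStream in = Files.newInputStream(source);
				OutputStream out = Files.newOutputStream(target)) {
			byte[] buffer = new byte[8192];
			int bytesRead;
			while ((bytesRead = in.read(buffer)) != -1) {
				out.write(buffer, 0, bytesRead);
			}
		}
	}

	// After: delegate to the filesystem provider, which can use platform-specific fast paths
	static void copyViaNio(Path source, Path target) throws IOException {
		Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
	}
}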

@HannesWell (Member, Author)

An example of possible speed-ups can be found in https://bugs.eclipse.org/bugs/show_bug.cgi?id=577432. The exact values of course depend on the actual hard drive and the size of the file(s) copied.
In a little test with an IDE started with this patch, where I copied a 1GB file from one project to another, the copying was also noticeably faster.

github-actions bot (Contributor) commented Jul 21, 2024

Test Results

 1 734 files  ±0   1 734 suites  ±0   1h 33m 37s ⏱️ + 9m 52s
 3 972 tests ±0   3 950 ✅ ±0   22 💤 ±0  0 ❌ ±0 
12 513 runs  ±0  12 352 ✅ ±0  161 💤 ±0  0 ❌ ±0 

Results for commit 86c92ed. ± Comparison against base commit 33aba47.

♻️ This comment has been updated with latest results.

@vogella (Contributor) commented Jul 21, 2024

Thanks @HannesWell for making our IDE faster!

@laeubi (Contributor) commented Jul 21, 2024

The only drawback of using Files.copy() is that continuous progress reporting during the copy operation is not possible. For small files (which are copied in a short time) this is not really relevant, and for large files the performance gain outweighs the loss of progress information.

If one wants progress (and the size is known to be large), one can use a thread that regularly reports progress via a SubMonitor: start with a large amount of total work (e.g. 1000), then repeatedly report a progress of 1 and set the remaining work back to 1000, as sketched below.

Another alternative would be to observe the target size on the filesystem during the copy operation.
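
A minimal sketch of the endless-progress idea, assuming org.eclipse.core.runtime.SubMonitor (the polling interval is illustrative, and note that IProgressMonitor implementations are not guaranteed to be thread-safe, so this is a simplification). The reporter thread could instead poll the target file's size to implement the second alternative:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.core.runtime.SubMonitor;

class EndlessProgressCopy {
	// Copies source to target while a background thread keeps the progress
	// bar moving: each tick reports 1 unit of work and then resets the
	// remaining work to 1000, so the bar never quite completes until the
	// copy itself is done.
	static void copyWithProgress(Path source, Path target, IProgressMonitor monitor) throws IOException {
		SubMonitor subMonitor = SubMonitor.convert(monitor, 1000);
		Thread reporter = new Thread(() -> {
			while (!Thread.currentThread().isInterrupted()) {
				subMonitor.worked(1);
				subMonitor.setWorkRemaining(1000);
				try {
					Thread.sleep(500); // illustrative polling interval
				} catch (InterruptedException e) {
					return; // copy finished, stop reporting
				}
			}
		});
		reporter.start();
		try {
			Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
		} finally {
			reporter.interrupt(); // stop the reporter once the copy is done
		}
	}
}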

@HannesWell (Member, Author)

Another alternative would be to observe the target size on the filesystem during the copy operation.

Yes, that would be feasible, but I would actually like to avoid that additional complexity. With recent hardware, copying even large files is relatively quick. E.g. the actual copy process of my 1GB file only took one or two seconds (I didn't measure it) and the waiting cursor was visible only for a very short time. Of course not everyone has an SSD, but at the same time I don't think that copying such large single files happens that often.

If something like that is implemented, it could also be used for #1475. But if the file size is not known, it is not very useful to report something just for the sake of reporting. As far as I know, to relax nervous users the progress bar is already animated to give the impression of progress even if nothing is reported.

@HannesWell force-pushed the speedup-localFile-copy branch from 6c011dd to 1a9af68 on July 22, 2024 20:45
@HannesWell (Member, Author)

So what's the opinion of the other committers about this (and #1475)?
Do you think it is acceptable to have less fine-grained progress reporting (only noticeable for huge files) in exchange for better performance (and simpler code in the case of #1475)?
I would be fine with this as it is.

Monitoring the size of the target file would add quite a bit of complexity and would also not be possible in all cases of #1475 since, for example, we don't know the exact length of the content that can be read from the InputStream.

If we agree on this, I suggest submitting it in time before M2 on Friday.

@iloveeclipse (Member)

Do you think it is acceptable to have less fine-grained progress reporting (only noticeable for huge files) in exchange for better performance (and simpler code in the case of #1475)?

I haven't had time to read the patch, but I would be fine with the above.

@vogella (Contributor) commented Jul 23, 2024

Do you think it is acceptable to have less fine-grained progress reporting

Yes, faster is more important.

@HeikoKlare (Contributor) left a comment

I am fine with the proposed change as well, both from the conceptual and the technical side. I only have a minor comment regarding error generation.

@HannesWell force-pushed the speedup-localFile-copy branch from 1a9af68 to fcd86c9 on July 23, 2024 07:31
@HannesWell (Member, Author) commented Jul 23, 2024

Thank you all for your positive feedback. If there are no objections in the meantime, I plan to submit this and #1475 this (European) evening. Then we will have it in time for M2.

@HeikoKlare (Contributor) left a comment

Thank you for working on these improvements, Hannes!

@HannesWell force-pushed the speedup-localFile-copy branch from fcd86c9 to 86c92ed on July 23, 2024 16:55
@HannesWell merged commit 034c7cc into eclipse-platform:master on Jul 23, 2024
16 checks passed
@HannesWell deleted the speedup-localFile-copy branch on July 23, 2024 21:04
@jukzi (Contributor) commented Aug 12, 2024

@HannesWell Please prove it is faster for small files or revert. I tried the same "optimization" multiple times in past years (last example #1452 (comment)) and I always had to conclude from measurements that Files.copy() was slower on Windows.
I don't know why exactly it is slower. My guess is that the JDK does more checks or copies more (unneeded) attributes. I doubt it magically changed in the past month.
Why do I measure small files? This operation is heavily used during a full build to copy a lot of resources to the target.

@HannesWell (Member, Author) commented Aug 12, 2024

@HannesWell Please prove it is faster for small files or revert. I tried the same "optimization" multiple times in past years (last example #1452 (comment)) and I always had to conclude from measurements that Files.copy() was slower on Windows.

It would have been more useful if you had provided more details about your benchmarks. Otherwise it's just a statement nobody can discuss. As you probably know, statements about performance are often not absolute and are usually context dependent. For example, for small files I can indeed reproduce that Files.copy() is slower than just reading+writing/transferring the bytes, but when files get larger the situation is the opposite.
So in this case we should either find a solution that works well in all cases or decide what's more important for us.

I don't know why exactly it is slower. My guess is that the JDK does more checks or copies more (unneeded) attributes. I doubt it magically changed in the past month. Why do I measure small files? This operation is heavily used during a full build to copy a lot of resources to the target.

From looking into the JDK code, I assume this is probably because Files.copy() reads the file attributes of the source and target file (the target may not exist, in which case an exception is thrown and caught internally) and only then calls the native Windows function CopyFileEx().

I have now spent more time on this and created a very simple benchmark that uses the code below in the displayed variations. (I know there are many uncertainties in it and one should use JMH, so the numbers probably have quite some error!)

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Random;

public class FileCopyTests {
	private static final Path ROOT = Path.of("C:\\fileCopyBenchmark");
	private static final long TOTAL_FILE_SIZE = 10_000_000_000L;
	private static final Random RND = new Random();

	public static void main(String[] args) throws IOException {
		createFilesForCaseEnvironment(1024 * 1024 * 1024, "1G"); //$NON-NLS-1$

		var sourceDir = ROOT.resolve("1G").toFile();
		var targetDir = Files.createDirectories(ROOT.resolve("target_1G-3")).toFile();
		long start = System.currentTimeMillis();
		for (int i = 0; i < 9; i++) {
			String fileName = "f" + i;
			// swap in copyStreams(...) or filesCopy(...) to measure the other variants
			transferStreams(new File(sourceDir, fileName), new File(targetDir, fileName));
		}
		System.out.println((System.currentTimeMillis() - start) + "ms");
	}

	// Variant 1: manual read/write loop with an 8 kB buffer
	private static void copyStreams(File sourcePath, File targetPath) throws IOException {
		byte[] buffer = new byte[8192];
		try (var source = new FileInputStream(sourcePath);
				var destination = new FileOutputStream(targetPath)) {
			while (true) {
				int bytesRead = source.read(buffer);
				if (bytesRead == -1) {
					break;
				}
				destination.write(buffer, 0, bytesRead);
			}
		}
	}

	// Variant 2: InputStream.transferTo() (optimized in newer JDKs)
	private static void transferStreams(File sourcePath, File targetPath) throws IOException {
		try (var source = new FileInputStream(sourcePath);
				var destination = new FileOutputStream(targetPath)) {
			source.transferTo(destination);
		}
	}

	// Variant 3: NIO Files.copy()
	private static void filesCopy(Path sourcePath, Path targetPath) throws IOException {
		Files.copy(sourcePath, targetPath, StandardCopyOption.REPLACE_EXISTING);
	}

	private static void createFilesForCaseEnvironment(int singleFileSize, String containerName) throws IOException {
		Path path = Files.createDirectories(ROOT.resolve(containerName));
		long fileCount = TOTAL_FILE_SIZE / singleFileSize;

		for (long i = 0; i < fileCount; i++) {
			byte[] content = new byte[singleFileSize];
			RND.nextBytes(content);
			Files.write(path.resolve("f" + i), content); //$NON-NLS-1$
		}
	}
}

These are the numbers I get for the tested scenarios with different file sizes (reminder: the times probably have a relatively large error):

1kB file-size (100MB in total):
  stream-read+write/copy: 30345ms
  stream-transfer: 30052ms
  stream-transfer(Java-21): 33440ms
  Files.copy: 48749ms

1MB file-size (2GB total):
  stream-read+write/copy: 2886ms
  stream-transfer: 2301ms
  stream-transfer(Java-21): 1387ms
  Files.copy: 1496ms

1GB file-size (10GB total):
  stream-read+write/copy: 9038ms
  stream-transfer: 9256ms
  stream-transfer(Java-21): 6572ms
  Files.copy: 3139ms

In Java 21, FileInputStream.transferTo() is optimized to use FileChannel.transfer*() if the OutputStream is a FileOutputStream.
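
For reference, a sketch of what such a channel-based transfer looks like when done explicitly (names are illustrative; transferTo() may transfer fewer bytes than requested, hence the loop):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChannelCopy {
	static void copy(Path source, Path target) throws IOException {
		try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
				FileChannel out = FileChannel.open(target, StandardOpenOption.WRITE,
						StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
			long position = 0;
			long size = in.size();
			while (position < size) {
				// transferTo() can return early, so keep going until all bytes are moved
				position += in.transferTo(position, size - position, out);
			}
		}
	}
}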

@HannesWell (Member, Author) commented Aug 12, 2024

So in this case we should either find a solution that works well in all cases or decide what's more important for us.

I think we have two options here:

  1. Decide based on the file size (we have already read the length into the IFileInfo) whether InputStream.transferTo() (used since [FS] Use InputStream.transferTo() instead of custom transfers #1475) or Files.copy() should be used.
  2. Call the native method CopyFileExW() or CopyFile() directly.
    It would then probably be best to add a new method NativeHandler.copyFileContent() so that it can be implemented on all platforms.

Maybe the latter is even more performant for small files, since we would save the source/target attribute reads (this has to be checked), but it is more work and I don't think I have the time to land it for the upcoming release.
So my suggestion is to go with option 1 for now (sketched below) and to try out option 2 later.
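
To illustrate option 1, a hedged sketch (the 1 MiB threshold and the copyFileContent name are assumptions for illustration, not the actual patch):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SizeDispatchedCopy {
	// Illustrative cutoff; the real value would have to be determined by benchmarks.
	private static final long LARGE_FILE_THRESHOLD = 1024 * 1024; // 1 MiB

	static void copyFileContent(Path source, Path target, long sourceLength) throws IOException {
		if (sourceLength >= LARGE_FILE_THRESHOLD) {
			// Large files: Files.copy() wins despite its extra attribute reads
			Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
		} else {
			// Small files: a plain stream transfer avoids the Files.copy() setup cost
			try (InputStream in = Files.newInputStream(source);
					OutputStream out = Files.newOutputStream(target)) {
				in.transferTo(out);
			}
		}
	}
}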

@jukzi (Contributor) commented Aug 13, 2024

It would have been more useful if you had provided more details about your benchmarks. Otherwise it's just a statement nobody can discuss.

I used a JMH benchmark:

package benchmarks.file;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.attribute.BasicFileAttributes;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import com.google.common.io.Files;

@BenchmarkMode(org.openjdk.jmh.annotations.Mode.SingleShotTime)
@OutputTimeUnit(java.util.concurrent.TimeUnit.MILLISECONDS)
@State(org.openjdk.jmh.annotations.Scope.Thread)
@Fork(value = 0, jvmArgsAppend = { "-ea" })
@Warmup(time = 50, timeUnit = java.util.concurrent.TimeUnit.MILLISECONDS)
@Measurement(batchSize = 1, iterations = 100)
public class FileCopy {
	@Param({ "100" })
	int files;
	@Param({ "5000", "50000", "500000" })
	int size;
	@Param({ "out%03d.tmp" })
	String fileName;
	String dirName = "tmp/"; // assumes ./tmp exists
	java.io.File f[];

	byte[] data1;

	@Setup(org.openjdk.jmh.annotations.Level.Iteration)
	public void setup() throws IOException {
		f = new java.io.File[files];
		for (int i = 0; i < files; i++) {
			f[i] = new java.io.File(dirName + String.format(fileName, i)).getAbsoluteFile();
			f[i].delete();
		}
		// f[0] is the source file, filled with random content once per iteration
		data1 = randomBytes(size);
		Files.write(data1, f[0]);
	}

	public static byte[] randomBytes(int length) {
		byte[] d = new byte[length];
		for (int i = 0; i < d.length; i++) {
			d[i] = (byte) (Math.random() * 256);
		}
		return d;
	}

	@TearDown
	public void cleanup() {
		for (int i = 0; i < files; i++) {
			f[i].delete();
		}
	}

	// Variant 1: NIO Files.copy() including attribute copying
	@Benchmark
	public void copy() throws IOException {
		for (int i = 1; i < files; i++) {
			java.nio.file.Files.copy(f[0].toPath(), f[i].toPath(), java.nio.file.StandardCopyOption.COPY_ATTRIBUTES,
					java.nio.file.StandardCopyOption.REPLACE_EXISTING);
		}
	}

	// Variant 2: stream transfer plus a manual timestamp copy, to roughly
	// match what COPY_ATTRIBUTES does in the other variant
	@Benchmark
	public void copyStream() throws IOException {
		for (int i = 1; i < files; i++) {
			try (FileInputStream in = new FileInputStream(f[0]);
					FileOutputStream out = new FileOutputStream(f[i])) {
				in.transferTo(out);
			}
			BasicFileAttributes att = java.nio.file.Files.readAttributes(f[0].toPath(), BasicFileAttributes.class);
			f[i].setLastModified(att.creationTime().toMillis());
		}
	}

	public static void main(String[] args) throws RunnerException {
		Options opt = new OptionsBuilder().include(FileCopy.class.getSimpleName()).shouldFailOnError(true).build();
		new org.openjdk.jmh.runner.Runner(opt).run();
		// originally run through a custom result formatter:
		// helper.ColumResultFormat.run(opt);
	}
}

[image: JMH benchmark results]

I doubt that copying a 1GB file is the standard use case that we should optimize for. Based on the workspaces I know, resources typically have a size of 5 kB to 500 kB. However, as you can see in my benchmark results, the copy time depends only slightly on the actual file size in that range. It's just a fast memory copy anyway, while opening the file in the filesystem is the most costly part on my computer.

Option 3: just revert your changes.

@HannesWell (Member, Author)

I doubt that copying a 1GB file is the standard use case that we should optimize for. Based on the workspaces I know, resources typically have a size of 5 kB to 500 kB.

Smaller files are probably copied way more often.

However, as you can see in my benchmark results, the copy time depends only slightly on the actual file size in that range. It's just a fast memory copy anyway, while opening the file in the filesystem is the most costly part on my computer.

In general there are a lot of variables that make it hard to make solid statements about the performance of file IO. There are just too many factors: Does one use a hard-disk drive, a SATA SSD, or an NVMe SSD? Which OS and which filesystem are used? E.g. on Windows, the activity of antivirus software or the Windows Search indexer can temporarily slow down IO operations drastically (which can make benchmarking a challenge).
And at least Files.copy() can perform differently if the copy target already exists, because the target is then deleted explicitly, which costs time again.

Option 3: just revert your changes.

That's an option, but I would be in favor of a solution that gives us the best of both worlds:

I'll provide a follow-up later with an implementation that calls native copy methods. But completing it is something for the next release cycle.
