
[FS] Speed-up LocalFile.copy() by using Java-NIO's Files.copy() #1471

Merged

Conversation

@HannesWell (Member) commented Jul 21, 2024

Using Files.copy() can be significantly faster than reading the content from a source InputStream and writing it to a target OutputStream.

The only drawback of using Files.copy() is that continuous progress reporting during the copy operation is not possible. For small files (which are copied in a short time) this is not really relevant, and for large files the performance gain outweighs the loss of progress information.
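
For illustration, a schematic sketch of the two approaches being compared (this is not the actual LocalFile code; the class and method names are made up):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CopyComparison {
	// Before: manual buffer loop between a source InputStream and a target OutputStream
	static void copyViaStreams(Path source, Path target) throws IOException {
		try (InputStream in = Files.newInputStream(source);
				OutputStream out = Files.newOutputStream(target)) {
			byte[] buffer = new byte[8192];
			int bytesRead;
			while ((bytesRead = in.read(buffer)) != -1) {
				out.write(buffer, 0, bytesRead);
			}
		}
	}

	// After: delegate to the filesystem provider, which can use platform-specific fast paths
	static void copyViaNio(Path source, Path target) throws IOException {
		Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
	}
}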

@HannesWell (Member, Author)

An example of possible speed-ups can be found in https://bugs.eclipse.org/bugs/show_bug.cgi?id=577432. The exact values of course depend on the actual hard drive and the size of the file(s) copied.
In a little test with an IDE started with this patch, where I copied a 1GB file from one project to another, the copying was also noticeably faster.

github-actions bot (Contributor) commented Jul 21, 2024

Test Results

 1 734 files  ±0   1 734 suites  ±0   1h 33m 37s ⏱️ + 9m 52s
 3 972 tests ±0   3 950 ✅ ±0   22 💤 ±0  0 ❌ ±0 
12 513 runs  ±0  12 352 ✅ ±0  161 💤 ±0  0 ❌ ±0 

Results for commit 86c92ed. ± Comparison against base commit 33aba47.

♻️ This comment has been updated with latest results.

@vogella (Contributor) commented Jul 21, 2024

Thanks @HannesWell for making our IDE faster!

@laeubi (Contributor) commented Jul 21, 2024

The only drawback of using Files.copy() is that continuous progress reporting during the copy operation is not possible. For small files (which are copied in a short time) this is not really relevant, and for large files the performance gain outweighs the loss of progress information.

If one wants progress (and the size is known to be large), one can use a thread that regularly reports progress via a SubMonitor: start with a large amount of total work (e.g. 1000), then repeatedly report a progress of 1 and set the remaining work back to 1000, as sketched below.

Another alternative would be to observe the target size on the filesystem during the copy operation.
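
A minimal sketch of the endless-progress idea, assuming org.eclipse.core.runtime.SubMonitor (the polling interval is illustrative, and note that IProgressMonitor implementations are not guaranteed to be thread-safe, so this is a simplification). The reporter thread could instead poll the target file's size to implement the second alternative:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.core.runtime.SubMonitor;

class EndlessProgressCopy {
	// Copies source to target while a background thread keeps the progress
	// bar moving: each tick reports 1 unit of work and then resets the
	// remaining work to 1000, so the bar never quite completes until the
	// copy itself is done.
	static void copyWithProgress(Path source, Path target, IProgressMonitor monitor) throws IOException {
		SubMonitor subMonitor = SubMonitor.convert(monitor, 1000);
		Thread reporter = new Thread(() -> {
			while (!Thread.currentThread().isInterrupted()) {
				subMonitor.worked(1);
				subMonitor.setWorkRemaining(1000);
				try {
					Thread.sleep(500); // illustrative polling interval
				} catch (InterruptedException e) {
					return; // copy finished, stop reporting
				}
			}
		});
		reporter.start();
		try {
			Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
		} finally {
			reporter.interrupt(); // stop the reporter once the copy is done
		}
	}
}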

@HannesWell (Member, Author)

Another alternative would be to observe the target size on the filesystem during the copy operation.

Yes, that would be feasible, but I would actually like to avoid that additional complexity. With recent hardware, copying even large files is relatively quick. E.g. the actual copy process of my 1GB file only took one or two seconds (I didn't measure it) and the waiting cursor was visible only for a very short time. Of course not everyone has an SSD, but at the same time I don't think that copying such large single files happens that often.

If something like that is implemented, it could also be used for #1475. But if the file size is not known, it is not very useful to report something just for the sake of reporting. As far as I know, to relax nervous users the progress bar is already animated to give the impression of progress even if nothing is reported.

@HannesWell force-pushed the speedup-localFile-copy branch from 6c011dd to 1a9af68 on July 22, 2024 20:45
@HannesWell (Member, Author)

So what's the opinion of the other committers about this (and #1475)?
Do you think it is acceptable to have less fine-grained progress reporting (only noticeable for huge files) in exchange for better performance (and simpler code in the case of #1475)?
I would be fine with this as it is.

Monitoring the size of the target file would add quite a bit of complexity and would also not be possible in all cases of #1475 since, for example, we don't know the exact length of the content that can be read from the InputStream.

If we agree on this, I suggest submitting it in time before M2 on Friday.

@iloveeclipse (Member)

Do you think it is acceptable to have less fine-grained progress reporting (only noticeable for huge files) in exchange for better performance (and simpler code in the case of #1475)?

I haven't had time to read the patch, but I would be fine with the above.

@vogella (Contributor) commented Jul 23, 2024

Do you think it is acceptable to have less fine-grained progress reporting

Yes, faster is more important.

@HeikoKlare (Contributor) left a comment

I am fine with the proposed change as well, both from the conceptual and the technical side. I only have a minor comment regarding error generation.

@HannesWell force-pushed the speedup-localFile-copy branch from 1a9af68 to fcd86c9 on July 23, 2024 07:31
@HannesWell (Member, Author) commented Jul 23, 2024

Thank you all for your positive feedback. If there are no objections in the meantime, I plan to submit this and #1475 this (European) evening. Then we will have it in time for M2.

@HeikoKlare (Contributor) left a comment

Thank you for working on these improvements, Hannes!

@HannesWell force-pushed the speedup-localFile-copy branch from fcd86c9 to 86c92ed on July 23, 2024 16:55
@HannesWell merged commit 034c7cc into eclipse-platform:master on Jul 23, 2024
16 checks passed
@HannesWell deleted the speedup-localFile-copy branch on July 23, 2024 21:04
@jukzi (Contributor) commented Aug 12, 2024

@HannesWell Please prove it is faster for small files or revert. I tried the same "optimization" multiple times in past years (last example #1452 (comment)) and I always had to conclude from measurements that Files.copy() was slower on Windows.
I don't know why exactly it is slower. My guess is that the JDK does more checks or copies more (unneeded) attributes. I doubt it magically changed in the past month.
Why do I measure small files? This operation is heavily used during a full build to copy a lot of resources to the target.

@HannesWell (Member, Author) commented Aug 12, 2024

@HannesWell Please prove it is faster for small files or revert. I tried the same "optimization" multiple times in past years (last example #1452 (comment)) and I always had to conclude from measurements that Files.copy() was slower on Windows.

It would have been more useful if you had provided more details about your benchmarks. Otherwise it's just a statement nobody can discuss. As you probably know, statements about performance are often not absolute and are usually context dependent. For example, for small files I can indeed reproduce that Files.copy() is slower than just reading+writing/transferring the bytes, but when files get larger the situation is the opposite.
So in this case we should either find a solution that works well in all cases or decide what's more important for us.

I don't know why exactly it is slower. My guess is that the JDK does more checks or copies more (unneeded) attributes. I doubt it magically changed in the past month. Why do I measure small files? This operation is heavily used during a full build to copy a lot of resources to the target.

From looking into the JDK code, I assume this is probably because Files.copy() reads the file attributes of the source and target file (the target may not exist, in which case an exception is thrown and caught internally) and only then calls the native Windows function CopyFileEx().

I have now spent more time on this and created a very simple benchmark that uses the code below in the displayed variations. (I know there are many uncertainties in it and one should use JMH, so the numbers probably have quite some error!)

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Random;

public class FileCopyTests {
	private static final Path ROOT = Path.of("C:\\fileCopyBenchmark");
	private static final long TOTAL_FILE_SIZE = 10_000_000_000L;
	private static final Random RND = new Random();

	public static void main(String[] args) throws IOException {
		createFilesForCaseEnvironment(1024 * 1024 * 1024, "1G"); //$NON-NLS-1$

		var sourceDir = ROOT.resolve("1G").toFile();
		var targetDir = Files.createDirectories(ROOT.resolve("target_1G-3")).toFile();
		long start = System.currentTimeMillis();
		for (int i = 0; i < 9; i++) {
			String fileName = "f" + i;
			// swap in copyStreams(...) or filesCopy(...) to measure the other variants
			transferStreams(new File(sourceDir, fileName), new File(targetDir, fileName));
		}
		System.out.println((System.currentTimeMillis() - start) + "ms");
	}

	// Variant 1: manual read/write loop with an 8 kB buffer
	private static void copyStreams(File sourcePath, File targetPath) throws IOException {
		byte[] buffer = new byte[8192];
		try (var source = new FileInputStream(sourcePath);
				var destination = new FileOutputStream(targetPath)) {
			while (true) {
				int bytesRead = source.read(buffer);
				if (bytesRead == -1) {
					break;
				}
				destination.write(buffer, 0, bytesRead);
			}
		}
	}

	// Variant 2: InputStream.transferTo() (optimized in newer JDKs)
	private static void transferStreams(File sourcePath, File targetPath) throws IOException {
		try (var source = new FileInputStream(sourcePath);
				var destination = new FileOutputStream(targetPath)) {
			source.transferTo(destination);
		}
	}

	// Variant 3: NIO Files.copy()
	private static void filesCopy(Path sourcePath, Path targetPath) throws IOException {
		Files.copy(sourcePath, targetPath, StandardCopyOption.REPLACE_EXISTING);
	}

	private static void createFilesForCaseEnvironment(int singleFileSize, String containerName) throws IOException {
		Path path = Files.createDirectories(ROOT.resolve(containerName));
		long fileCount = TOTAL_FILE_SIZE / singleFileSize;

		for (long i = 0; i < fileCount; i++) {
			byte[] content = new byte[singleFileSize];
			RND.nextBytes(content);
			Files.write(path.resolve("f" + i), content); //$NON-NLS-1$
		}
	}
}

These are the numbers I get for the tested scenarios with different file sizes (reminder: the times probably have a relatively large error):

1kB file-size (100MB in total):
  stream-read+write/copy: 30345ms
  stream-transfer: 30052ms
  stream-transfer(Java-21): 33440ms
  Files.copy: 48749ms

1MB file-size (2GB total):
  stream-read+write/copy: 2886ms
  stream-transfer: 2301ms
  stream-transfer(Java-21): 1387ms
  Files.copy: 1496ms

1GB file-size (10GB total):
  stream-read+write/copy: 9038ms
  stream-transfer: 9256ms
  stream-transfer(Java-21): 6572ms
  Files.copy: 3139ms

In Java 21, FileInputStream.transferTo() is optimized to use FileChannel.transfer*() if the OutputStream is a FileOutputStream.
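
For reference, a sketch of what such a channel-based transfer looks like when done explicitly (names are illustrative; transferTo() may transfer fewer bytes than requested, hence the loop):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChannelCopy {
	static void copy(Path source, Path target) throws IOException {
		try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
				FileChannel out = FileChannel.open(target, StandardOpenOption.WRITE,
						StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
			long position = 0;
			long size = in.size();
			while (position < size) {
				// transferTo() can return early, so keep going until all bytes are moved
				position += in.transferTo(position, size - position, out);
			}
		}
	}
}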

@HannesWell (Member, Author) commented Aug 12, 2024

So in this case we should either find a solution that works well in all cases or decide what's more important for us.

I think we have two options here:

  1. Decide based on the file size (we have already read the length into the IFileInfo) whether InputStream.transferTo() (used since [FS] Use InputStream.transferTo() instead of custom transfers #1475) or Files.copy() should be used.
  2. Call the native method CopyFileExW() or CopyFile() directly.
    It would then probably be best to add a new method NativeHandler.copyFileContent() so that it can be implemented on all platforms.

Maybe the latter is even more performant for small files, since we would save the source/target attribute reads (this has to be checked), but it is more work and I don't think I have the time to land it for the upcoming release.
So my suggestion is to go with option 1 for now (sketched below) and to try out option 2 later.
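
To illustrate option 1, a hedged sketch (the 1 MiB threshold and the copyFileContent name are assumptions for illustration, not the actual patch):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SizeDispatchedCopy {
	// Illustrative cutoff; the real value would have to be determined by benchmarks.
	private static final long LARGE_FILE_THRESHOLD = 1024 * 1024; // 1 MiB

	static void copyFileContent(Path source, Path target, long sourceLength) throws IOException {
		if (sourceLength >= LARGE_FILE_THRESHOLD) {
			// Large files: Files.copy() wins despite its extra attribute reads
			Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
		} else {
			// Small files: a plain stream transfer avoids the Files.copy() setup cost
			try (InputStream in = Files.newInputStream(source);
					OutputStream out = Files.newOutputStream(target)) {
				in.transferTo(out);
			}
		}
	}
}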

@jukzi (Contributor) commented Aug 13, 2024

It would have been more useful if you had provided more details about your benchmarks. Otherwise it's just a statement nobody can discuss.

I used a JMH benchmark:

package benchmarks.file;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.attribute.BasicFileAttributes;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import com.google.common.io.Files;

@BenchmarkMode(org.openjdk.jmh.annotations.Mode.SingleShotTime)
@OutputTimeUnit(java.util.concurrent.TimeUnit.MILLISECONDS)
@State(org.openjdk.jmh.annotations.Scope.Thread)
@Fork(value = 0, jvmArgsAppend = { "-ea" })
@Warmup(time = 50, timeUnit = java.util.concurrent.TimeUnit.MILLISECONDS)
@Measurement(batchSize = 1, iterations = 100)
public class FileCopy {
	@Param({ "100" })
	int files;
	@Param({ "5000", "50000", "500000" })
	int size;
	@Param({ "out%03d.tmp" })
	String fileName;
	String dirName = "tmp/"; // assumes ./tmp exists
	java.io.File f[];

	byte[] data1;

	@Setup(org.openjdk.jmh.annotations.Level.Iteration)
	public void setup() throws IOException {
		f = new java.io.File[files];
		for (int i = 0; i < files; i++) {
			f[i] = new java.io.File(dirName + String.format(fileName, i)).getAbsoluteFile();
			f[i].delete();
		}
		// f[0] is the source file, filled with random content once per iteration
		data1 = randomBytes(size);
		Files.write(data1, f[0]);
	}

	public static byte[] randomBytes(int length) {
		byte[] d = new byte[length];
		for (int i = 0; i < d.length; i++) {
			d[i] = (byte) (Math.random() * 256);
		}
		return d;
	}

	@TearDown
	public void cleanup() {
		for (int i = 0; i < files; i++) {
			f[i].delete();
		}
	}

	// Variant 1: NIO Files.copy() including attribute copying
	@Benchmark
	public void copy() throws IOException {
		for (int i = 1; i < files; i++) {
			java.nio.file.Files.copy(f[0].toPath(), f[i].toPath(), java.nio.file.StandardCopyOption.COPY_ATTRIBUTES,
					java.nio.file.StandardCopyOption.REPLACE_EXISTING);
		}
	}

	// Variant 2: stream transfer plus a manual timestamp copy, to roughly
	// match what COPY_ATTRIBUTES does in the other variant
	@Benchmark
	public void copyStream() throws IOException {
		for (int i = 1; i < files; i++) {
			try (FileInputStream in = new FileInputStream(f[0]);
					FileOutputStream out = new FileOutputStream(f[i])) {
				in.transferTo(out);
			}
			BasicFileAttributes att = java.nio.file.Files.readAttributes(f[0].toPath(), BasicFileAttributes.class);
			f[i].setLastModified(att.creationTime().toMillis());
		}
	}

	public static void main(String[] args) throws RunnerException {
		Options opt = new OptionsBuilder().include(FileCopy.class.getSimpleName()).shouldFailOnError(true).build();
		new org.openjdk.jmh.runner.Runner(opt).run();
		// originally run through a custom result formatter:
		// helper.ColumResultFormat.run(opt);
	}
}

[image: JMH benchmark results]

I doubt that copying a 1GB file is the standard use case that we should optimize for. Based on the workspaces I know, resources typically have a size of 5 kB to 500 kB. However, as you can see in my benchmark results, the copy time depends only slightly on the actual file size in that range. It's just a fast memory copy anyway, while opening the file in the filesystem is the most costly part on my computer.

Option 3: just revert your changes.

@HannesWell (Member, Author)

I doubt that copying a 1GB file is the standard use case that we should optimize for. Based on the workspaces I know, resources typically have a size of 5 kB to 500 kB.

Smaller files are probably copied way more often.

However, as you can see in my benchmark results, the copy time depends only slightly on the actual file size in that range. It's just a fast memory copy anyway, while opening the file in the filesystem is the most costly part on my computer.

In general there are a lot of variables that make it hard to make solid statements about the performance of file IO. There are just too many factors: Does one use a hard-disk drive, a SATA SSD, or an NVMe SSD? Which OS and which filesystem are used? E.g. on Windows, the activity of antivirus software or the Windows Search indexer can temporarily slow down IO operations drastically (which can make benchmarking a challenge).
And at least Files.copy() can perform differently if the copy target already exists, because the target is then deleted explicitly, which costs time again.

Option 3: just revert your changes.

That's an option, but I would be in favor of a solution that gives us the best of both worlds:

I'll provide a follow-up later with an implementation that calls native copy methods. But completing it is something for the next release cycle.
