-
Notifications
You must be signed in to change notification settings - Fork 364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sftp clients based on the Mina-sshd and JSCH components upload and download files at very different speeds #524
Comments
Cannot reproduce on Mac OS X, running an Apache MINA sshd SFTP client against an OpenSSH SFTP server running in an Alpine container.
|
I found that the write() method of the SftpOutputStreamAsync class seems to check every sftp packet; And then SftpRemotePathChannel has a transferTo() method that looks like it's aggregated internally to check. I am not sure if this is the case, and then I would like to ask the use of transforTo() or transforFrom(). Another scenario is how to handle the sftp ls command in the case of a large number of files or directories like transforTo()? |
On Windows, I see inconsistent timings, but indeed most of the time the file transfer with Apache MINA sshd appears to be about 50% slower than with JSch. |
Do we have plans to optimize this feature? I found that every time the data is sent, it needs to be confirmed, but jsch seems to confirm the same |
This is something else, and indeed I have an idea about providing asynchronous SFTP client interfaces. But this is unrelated to the question of why file transfer on Windows is slower than expected, but apparently not on OS X. Off-hand I don't see where the bottleneck might be; it should not be the ACK handling in SftpOutputStreamAsync: that one only checks ACKs that have already arrived. This may need extensive profiling or log analysis. |
This problem definitely is specific to Windows. It does not occur on OS X. (Didn't test Linux.) The problem occurs with both the NIO2 and the netty I/O back-end. (Didn't try MINA.) Performance with either back-end is the same. So it's not an I/O back-end problem. The problem also definitely is not in the SFTP layer. I performed and timed a direct file upload via the equivalent of
This uploads the file without SFTP, so there is zero overhead for the SFTP protocol and its ACKs. Timing this (repeatedly) showed no significant differences from normal uploads via SFTP. Which means that the bottleneck is not in the SFTP code. (Assuming that on the server side using "cat" is not slower than "internal-sftp". The server again is OpenSSH running in an alpine:latest container.) As mentioned before, the timings are fairly inconsistent from run to run, too. (For both JSch and Apache MINA sshd.) In some runs JSch is slow, too, and Apache MINA sshd is actually faster. In some runs, both are fast (and about equal). In most runs, though, Apache MINA sshd is noticeably slower. Might be a threading issue, or a memory issue. Or a channel window issue (though then I'd expect to see the problem also on OS X). But whatever it is, it isn't in the SFTP code. |
Thank you for your answer, I still need to learn more knowledge and then analyze this problem |
I run tests using a wide variety of methods, including transferFrom. I do not get consistently good timings with that either on Windows, but it is generally one of the faster methods. Which is actually a bit surprising, since as you may notice it uses |
Hi @tomaswolf. Could you please advise how I can optimise Mina setup (I use default one) to reach same performance as JSCH?
|
@ryabovgs : no, I cannot. If you read the comments in this issue, you see that currently I only know that there is a performance problem on Windows (and apparently Windows only), but I don't know yet why. I normally develop on OS X, and I have only temporary and sporadic access to a Windows machine. So it may be a while until I figure this out. |
Unroll the permutation loop of ChaCha20 and simplify the crypt() method to work on bytes instead of ints. Unroll and optimize the pack/unpack methods in ChaCha20 and in Poly1305. Don't use Integer.toUnsignedLong(): avoid the extra method call by doing (int & 0xFFFF_FFFFL) inline. In Poly1305.processBlock(), use ints instead of longs where possible (variables t0-t3). All this brings a speed-up of about 40% for the encryption/decryption. Of the time spent in ChaCha20-Poly1305 about one third is spent in the ChaChaEngine and two thirds in Poly1305.processBlock(). Bug: apache#524
For ByteArrayBuffer, optimize the getInt()/putInt() operations to work directly in the buffer without copying into/from a local work area. Simplify some of the operations; in particular, avoid extending bytes to long and then shifting. Bug: apache#524
In ChannelAsyncOutputStream, do not create a new SSH packet if the caller already had passed in an SSH packet buffer, and we're writing it fully. This avoids allocating a new buffer, which gives GC less to do, and it also avoid needlessly copying the data around. Instead the prepared SSH packet buffer is encoded and written directly. Bug: apache#524
Give SftpOutPutStreamAsync a transferFrom operation to be able to read directly into a prepared SSH packet uffer. Previously, we always had to read into a general byte buffer, and then copy into the SSH packet. This extra data copy can be avoided. If the copy buffer size is such that the data fits into a single SSH packet, then this packet buffer can be encoded and sent directly without any further data copying. Simplify the implementation: since the stream always uses a PacketBuffer the code paths for other general byte buffers can simply be removed. Also enable reading the next buffer while the previous one is being sent. Use two buffers alternatingly; one being sent, the other being filled. Use this implementation also in SftpRemotePathChannel in its transferFrom() implementation. _Do_ close the stream there, otherwise the final ACKs may not be checked. Bug: apache#524
Bouncy Castle apparently only has native implementations in its "LTS" releases. BC 1.78.1 has none. SunJCE uses native code. The "security registrar" architecture in Apache MINA sshd prefers registered providers over the platform providers, and it automatically registers Bouncy Castle if it's present. So with Bouncy Castle on the classpath and an SSH connection for which an AES cipher was specified, Apache MINA sshd would use the much slower Bouncy Castle AES. Add a new "SunJCEWrapper" registrar that brings the SunJCE AES and HmacSHA* to the front of the list, so that they are used even if Bouncy Castle is also registered. The new registrar can be disabled via a system property as usual, and it is only enabled if the JVM indeed has a "SunJCE" security provider registered. See also https://issues.apache.org/jira/browse/SSHD-719 (issue closed as "not needed"). Bug: apache#524
I found a couple of things that could be improved; for details see PR #530. While comparing out-of-the-box performance of JSch and sshd is legitimate, one needs to be aware that they are using different encryption algorithms by default. JSch uses the AES implementation from the built-in SunJCE provider (that's native code), sshd uses a Java implementation of the ChaCha20-Poly1305 cipher (if the server supports it, as does OpenSSH). So, three findings:
I think we basically have two kinds of reports:
Performance testing SFTP uploads is difficult. I used JMH benchmarks measuring the performance of only the upload itself (excluding SSH session setup, authentication, and setting up the SFTP channel), with a local JSch or sshd client uploading a 20MB random file to an OpenSSH server. To really benchmark this, one would need a lab setup with a controlled environment: a second machine running the SFTP server, and a local network connection with stable and known throughput and latency. I do not have such a setup. I used mostly a local container running the SFTP server, which of course introduced uncontrollable variation in timings. After all, the same machine then also runs the server, in a container, so there's unpredictable virtualization overhead. The timings I got from this were still surprisingly consistent and enabled me to find the above three problems. With PR #530 applied, JSch and sshd perform equally well in these tests for me. (Uploading 20MBs takes generally about 270-290ms on the machine I tested.) Just to double-check I also ran the same benchmarks with an AWS EC2 t4g.small instance for the OpenSSH server. Again, JSch and sshd with PR #530 applied performed equal -- uploading 20MB took 4-5 seconds, but there were some outliers with both JSch or sshd when it took 6-8, or 12, or once even 20 seconds. Of course, these uploads went over the general Internet (and had cross at least two firewalls, both beyond my control), so I cannot possibly say what happened in these outlier cases. There were about 3 such outlier per 15 upload attempts. In both setups I still have the feeling that JSch is a teeny-weeny bit faster (if both are using AES), maybe 2-3%, but that's a feeling only; the statistics don't really support that. Variability and standard deviation in the timings is too high. If the feeling is true, though, it might be due to the multi-threaded nature of sshd. Multi-threading makes a lot of sense for an SSH server, but for a client, the single-threaded approach of JSch has perhaps slightly less overhead. In any case I do not see any significant differences anymore. Feel free to run your own benchmarks on the code from PR #530 and to report the setups and results. |
Thank you very much for your help. I will try my best to use the code you provide to conduct some experiments and tests in the future, and then give you feedback |
The update() implementation copied _all_ bytes (successively) first into an internal 16-byte buffer and then processed that buffer. This is no needed if the input is long. Use the internal 16-byte buffer only for inputs shorter than 16 bytes, or if there is a leftover of less than 16 bytes at the end of a long input. In between process 16-byte chunks directly from the input byte array. For 32kB inputs this saves us some 2048 calls to System.arraycopy() copying all those 32kB. The speedup is minimal but noticeable in benchmarking. Bug: apache#524
Unroll the permutation loop of ChaCha20 and simplify the crypt() method to work on bytes instead of ints. Unroll and optimize the pack/unpack methods in ChaCha20 and in Poly1305. Don't use Integer.toUnsignedLong(): avoid the extra method call by doing (int & 0xFFFF_FFFFL) inline. In Poly1305.processBlock(), use ints instead of longs where possible (variables t0-t3). All this brings a speed-up of about 40% for the encryption/decryption. Of the time spent in ChaCha20-Poly1305 about one third is spent in the ChaChaEngine and two thirds in Poly1305.processBlock(). Bug: apache#524
For ByteArrayBuffer, optimize the getInt()/putInt() operations to work directly in the buffer without copying into/from a local work area. Simplify some of the operations; in particular, avoid extending bytes to long and then shifting. Bug: apache#524
In ChannelAsyncOutputStream, do not create a new SSH packet if the caller already had passed in an SSH packet buffer, and we're writing it fully. This avoids allocating a new buffer, which gives GC less to do, and it also avoid needlessly copying the data around. Instead the prepared SSH packet buffer is encoded and written directly. Bug: apache#524
Give SftpOutPutStreamAsync a transferFrom operation to be able to read directly into a prepared SSH packet uffer. Previously, we always had to read into a general byte buffer, and then copy into the SSH packet. This extra data copy can be avoided. If the copy buffer size is such that the data fits into a single SSH packet, then this packet buffer can be encoded and sent directly without any further data copying. Simplify the implementation: since the stream always uses a PacketBuffer the code paths for other general byte buffers can simply be removed. Also enable reading the next buffer while the previous one is being sent. Use two buffers alternatingly; one being sent, the other being filled. Use this implementation also in SftpRemotePathChannel in its transferFrom() implementation. _Do_ close the stream there, otherwise the final ACKs may not be checked. Bug: apache#524
Bouncy Castle apparently only has native implementations in its "LTS" releases. BC 1.78.1 has none. SunJCE uses native code. The "security registrar" architecture in Apache MINA sshd prefers registered providers over the platform providers, and it automatically registers Bouncy Castle if it's present. So with Bouncy Castle on the classpath and an SSH connection for which an AES cipher was specified, Apache MINA sshd would use the much slower Bouncy Castle AES. Add a new "SunJCEWrapper" registrar that brings the SunJCE AES and HmacSHA* to the front of the list, so that they are used even if Bouncy Castle is also registered. The new registrar can be disabled via a system property as usual, and it is only enabled if the JVM indeed has a "SunJCE" security provider registered. See also https://issues.apache.org/jira/browse/SSHD-719 (issue closed as "not needed"). Bug: apache#524
The update() implementation copied _all_ bytes (successively) first into an internal 16-byte buffer and then processed that buffer. This is no needed if the input is long. Use the internal 16-byte buffer only for inputs shorter than 16 bytes, or if there is a leftover of less than 16 bytes at the end of a long input. In between process 16-byte chunks directly from the input byte array. For 32kB inputs this saves us some 2048 calls to System.arraycopy() copying all those 32kB. The speedup is minimal but noticeable in benchmarking. Bug: apache#524
What ciphers were used? If you compare JSch with native AES against MINA sshd with ChaCha20-Poly1305, 10% is about to be expected. |
Unroll the permutation loop of ChaCha20 and simplify the crypt() method to work on bytes instead of ints. Unroll and optimize the pack/unpack methods in ChaCha20 and in Poly1305. Don't use Integer.toUnsignedLong(): avoid the extra method call by doing (int & 0xFFFF_FFFFL) inline. In Poly1305.processBlock(), use ints instead of longs where possible (variables t0-t3). All this brings a speed-up of about 40% for the encryption/decryption. Of the time spent in ChaCha20-Poly1305 about one third is spent in the ChaChaEngine and two thirds in Poly1305.processBlock(). Bug: apache#524
For ByteArrayBuffer, optimize the getInt()/putInt() operations to work directly in the buffer without copying into/from a local work area. Simplify some of the operations; in particular, avoid extending bytes to long and then shifting. Bug: apache#524
In ChannelAsyncOutputStream, do not create a new SSH packet if the caller already had passed in an SSH packet buffer, and we're writing it fully. This avoids allocating a new buffer, which gives GC less to do, and it also avoid needlessly copying the data around. Instead the prepared SSH packet buffer is encoded and written directly. Bug: apache#524
Give SftpOutPutStreamAsync a transferFrom operation to be able to read directly into a prepared SSH packet uffer. Previously, we always had to read into a general byte buffer, and then copy into the SSH packet. This extra data copy can be avoided. If the copy buffer size is such that the data fits into a single SSH packet, then this packet buffer can be encoded and sent directly without any further data copying. Simplify the implementation: since the stream always uses a PacketBuffer the code paths for other general byte buffers can simply be removed. Also enable reading the next buffer while the previous one is being sent. Use two buffers alternatingly; one being sent, the other being filled. Use this implementation also in SftpRemotePathChannel in its transferFrom() implementation. _Do_ close the stream there, otherwise the final ACKs may not be checked. Bug: apache#524
Bouncy Castle apparently only has native implementations in its "LTS" releases. BC 1.78.1 has none. SunJCE uses native code. The "security registrar" architecture in Apache MINA sshd prefers registered providers over the platform providers, and it automatically registers Bouncy Castle if it's present. So with Bouncy Castle on the classpath and an SSH connection for which an AES cipher was specified, Apache MINA sshd would use the much slower Bouncy Castle AES. Add a new "SunJCEWrapper" registrar that brings the SunJCE AES and HmacSHA* to the front of the list, so that they are used even if Bouncy Castle is also registered. The new registrar can be disabled via a system property as usual, and it is only enabled if the JVM indeed has a "SunJCE" security provider registered. See also https://issues.apache.org/jira/browse/SSHD-719 (issue closed as "not needed"). Bug: apache#524
The update() implementation copied _all_ bytes (successively) first into an internal 16-byte buffer and then processed that buffer. This is no needed if the input is long. Use the internal 16-byte buffer only for inputs shorter than 16 bytes, or if there is a leftover of less than 16 bytes at the end of a long input. In between process 16-byte chunks directly from the input byte array. For 32kB inputs this saves us some 2048 calls to System.arraycopy() copying all those 32kB. The speedup is minimal but noticeable in benchmarking. Bug: apache#524
In the ChaChaEngine, exploit peculiarities of its use in SSH: the counter is actually 32bit and never overflows, and the nonce is the SSH packet sequence number, also 32 bits, and wraps on overflow. As a result two of the ints of the engine state are always zero, and the handling of nonce and counter can be slightly simplified. In Poly1305Mac, inline the long multiplications. Avoid extensions from int to long for the precomputed values (r, s, k); store them as longs up front. For h, reduce the number of extensions from 25 to 5 by doing it once before the multiplications. As a side effect this part of the code also is nicer to read.
Streamline ACK checking a bit. Do a quick check for the OK case first and do the full and proper parsing only if the status is not OK. Also avoid getting the IDLE_TIMEOUT repeatedly when draining remaining ACKs when SftpOutputStreamAsync is closed.
The new maven project sshd-benchmarks contains JMH benchmarks. The project is not part of the binary distribution, and the install:install and deploy:deploy targets are skipped. The project is included in the source distribution. The benchmarks are intended to be run locally. Currently the project contains benchmarks for SFTP file uploads, which can be run either against a local container (if the docker engine is running), or against an external SFTP server. The test dependencies to JUnit and Mockito had to moved from the top-level POM to the individual project POMs. Having the JUnit dependency in the JMH sshd-benchmarks project causes compilation errors in Eclipse.
Single data points from single manually timed runs don't help anyone. In #579 I've published the JMH benchmarks I've been using. I do not see any significant speed differences in SFTP file uploads between JSch 0.2.18 and Apache MINA 2.14.0-SNAPSHOT (at current master commit 16b3078). If you run these benchmarks and you get significant performance differences, you will have to profile and analyze the problem yourself. It may be specific to your particular environment. If you find something and can improve it: we always welcome PRs. If you report any findings include in addition to the benchmark summary also the JVM version the benchmark was run on. Since I do not see any meaningful speed differences anymore I will only be able to help with further performance problems in simple SFTP uploads if you can provide a setup or test or benchmark that reproduces the problem in my environment. This issue will be closed once #579 is merged; for future problems feel free to open new issues. |
Thank you very much for your help. At present, the performance has indeed improved. I will contact you if there are other findings in the future |
The new maven project sshd-benchmarks contains JMH benchmarks. The project is not part of the binary distribution, and the install:install and deploy:deploy targets are skipped. The project is included in the source distribution. The benchmarks are intended to be run locally. Currently the project contains benchmarks for SFTP file uploads, which can be run either against a local container (if the docker engine is running), or against an external SFTP server.
Do we have any timeline by which the changes would delivered? |
@tomaswolf , is there a tentative timeline to deliver this fix at newer library version ? |
Phenomenon:
Uploading and downloading files with mina-sshd is about twice as slow as JSCH
Code:
Mina-sshd
Dependency
demo
JSCH
Dependency
demo
Question
Is there something wrong with my coding? Is there any configuration or correct encoding of mina-sshd so that my sftp function can reach the same speed level as jsch?
The text was updated successfully, but these errors were encountered: