-
Notifications
You must be signed in to change notification settings - Fork 14.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve SFTP hook's directory transfer to use a single connection for multiple files #46582
Improve SFTP hook's directory transfer to use a single connection for multiple files #46582
Conversation
I also added the Hi @dabla , I think this might be related to your recent PRs. Please feel free to take a look. |
@Dawnpool really good catch, as indeed this will speed up the store_directory and retrieve_directory functions as only one connection will be created, but personally, and that's a personal opinion I don't like the wrapper that much even though it solves the performance issue. Wouldn't it be possible to refactor the get_conn method so that it caches the connection so when invoked a second time it just returns the cached instance and once it's finished removed the cached connection, then the code using the context manager stays as it was. This is how it could be implemented, I've tested it locally and it works:
The hook would have 3 new fields, namely self._ssh_conn, self._sftp_conn and self._conn_count. The for example in retrieve_directory:
Of course if we could replace the "dummy" usage of the self.get_conn() context manager as a decorator, then we would have best of both worlds and only annotate methods where we want that behaviour without changing the methods signature. To be clear if it's not possible, then I would stick with your proposed solution. WDYT? |
Hi @dabla , |
Hi, @dabla |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, looks good to me!
Nice, but can you please add some unit tests testing the behaviour (identity of the client). |
Good point @potiuk. You could indeed add some assertions within the existing tests to make sure connection gets opened/closed only once during the whole operation. |
file_path = os.path.join(root, file_name) | ||
new_remote_path = os.path.join(remote_full_path, os.path.relpath(file_path, local_full_path)) | ||
self.store_file(new_remote_path, file_path, confirm) | ||
with self.get_conn(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh no, I didn't realize it. I'll create a fix PR right away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries I forgot to warn you as we had to fix backward compatibility yesterday, my bad
This PR improves SFTP hook's
store_directory
andretrieve_directory
functions to use a single connection when transferring a directory with multiple files.Previously, these functions relied on
store_file
andretrieve_file
functions. And with this PR, thestore_file
andretrieve_file
functions were modified to open and close sftp connection each time.This leads to the
store_directory
andretrieve_directory
functions to open and close too many connections repeatedly when there are many files in a directory, which causes significant overhead.To address this, I modified them to open a connection, transfer all files in a directory, and then close the connection afterward.
I also did a performance test in my local environment by transferring a directory containing 1,000 small files. This reduced the transfer time by approximately 8-9 seconds. The results are shown below.