Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add directory transfer support for SFTPOperator #44126

Merged
merged 11 commits into from
Dec 28, 2024
Merged

Conversation

Dawnpool
Copy link
Contributor

@Dawnpool Dawnpool commented Nov 18, 2024

This PR implements directory transfer for SFTPOperator, related to the issue I have raised before.
Currently, the SFTPOperator only accepts file paths, and you have to specify every filename in a folder by list when transferring an entire folder.
By adding some directory handling logic, the operator now can accept directory paths as well, allowing users easily transfer entire folders.

Copy link

boring-cyborg bot commented Nov 18, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

@kunaljubce
Copy link
Contributor

@Dawnpool You seem to have a whole test non-db test suite failing. Can you check it on your local breeze and see if you can fix it?

@Dawnpool
Copy link
Contributor Author

@Dawnpool You seem to have a whole test non-db test suite failing. Can you check it on your local breeze and see if you can fix it?

Hi, I guess there was an issue with the test code at the time I forked the repository. It might have been resolved by merging the main branch. There is no problem on my local breeze environment for now.

@kunaljubce
Copy link
Contributor

@Dawnpool You seem to have a whole test non-db test suite failing. Can you check it on your local breeze and see if you can fix it?

Hi, I guess there was an issue with the test code at the time I forked the repository. It might have been resolved by merging the main branch. There is no problem on my local breeze environment for now.

Let's wait for one of the maintainers to approve the remaining workflows. That will give you some clarity if rebasing against main fixed it. Feel free to drop a gentle reminder on the Slack channel - #contributors.

@potiuk
Copy link
Member

potiuk commented Nov 28, 2024

Let's wait for one of the maintainers to approve the remaining workflows. That will give you some clarity if rebasing against main fixed it. Feel free to drop a gentle reminder on the Slack channel - #contributors.

Approved workflows.

@potiuk
Copy link
Member

potiuk commented Nov 28, 2024

errrors, errors everywhere :D

@kunaljubce
Copy link
Contributor

@Dawnpool Try reproducing your CI tests locally and figuring out what's going wrong. Seems this is a good place to start - https://github.com/apache/airflow/blob/main/dev/breeze/doc/ci/08_running_ci_locally.md

@eladkal
Copy link
Contributor

eladkal commented Dec 19, 2024

PR has conflicts that needs to be resolved

@Dawnpool
Copy link
Contributor Author

@potiuk
The non-db tests keep failing and it doesn't seem related to the changes I made based on the error logs.
I noticed your comment(#44625 (comment)) on a recent approved PR with the same error, where you mentioned that it was an unrelated intermittent error. Is there an ongoing issue with the non-db tests?

@eladkal
I have resolved the conflicts. Please feel free to check!

@potiuk
Copy link
Member

potiuk commented Dec 23, 2024

I think it's a side effect of bad configuration - likely some initialization issue in some of the skipped provider tests that are missing somewhere. Likely it is mitigated when full tests are run as some other tests are intitializing SQL Alchemy (where we completely should not need it for DB Tests). I will take a look at it later today and try to find out what it is (And it would be great to not merge it till then as we might be eble to see if the problem is fixed when I find it.

@potiuk
Copy link
Member

potiuk commented Dec 27, 2024

Can you please rebase/resolve conflicts and ping me if it fails again.

@Dawnpool
Copy link
Contributor Author

@potiuk
it failed again😥 same error with the non-db tests

@potiuk
Copy link
Member

potiuk commented Dec 27, 2024

😱

@potiuk
Copy link
Member

potiuk commented Dec 27, 2024

I think this should fix it: #45244

@potiuk potiuk requested a review from XD-DENG as a code owner December 27, 2024 13:58
@potiuk potiuk requested a review from ashb as a code owner December 27, 2024 13:58
@potiuk
Copy link
Member

potiuk commented Dec 27, 2024

Let's see - I added the fix on top and we should see if the problem is fixed.

@potiuk
Copy link
Member

potiuk commented Dec 27, 2024

Nope - there is another issue

@potiuk
Copy link
Member

potiuk commented Dec 27, 2024

All right. I think I found it #45249 - very interesting issue (and your change accidentally revealed it because it selectively run non-db microsoft.azure test collection as the first group of tests to run - and it turned out that those tests relied on a side-effect of other tests being run before.

@potiuk
Copy link
Member

potiuk commented Dec 27, 2024

OK. The fix is merged. You can rebase again and I 🤞 that it should work now.

@potiuk potiuk merged commit 51fea3e into apache:main Dec 28, 2024
64 checks passed
Copy link

boring-cyborg bot commented Dec 28, 2024

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@potiuk
Copy link
Member

potiuk commented Dec 28, 2024

Finally!

@Dawnpool
Copy link
Contributor Author

Finally! Thank you all😁

jason810496 pushed a commit to jason810496/airflow that referenced this pull request Dec 28, 2024
* Add directory put and get functions for sftp provider

* Add test code

* Add directory exists check

* Fix merge conflict

* Add path exists check
@Dev-iL
Copy link
Contributor

Dev-iL commented Dec 28, 2024

Thank you for your contribution @Dawnpool!

In this implementation, files are downloaded sequentially, right? If so - when downloading folders containing a large number of small files, this approach will be quite inefficient. I think we should consider expanding the method, or adding another one that downloads files concurrently.

Possibly related to SFTPHookAsync.

@Dawnpool
Copy link
Contributor Author

Dawnpool commented Dec 31, 2024

Hi @Dev-iL!
Yes, just as you said, files are downloaded sequentially since it simply uses multiple calls to the retrieve_file function in the order file names are collected.
I totally agree with your point this approach would be inefficient in the case you said.
It would be awesome if we could download multiple files concurrently, but I'm not sure what kind of approach to take (probably multi-threading?) and we should definitely do performance checks to compare after implementing the method.

LefterisXefteris pushed a commit to LefterisXefteris/airflow that referenced this pull request Jan 5, 2025
* Add directory put and get functions for sftp provider

* Add test code

* Add directory exists check

* Fix merge conflict

* Add path exists check
agupta01 pushed a commit to agupta01/airflow that referenced this pull request Jan 6, 2025
* Add directory put and get functions for sftp provider

* Add test code

* Add directory exists check

* Fix merge conflict

* Add path exists check
HariGS-DB pushed a commit to HariGS-DB/airflow that referenced this pull request Jan 16, 2025
* Add directory put and get functions for sftp provider

* Add test code

* Add directory exists check

* Fix merge conflict

* Add path exists check
got686-yandex pushed a commit to got686-yandex/airflow that referenced this pull request Jan 30, 2025
* Add directory put and get functions for sftp provider

* Add test code

* Add directory exists check

* Fix merge conflict

* Add path exists check
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants