Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solve issue #282: Update FSSpecFileLister, IoPathFileLister #383

Closed
wants to merge 5 commits into from

Conversation

ykabusalah
Copy link
Contributor

Fixes #282

Summary

Updated FSSpecFileLister and IoPathFileLister to optionally take a list of strings as root paths. As per the contributors guideline, added tests to the testsuite to cover changes and ran the pre-commit format checker.

Changes

  • Updated FSSpecFileLister to take either a str or sequence of strings as root path argument (in _init_ arguments)

  • Updated _init_ of FSSpecFileLister to check if root argument is string or list and make into IterDataPipe

  • Updated _iter_ of FSSPecFileLister to look through all root paths when yielding files

  • Added unit test in test_fsspec.py to test for multiple root path input to FSSpecFileLister

  • Updated IoPathFileLister to take either a str or sequence of strings as root path argument (in _init_ arguments)

  • Updated _init_ of IoPathFileLister to check if root argument is string or list and make into IterDataPipe

  • Updated _iter_ of IoPathFileLister to look through all root paths when yielding files

  • Added unit test in test_local_io.py to test for multiple root path input to IoPathFileLister

@facebook-github-bot
Copy link
Contributor

Hi @ykabusalah!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 4, 2022
@facebook-github-bot
Copy link
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@ykabusalah
Copy link
Contributor Author

@NivekT one test failed for macos build but I see that it has failed on previous pull requests as well. Let me know if there is anything I need to do for this.

Thanks.

@NivekT
Copy link
Contributor

NivekT commented May 4, 2022

@NivekT one test failed for macos build but I see that it has failed on previous pull requests as well. Let me know if there is anything I need to do for this.

Thanks.

Don't worry about that one. I will have a look at the rest of your PR soon. Thanks!

@ykabusalah
Copy link
Contributor Author

@NivekT one test failed for macos build but I see that it has failed on previous pull requests as well. Let me know if there is anything I need to do for this.
Thanks.

Don't worry about that one. I will have a look at the rest of your PR soon. Thanks!

Awesome. Sounds good. Any chance you'd be able to take a look at it by tomorrow?

@facebook-github-bot
Copy link
Contributor

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this patch.

yield fs.protocol + "://" + abs_path
for root in self.datapipe:
fs, path = fsspec.core.url_to_fs(root)
is_local = fs.protocol == "file" or not root.startswith(fs.protocol)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind solving #348 as well?
You can basically check if fs.protocol is a list. If it is, we can run any(root.startswith(protocol) for protocol in fs.protocol)

continue

# ensure the file name has the full fsspec protocol path
if file_name.startswith(fs.protocol):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Comment on lines 87 to 88
if root.startswith(fs.protocol):
yield fs.protocol + "://" + abs_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here might need to be changed as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what way should this be changed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe @ejguan is referring to his other comment here:
#383 (comment)

It will be great if you can solve #348 as well in this PR, and when you do, the logic here will have to change as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if root.startswith(fs.protocol):
yield fs.protocol + "://" + abs_path
if isinstance(fs.protocol, list):
for protocol in fs.protocol:
if root.startswith(protocol):
yield protocol + "://" + abs_path
break
elif root.startswith(fs.protocol):
yield fs.protocol + "://" + abs_path

def test_io_path_file_lister_iterdatapipe_with_list(self):
datapipe = IoPathFileLister(root=[self.temp_sub_dir.name, self.temp_sub_dir_2.name])

# check all file paths within sub_folder are listed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: similar to my other comment about comparing sorted list instead

Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting the PR! Just a few small comments.

@ykabusalah
Copy link
Contributor Author

@NivekT @ejguan

This fixes issue #348 and updates Testing Logic

Summary

  • Updated FSSpecFileLister to check if protocol for root is a string or list. S3 returns a list, so this handles that return and updates logic.

  • Updated IO and FSSpec unit tests to sort and compare lists rather than looping and checking

Changes

  • Updated FSSpecFileLister to check if fs protocol is a list or a string

  • If it is a string, it puts it into a list and handles the rest of the logic of the FileLister as if it were a list

  • Uses Python's "any()" functionality to check conditions against the protocol list

  • Check the beginning of the root against all protocols and returns if it is a match, else returns the absolute path

  • Updated IO and FSSpec unit tests to sort and compare lists rather than looping and checking

@facebook-github-bot
Copy link
Contributor

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good to me in general. Two more comments below.

else:
protocol_list = fs.protocol

is_local = fs.protocol == "file" or not any(root.startswith(protocol) for protocol in fs.protocol)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean protocol_list, right?

… in a for loop to stop yielding values when one we wanted is found
@ykabusalah
Copy link
Contributor Author

@ejguan @NivekT

This new commit should address the remaining outstanding concerns: added a variable change to make it congruent with the rest of the changes as well as a break in a for loop to stop yielding values when we found the one we wanted.

Thanks.

@facebook-github-bot
Copy link
Contributor

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for working on this PR!

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for fixing these issues! LGTM with two nit comments below.

@facebook-github-bot
Copy link
Contributor

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend FSSpecFileLister and IoPathFileLister to support loading multiple directories
4 participants