sig collect should resolve to individual files when loading a pathlist or a directory #3039

Closed
ctb opened this issue Feb 24, 2024 · 4 comments · Fixed by #3027

Comments

Contributor

ctb commented Feb 24, 2024

suppose you have a pathlist containing a bunch of zip files, and you run sig collect directly on the pathlist:

sourmash sig collect podar-ref-zip-list.txt -o podar-ref-zip-list.mf.csv -F csv

you end up with a manifest CSV file that has podar-ref-zip-list.txt for all of the internal_location values.

if, instead, you do:

sourmash sig collect $(cat podar-ref-zip-list.txt) -o podar-ref-zip-list.mf.csv -F csv

you end up with the individual files going into the internal_location values, which is preferable.

this is because sig collect overwrites the "true" location of sketches with the location they were loaded from on the command line. This is appropriate in many other situations but is not great when loading from directories or pathlists or (I suspect) standalone manifests.
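The difference between the two invocations can be sketched in Python: the second form hands sig collect each path listed in the file rather than the pathlist itself. This is only a rough illustration; `expand_pathlist` is a hypothetical helper (not part of sourmash), the zip names are made up, and it assumes one path per line.

```python
import os
import tempfile


def expand_pathlist(pathlist):
    """Return the non-empty lines of a pathlist file, one path per line.

    Hypothetical helper for illustration; not part of sourmash.
    """
    with open(pathlist) as fp:
        return [line.strip() for line in fp if line.strip()]


# Demo with a throwaway pathlist; the zip names are invented.
with tempfile.TemporaryDirectory() as tmpdir:
    pathlist = os.path.join(tmpdir, "podar-ref-zip-list.txt")
    with open(pathlist, "w") as fp:
        fp.write("a.zip\nb.zip\n\n")

    locations = expand_pathlist(pathlist)

# Roughly what `$(cat podar-ref-zip-list.txt)` places on the command line:
# the individual files, so each one can be recorded as its own location.
argv = ["sourmash", "sig", "collect", *locations,
        "-o", "podar-ref-zip-list.mf.csv", "-F", "csv"]
print(argv)
```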

related issues:

ctb changed the title from "sig collect should do resolve to individual files when loading a pathlist or a directory" to "sig collect should resolve to individual files when loading a pathlist or a directory" on Feb 24, 2024
Contributor Author

ctb commented Feb 24, 2024

the challenge is that we don't want to override the location when it is, in fact, not something directly resolvable - like when the internal_location is actually pointing at something inside a zip file.

or, to frame it in the terms @luizirber proposed over in #3008 (comment): we probably want to resolve to individual files when loading things from a file system storage, but not from a zip storage.

Contributor Author

ctb commented Feb 24, 2024

perhaps a different way to do this is to allow Index classes to specify whether their locations are resolvable at the file system level or not?
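One way this idea might look, as a minimal sketch: each index class carries a flag saying whether its internal locations can be opened directly from the file system, and the manifest-building code consults it. Everything here is assumed for illustration (the `Fake*` classes, the `locations_are_fs_resolvable` attribute, and `manifest_locations`); none of it is the actual sourmash Index API.

```python
class FakeIndex:
    """Stand-in for an Index class; NOT the sourmash API."""

    # Assumed attribute: can each internal location be opened
    # directly from the file system?
    locations_are_fs_resolvable = False

    def __init__(self, location, internal_locations):
        self.location = location                      # what was given on the command line
        self.internal_locations = internal_locations  # where each sketch "really" lives


class FakeDirectoryIndex(FakeIndex):
    locations_are_fs_resolvable = True   # members are real files on disk


class FakeZipIndex(FakeIndex):
    locations_are_fs_resolvable = False  # members live inside the zip file


def manifest_locations(idx):
    """Pick the location to record in the manifest for each sketch in idx."""
    if idx.locations_are_fs_resolvable:
        # Resolve to the individual files.
        return list(idx.internal_locations)
    # Otherwise keep the container the sketches were loaded from.
    return [idx.location] * len(idx.internal_locations)


dir_idx = FakeDirectoryIndex("sigs/", ["sigs/a.sig", "sigs/b.sig"])
zip_idx = FakeZipIndex("podar.zip", ["signatures/a.sig.gz", "signatures/b.sig.gz"])

dir_locations = manifest_locations(dir_idx)
zip_locations = manifest_locations(zip_idx)
print(dir_locations)
print(zip_locations)
```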

Contributor Author

ctb commented Feb 24, 2024

alternatively, support a different 'location' which is 'lowest possible file system resolvable path'...
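A rough sketch of what such a location could mean: starting from a full location, walk up the path until some prefix exists on disk as a regular file. The function name `lowest_fs_resolvable` and the combined "zip path plus member path" layout are assumptions made for this example, not anything sourmash defines.

```python
import os
import tempfile


def lowest_fs_resolvable(location):
    """Walk up from `location` until a path exists on disk as a regular file.

    Returns that path, or None if no prefix of `location` is a file.
    Hypothetical helper for illustration; not part of sourmash.
    """
    path = location
    while path:
        if os.path.isfile(path):
            return path
        parent = os.path.dirname(path)
        if parent == path:  # reached the file system root
            break
        path = parent
    return None


with tempfile.TemporaryDirectory() as tmpdir:
    # An empty file standing in for a zip collection.
    zip_path = os.path.join(tmpdir, "podar.zip")
    open(zip_path, "w").close()

    # A location pointing at something "inside" the zip resolves to the zip.
    inside = os.path.join(zip_path, "signatures", "a.sig.gz")
    resolved_inside = lowest_fs_resolvable(inside) == zip_path

    # A plain file resolves to itself.
    resolved_self = lowest_fs_resolvable(zip_path) == zip_path

    # A path with no existing file component resolves to nothing.
    missing = lowest_fs_resolvable(os.path.join(tmpdir, "nope", "x.sig"))

print(resolved_inside, resolved_self, missing)
```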

Contributor Author

ctb commented Mar 6, 2024

acshually... I think we should just advise people not to use pathlists and directories, and instead have them use --from-file if they need to pass in a lot of filenames. We should update the sig collect and sig check docs re this.

I don't think we should complicate the MultiIndex API to have it expose the filenames in any particularly clever way. That sounds complicated.
