WIP: allow symlinking and hardlinking files instead of just copying #409
This allows users to manage collections of large WARC files without duplicating space. Hardlinks are used instead of symlinks to reflect the original mechanism, where the file is copied (so it can be safely removed from the source). If we used symlinks, we would break that expectation which could lead to data loss. Inversely, hardlinks can lead to data loss as well. For example, pywb could somehow edit the file, which would modify the original as well. But we assume here pywb does not modify the file, and each side of the hardlink can have their own permissions to ensure this (or not) as well. Closes: webrecorder#408
This is WIP because I haven't worked on the docs or tests yet, as I want feedback on the idea first. Furthermore, tests fail here, but that's unrelated to this patch: they've been failing on master since at least 08b0ac8. Advice on which docs to update and insights on the test suite would be very much welcome as well. |
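The trade-off described in the PR description can be checked with a small standalone sketch. It does not use pywb at all; the function name `demo_link_semantics` and the file names are made up for illustration, and it assumes a POSIX-like filesystem where both `os.link` and `os.symlink` are permitted.

```python
import os
import tempfile


def demo_link_semantics(workdir):
    """Show what survives when the source of a hardlink/symlink is deleted."""
    source = os.path.join(workdir, "source.warc.gz")
    hard = os.path.join(workdir, "hard.warc.gz")
    soft = os.path.join(workdir, "soft.warc.gz")

    with open(source, "wb") as fh:
        fh.write(b"original")

    os.link(source, hard)     # a second name for the same file data
    os.symlink(source, soft)  # a pointer to the *path* of the source

    # Writing through the hardlink also changes what the source name sees.
    with open(hard, "ab") as fh:
        fh.write(b"+edit")
    with open(source, "rb") as fh:
        source_after_edit = fh.read()

    # Simulate the user cleaning up the original location.
    os.remove(source)

    return {
        "source_after_edit": source_after_edit,
        "hardlink_readable": os.path.exists(hard),
        "symlink_readable": os.path.exists(soft),      # follows the link
        "symlink_still_present": os.path.islink(soft),  # the dangling link itself
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_link_semantics(tmp))
```

The hardlink keeps the data readable after the source name is removed, which matches the "safe to delete the source" expectation of a copy, while the symlink is left dangling. The same run also shows the risk named above: an edit made through one hardlinked name is visible through the other.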
Hey, if your WARC files are already present somewhere else in a structured way, you might have success with configuring multiple archive paths, see https://pywb.readthedocs.io/en/latest/manual/configuring.html#archive-paths |
The problem with that approach is that this expects a certain layout in the filesystem. Right now WARC files are stored like this:
Where:
This layout is commonly used by grab-site and archivebot and other crawlers. It does not match the expectations of pywb, including custom archive paths, which still look in So I don't think that's a sufficient solution to my problem. |
I can't speak for the technical implementation (although it looks good to me), but I'd definitely appreciate anything that allows using existing WARCs – whatever file structure they may be in – in pywb without having to make any copies. Collections of WARCs can be huge, and storing them twice is wasteful and in some cases not even possible due to space constraints. Of course, it would be possible to create a separate directory with the expected structure, make hardlinks or symlinks in there, and then use that as another archive path. But from a usability perspective, it would be much better in my opinion if pywb/wb-manager simply had an option that can take care of that directly (like the one proposed in this PR) instead of having to do that manually or with helper scripts. |
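The manual workaround mentioned in this comment (a separate directory populated with links) could look roughly like the sketch below. It is not part of pywb or wb-manager; `link_warcs` is a hypothetical helper, the destination directory is whatever the collection expects, and it assumes source and destination live on the same filesystem (a requirement for hardlinks) with the destination outside the source tree.

```python
import os


def link_warcs(source_root, archive_dir):
    """Hardlink every *.warc / *.warc.gz found under source_root into archive_dir.

    Returns the sorted list of file names that were linked.
    """
    os.makedirs(archive_dir, exist_ok=True)
    linked = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in sorted(filenames):
            if not name.endswith((".warc", ".warc.gz")):
                continue
            target = os.path.join(archive_dir, name)
            if os.path.exists(target):
                # Skip name collisions instead of overwriting an existing entry.
                continue
            os.link(os.path.join(dirpath, name), target)
            linked.append(name)
    return sorted(linked)
```

The collection would still need to be indexed after the links are created; this sketch only covers getting the files into place without a second copy on disk.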
@anarcat thanks for the PR. I believe the appropriate docs section that would need updating is Still thinking about how to test this because we will also need to support this functionality on Windows. It may also be a good idea to add moving WARCs to the expected locations. /cc @ikreymer |
Thanks for suggesting this, I agree that this should be supported, but not sure that symlinking/hardlinking is the way to go. As @N0taN3rd mentioned, this would complicate Windows support and would potentially make the setup more brittle. pywb is close to supporting what you want with external paths, but unfortunately it's not automated. It seems like the best option would be to support per-collection overrides, say collections/external-data/overrides.yaml:
Then, instead of the local Of course, the overrides.yaml could also be managed with something like:
If you wanted to add only a specific WARC file in a directory instead of all WARCs, that too can be supported by specifying the file path instead of a directory (eg. |
Where would the CDX files be stored with that setup? Regarding Windows support, is that a hard requirement or could that feature simply be unavailable if not supported by the OS, the Python version, or the configuration (only admins can create symlinks on Windows according to the Python docs)? |
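The "unavailable if not supported" idea raised here could be sketched as a link attempt that falls back to a plain copy. This is an illustration only, not the code in this PR: `place_warc` and its `prefer_link` parameter are hypothetical names, and the exact exceptions raised when linking is unsupported vary by platform and filesystem.

```python
import os
import shutil
import tempfile


def place_warc(source, target, prefer_link=True):
    """Put source at target; return "hardlink" or "copy" to say which happened."""
    if os.path.exists(target):
        raise FileExistsError(target)
    if prefer_link:
        try:
            os.link(source, target)
            return "hardlink"
        except (OSError, NotImplementedError):
            # e.g. cross-device link, filesystem without hardlinks,
            # or missing privileges: fall through to a regular copy.
            pass
    shutil.copy2(source, target)
    return "copy"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "a.warc.gz")
        with open(src, "wb") as fh:
            fh.write(b"data")
        print(place_warc(src, os.path.join(tmp, "b.warc.gz")))
```

With this shape, the existing copy behavior stays the default on every platform, and linking becomes an optimization that is silently skipped where the OS or configuration does not allow it. Whether a silent fallback or an explicit error is preferable is a design choice the thread leaves open.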
By default, still in the
Ideally, the external directory feature would be available on all platforms. |
what I hear from the various comments so far is this:
I understand where you are coming from: multi-platform compatibility is important, and there are existing features which might fit this requirement.
however, i would argue that "a bird in the mouth is better than two in the bush": I have a working patch to work around a real scalability issue with pywb, right now. it might not work that effectively on Windows, but I want to point out that both proposals to automate editing of the YAML file seem to be an entirely different approach, one that would require many more changes to the documentation and seems to me like feature creep. i just want to copy files lightly in the archive, not redesign how the entire YAML configuration system works! :)
if this is the approach you want to take, I'm not sure I can help, since I would need to dive deeper into the internals of pywb again, which might mean this would never be done at all. ;)
so to move this forward, I would propose that we keep following the approach I proposed here. this would mean adding tests for the functionality and documentation. i would be happy to push that forward if the proposal is accepted; otherwise I'm afraid I won't be able to provide a solution to #408 myself going forward.
have a nice day!
PS: the travis test failure here does not seem related to the patch, you might want to look into that... i'll re-trigger the build to see if it works better now. |
Codecov Report
@@            Coverage Diff             @@
##           master     #409      +/-   ##
==========================================
- Coverage   88.04%   87.89%   -0.16%
==========================================
  Files          59       59
  Lines        7227     7235       +8
  Branches     1286     1288       +2
==========================================
- Hits         6363     6359       -4
- Misses        570      575       +5
- Partials      294      301       +7
|
I'm +1 to adding this.
hi @anarcat apologies for the delay -- was away on vacation and neglected to respond earlier! I reread your comment and thought about it more, and have considered our current work. Since we don't have the bandwidth to implement the config-based approach now, and it could still be done at a later time, I think you're right and we should add this solution, even if it is not cross-platform as it will help your (and possibly others') use cases. This solution is a simple change to the wb-manager while the config option would be a much more extensive change, as you've mentioned. To proceed, could you add:
And we'll try to merge it in for next release! Thanks again! (Yes, the travis-ci issue is/was unrelated, we're looking at that) |
awesome! i'm not sure I'll have time to do this before the next year (and I don't mind at all if someone else beats me to it), but hopefully I'll be able to come back to this soon-ish. |
Any progress with this? It's something I'm finding problematic at the moment. |