documented chunk size is too large #1
It's not really a good idea to set the chunk size to 1 GB, because git-annex currently has to buffer a whole chunk of a file in memory. So that setting could make git-annex use 1 GB of memory or more.

http://git-annex.branchable.com/chunking/ documents this, and suggests something in the 1 MiB range for chunks. That is partly due to memory use, and partly because a small chunk size minimizes the amount of redundant data transferred when resuming an interrupted upload or download: resuming happens at chunk granularity, so a partially transferred chunk has to be re-sent in full.

If rclone supports resuming partial uploads and downloads of large files, it might make sense to pick a larger chunk size, since the resume concern wouldn't matter. The memory usage would still make 1 GB too large for chunks.
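A rough back-of-the-envelope comparison of the two costs, using a hypothetical 10 GiB file (illustrative numbers only, not from the thread):

```
# Trade-off sketch for a hypothetical 10 GiB file (illustrative only).
file_mib=10240                                 # 10 GiB expressed in MiB

# Number of chunks to store/transfer per chunk size:
echo "chunks at 1 MiB:  $(( file_mib / 1 ))"   # 10240 chunks
echo "chunks at 50 MiB: $(( file_mib / 50 ))"  # 204 chunks (~50x fewer)

# git-annex buffers one whole chunk in memory, and an interrupted
# transfer restarts at a chunk boundary, so roughly:
#   peak buffer memory       ~= chunk size
#   expected waste on resume ~= half a chunk
echo "resume waste at 1 MiB:  ~0.5 MiB"
echo "resume waste at 50 MiB: ~25 MiB"
```

Larger chunks cut per-chunk overhead linearly, at the price of more buffered memory and more re-transferred data on each resume, which is exactly the tension in the exchange below.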
Joey, thanks for git-annex and for the feedback. I have actually run into this issue: I had to adjust a VM's memory allocation to accommodate brief periods of 3-5 GB of memory usage (during the periods when git-annex is preparing a chunk for upload). In retrospect, that's probably not a sensible configuration for most users. There are a few issues here:

I think 1 MiB is far too small. For an archive with large files, a very large number of chunks is required quite quickly, which is a significant performance hit. In my view the happy medium is probably closer to 50 MiB or 100 MiB: that would be literally 50x or 100x less per-chunk overhead, at the cost of tens or hundreds of megabytes of RAM. In an ideal world, I think git-annex would use a combination of variable chunk sizes and pack files to (1) hide the sizes of files and (2) optimize interaction with remotes. To use a contemporary example, the Panama Papers have been described in the press as a multi-TB data set with tens or hundreds of thousands of individual files. With git-annex's current design, I think it would be relatively trivial to identify users in possession of this dataset simply by looking at their files' chunk sizes, even if the aggregate dataset size was not a match (i.e., they also had other files in their repo).

With all that said: do you think a documented default of 50 MiB might be a better choice than 1 MiB? Or are there use cases I'm not adequately considering?

50 MiB sounds like a better choice for git-annex-remote-rclone. But it … I do think I could probably make git-annex not buffer the chunks in memory. (If rclone could be used as a library, http connections could be reused …) (I've considered adding padding of small chunks to get all chunks the same size …)

see shy jo
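For reference, the chunk size under discussion is the one chosen when the special remote is initialized. A minimal sketch, assuming a 50 MiB default; `myremote` and `myrclone` are placeholder names, not from the thread:

```
# Sketch: initializing the rclone special remote with a 50 MiB chunk size.
# "myremote" and "myrclone" are placeholder names for this example.
git annex initremote myremote \
    type=external externaltype=rclone \
    target=myrclone prefix=git-annex \
    chunk=50MiB \
    encryption=shared
```

The `chunk=` setting can also be adjusted later with `git annex enableremote`; per the git-annex chunking documentation, the new size applies to content stored after the change.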