documented chunk size is too large #1

Closed
joeyh opened this issue Apr 27, 2016 · 2 comments


joeyh commented Apr 27, 2016

It's not really a good idea to set the chunk size to 1 GB, because git-annex currently has to buffer a whole chunk of a file in memory. So that could make git-annex use 1 GB of memory or more.

http://git-annex.branchable.com/chunking/ documents this, and suggests something in the 1 MB range for chunks - partly due to memory use, and partly because smaller chunks minimize the amount of redundant data transferred when resuming an upload or download.

If rclone supports resuming partial uploads and downloads of large files, it might make sense to pick a larger chunk size, since the latter concern wouldn't matter. The memory usage would still make 1 GB too large for chunks.
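
(For anyone reading along: the chunk size is fixed when the special remote is configured. A sketch of an initremote invocation in the ~1 MiB range the chunking page suggests - the remote name, target, and prefix below are placeholders, not anything from this issue:)

```sh
# Sketch only: rclone special remote with ~1 MiB chunks.
# "myrclone", "remote", and "git-annex" are placeholder values.
git annex initremote myrclone type=external externaltype=rclone \
    target=remote prefix=git-annex chunk=1MiB encryption=shared
```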

@DanielDent
Member

Joey,

Thanks for git-annex and for the feedback.

I have actually run into this issue. I had to adjust a VM's memory allocation to accommodate brief periods of 3-5 GB of memory usage (during the periods when git-annex is preparing a chunk for upload). In retrospect that's probably not a sensible configuration for most users.

There are a few issues here:

  1. rclone is not yet particularly efficient at copying a single file (see rclone/rclone#422, "improve efficiency of copy command for single files") - the authors are aware and are working on a fix.
  2. Even if rclone only required a single POST request per chunk, the RTTs to set up the TCP connection and TCP slow start would mean a lot of time spent at less than optimal throughput.
  3. During a 'drop', unless the repo is set fully trusted, git-annex is going to want to verify the continued presence of each of the chunks. This means a few RTTs for each chunk.

I think 1 MiB is far too small. For an archive with large files, the number of chunks grows quickly, and that becomes a significant performance hit. In my view the happy medium is probably closer to 50 MiB or 100 MiB - roughly 50x or 100x less per-chunk overhead (at the cost of tens or hundreds of megabytes of RAM).
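
To put rough numbers on that, here's a back-of-the-envelope chunk count for a hypothetical 20 GiB file (the file size is just an illustration, not from any real repo):

```sh
# Back-of-the-envelope: chunks needed for a hypothetical 20 GiB file
size=$(( 20 * 1024 ))                        # file size in MiB
for chunk in 1 50 100; do                    # candidate chunk sizes in MiB
    echo "${chunk} MiB chunks: $(( (size + chunk - 1) / chunk ))"
done
# -> 20480, 410, and 205 chunks respectively
```

Every per-chunk round trip (connection setup on upload, presence check on drop) scales with that count, which is where the 50x/100x figure comes from.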

In an ideal world I think git-annex would use a combination of variable chunk sizes and pack files to (1) hide the size of files, and (2) optimize interaction with remotes. To use a contemporary example, the Panama Papers have been described in the press as a multi-TB data set with tens or hundreds of thousands of individual files. With git-annex's current design, simply by looking at a user's file chunk sizes, I think it would be relatively trivial to identify users in possession of this dataset - even if the aggregate dataset size was not a match (i.e. they also had other files in their repo).

With all that said - do you think a documented default of 50 MiB might be a better choice than 1 MiB? Or are there use cases I'm not adequately considering?


joeyh commented Apr 27, 2016

(If rclone could be used as a library, HTTP connections could be reused to
avoid TCP slow start. That's what git-annex does for S3 and WebDAV.)

50 MiB sounds like a better choice for git-annex-remote-rclone. But it would
be worth mentioning the tradeoffs or linking to
https://git-annex.branchable.com/chunking/
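
(If the documented default does change, and if I'm remembering the chunking
behavior right, an existing remote can be switched over with enableremote;
content stored before the change keeps its old chunk size:)

```sh
# Sketch: switch an existing rclone remote ("myrclone" is a placeholder
# name) to 50 MiB chunks for newly stored content
git annex enableremote myrclone chunk=50MiB
```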

I do think I could probably make git-annex not buffer the chunks in
memory in this case. Opened a todo
https://git-annex.branchable.com/todo/upload_large_chunks_without_buffering_in_memory/

(I've considered adding padding of small chunks to get all chunks the
same size; varying chunk sizes might also obscure total file size some,
but attackers could do many things to correlate related chunks and so
get a good idea of file sizes.)

see shy jo
