documented chunk size is too large #1

Closed
joeyh opened this issue Apr 27, 2016 · 2 comments


joeyh commented Apr 27, 2016

It's not really a good idea to set the chunk size to 1 GB, because git-annex currently has to buffer a whole chunk of a file in memory. So that could make git-annex use 1 GB of memory or more.

http://git-annex.branchable.com/chunking/ documents this, and suggests something in the 1 MB range for chunks - partly due to memory use, and partly because smaller chunks minimize the amount of redundant data transferred when resuming an upload or download.

If rclone supports resuming partial uploads and downloads of large files, it might make sense to pick a larger chunk size, since the latter concern wouldn't matter. The memory usage would still make 1 GB too large for chunks.
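
(For anyone reading along: the chunk size is fixed when the special remote is configured. A sketch of an initremote invocation in the ~1 MiB range the chunking page suggests - the remote name, target, and prefix below are placeholders, not anything from this issue:)

```sh
# Sketch only: rclone special remote with ~1 MiB chunks.
# "myrclone", "remote", and "git-annex" are placeholder values.
git annex initremote myrclone type=external externaltype=rclone \
    target=remote prefix=git-annex chunk=1MiB encryption=shared
```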

@DanielDent
Member

Joey,

Thanks for git-annex and for the feedback.

I have actually run into this issue. I had to adjust a VM's memory allocation to accommodate brief periods of 3-5 GB of memory usage (during the periods when git-annex is preparing a chunk for upload). In retrospect that's probably not a sensible configuration for most users.

There are a few issues here:

  1. rclone is not yet particularly efficient at copying a single file (see rclone/rclone#422, "improve efficiency of copy command for single files") - the authors are aware and are working on a fix.
  2. Even if rclone only required a single POST request per chunk, the RTTs to set up the TCP connection and TCP slow start would mean a lot of time spent at less than optimal throughput.
  3. During a 'drop', unless the repo is set fully trusted, git-annex is going to want to verify the continued presence of each of the chunks. This means a few RTTs for each chunk.

I think 1 MiB is far too small. For an archive with large files, the number of chunks grows quickly, and that becomes a significant performance hit. In my view the happy medium is probably closer to 50 MiB or 100 MiB - roughly 50x or 100x less per-chunk overhead (at the cost of tens or hundreds of megabytes of RAM).
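
To put rough numbers on that, here's a back-of-the-envelope chunk count for a hypothetical 20 GiB file (the file size is just an illustration, not from any real repo):

```sh
# Back-of-the-envelope: chunks needed for a hypothetical 20 GiB file
size=$(( 20 * 1024 ))                        # file size in MiB
for chunk in 1 50 100; do                    # candidate chunk sizes in MiB
    echo "${chunk} MiB chunks: $(( (size + chunk - 1) / chunk ))"
done
# -> 20480, 410, and 205 chunks respectively
```

Every per-chunk round trip (connection setup on upload, presence check on drop) scales with that count, which is where the 50x/100x figure comes from.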

In an ideal world I think git-annex would use a combination of variable chunk sizes and pack files to (1) hide the size of files, and (2) optimize interaction with remotes. To use a contemporary example, the Panama Papers have been described in the press as a multi-TB data set with tens or hundreds of thousands of individual files. With git-annex's current design, simply by looking at a user's file chunk sizes, I think it would be relatively trivial to identify users in possession of this dataset - even if the aggregate dataset size was not a match (i.e. they also had other files in their repo).

With all that said - do you think a documented default of 50 MiB might be a better choice than 1 MiB? Or are there use cases I'm not adequately considering?


joeyh commented Apr 27, 2016

(If rclone could be used as a library, HTTP connections could be reused to
avoid TCP slow start. That's what git-annex does for S3 and WebDAV.)

50 MiB sounds like a better choice for git-annex-remote-rclone. But it would
be worth mentioning the tradeoffs or linking to
https://git-annex.branchable.com/chunking/
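
(If the documented default does change, and if I'm remembering the chunking
behavior right, an existing remote can be switched over with enableremote;
content stored before the change keeps its old chunk size:)

```sh
# Sketch: switch an existing rclone remote ("myrclone" is a placeholder
# name) to 50 MiB chunks for newly stored content
git annex enableremote myrclone chunk=50MiB
```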

I do think I could probably make git-annex not buffer the chunks in
memory in this case. Opened a todo
https://git-annex.branchable.com/todo/upload_large_chunks_without_buffering_in_memory/

(I've considered adding padding of small chunks to get all chunks the
same size; varying chunk sizes might also obscure total file size some,
but attackers could do many things to correlate related chunks and so
get a good idea of file sizes.)

see shy jo
