pack-generation MVP #67
Related to #53.

In pursuit of better control over pack generation, and also to pave the way for improved …
Now it's possible to step through parallel computations. However, the respective implementation has to expose …
Depending on where this is exposed, …
My intuition is to stop bubbling this up beyond …
It's a CPU-intensive operation; my first instinct would be to run it normally and use unblock or similar to run it on a blocking thread. Trying to structure the computation so that it happens incrementally seems incredibly painful, and in particular, trying to adapt an operation that happens in a thread to happen incrementally seems like it incurs all the pain of async without any language support for async. I would suggest building the initial MVP in a synchronous fashion, on the theory that it can still be run in a background thread and controlled via an async mechanism.

I definitely don't think it's OK to use a scoped thread and hide the unsafe, if the unsafe isn't truly encapsulated to the point that you can't do anything unsound with the interface.
One other thought on that front: compared to the cost of generating a pack, one or two allocations to set up things like channels or Arc will not be an issue.
Thanks for sharing. The main motivator for using scoped threads is to allow standard stack-based operation without any wrapping - it's not at all about allocations, merely about usability and the least surprising behaviour. Truth be told, I cannot currently imagine how traversal will play into static threads when Arcs are involved, especially along with traits representing an Object (an attempt to allow things like traversing directory trees). What I take away is the following:
I hope to overcome my 'writer's block' and just write the missing bits to be able to see the whole operation through and play with the parts more until I find a version of the API that feels right.
The …

This capability probably doesn't have to be removed just yet, as the machinery itself is exactly the same as already used in … However, it's possible to eventually provide a non-static variant of pack generation too, which works similarly to pack verification (it uses the non-static version of the machinery), by factoring out the parts that are similar.

Another argument for trying hard to make pack generation play well in an async context is certainly that it commonly happens as part of network interactions, like uploading a pack. Right now much of this is a little hypothetical, as actual code to prove it works nicely doesn't exist yet, but I am confident it will work as envisioned.

Finally, since both machines, static and non-static, are the same at their core, it should always be possible to return to the non-static one at very low cost should everything else fail.
On another note: I am also thinking about backpressure and back-communication. Backpressure is already present, as threads will block once the results channel is full. Back-communication should also be possible if the handed-in closures get access to another synchronized channel of sorts to tell them when to deliver the pack entries they have been working on. Such an algorithm would continuously work (probably until it can't meaningfully improve the deltas) until it is told to deliver what's there right now before continuing. Such a message could then be delivered in the moment somebody actually calls …

Even though the MVP will not do back-communication, I don't see why it shouldn't be possible to implement it. What's neat is that no matter how the machinery operates, the moment the iterator is dropped it will (potentially with some delay) stop working automatically.
The first break-through: pack files (base object only) can now be written from object ids. |
…and it's nice to see that overall, it's very flexible even though it may require acrobatics to get it in and out of Easy mode.
…we could also double-check the produced crc32 values, but assigning consecutive numbers seems like it's doing the job.
…ing… (#67) …which is what happens when counting objects for fetches where only changed objects should be sent.
…to avoid dealing with missing objects. It's still a good idea to handle these gracefully though; git itself seems to ignore them.
…for about 10% of performance, speeding up these lookups just a little bit.
…by implementing the few bits that are needed ourselves. This might open up future optimizations, if they matter. Controlling the nom parsers on a token by token basis via iterators probably already saves a lot of time, and these are used everywhere.
Instead, share it by reference; it's Sync after all. This issue was introduced when switching to a `Send + Clone` model instead of `Send + Sync`, to allow thread-local caches in database handles of all kinds.
…nting (#67) The efficiency of multi-threaded counting is low per core, and while some speedup might be desirable, one might not want to commit all cores to this amount of waste.
About opportunities for performance improvements

@pascalkuthe I have created a quick profile from running … My takeaways are as follows:
This is just a quick summary, and right now I am missing a dataset to compare git with … While at it, profiling …
gen-pack like plumbing command
Generate a pack using some selection of commits or possibly objects. Drives different kinds of iteration as well as ways of building a pack.
Progress

- `gixp pack-create` … `Easy*`
- `hash_hasher` for single-threaded counting, to see how it performs compared to the standard hash-set with hasher override.
- `prodash` … will cause slowdown.
- `ignore_replacements`. It's correct to ignore replacements during reachability traversal, i.e. pack … `pack-index-from-data`.

Command-lines
Out of scope

- reuse existing deltas - doing so would save a lot of time and avoid the need for implementing actual delta compression. One would only have to get the order of entries right to assure consistency. Without that it's not really usable.
- `gixp` subcommand to make the functionality available for stress testing and performance testing
- Machinery to produce object deltas.
User Stories
Interesting
Archive
Progress
- stream pack entries (base objects only) from an iterator of input objects
- write pack entries to a V2 pack
- clone-like - the other side has no objects
- fetch-like - the other side has some objects
- Handle tag objects - their pointee must be added to the set as well.
- Make it so that the total amount of objects to be written is known in advance
- Statistics about re-used objects and repacked ones, maybe more. This is critical to eventually avoid repacking most. `--statistics`
- restore chunk-ordering to allow multi-threaded generators to yield the same pack every time (chunks may be out of order)
- figure out why the pack hash in parallel mode is still not reproducible
- write object entries directly, without forcing a copy of their compressed data into an `output::Entry`
- re-use delta objects as well and write their offsets correctly.
- `gixp pack-create` (… `git rev-list`) (AsIs traversal is default here)
- Count objects performance improvement #167
- Improving counting performance #170
Performance Opportunities

- it would save a lot of time if we knew where the next entry begins.
  https://github.com/Byron/gitoxide/blob/34b6a2e94949b24bf0bbaeb169b4baa0fa45c965/git-odb/src/linked/find.rs#L52
Notes

packs in general

- `git_odb::compound::Db::locate(…)` … #66.

Pack generation

- These threads regularly pause their work to allow the stealing to happen safely, which might be why that doesn't scale? (actually that one does seem to scale, at least with the amount of cores I have; it's pack resolution that doesn't scale)