Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore gix APIs, experiment with gix-blame API #1453

Merged
merged 16 commits into from
Dec 25, 2024

Conversation

cruessler
Copy link
Contributor

@cruessler cruessler commented Jul 13, 2024

This is a draft PR that is supposed to help me get familiar with gix’s APIs as well as to experiment with possible APIs for gix-blame. Any feedback that helps me not go down the wrong path, is much appreciated! :-)

Review Tasks

  • try gix blame for fun
    • not yet implemented: passing blame to more than one parent is not implemented yet on README.md :/
  • rewrite commits to have one per crate that is touched (simplify handling)
  • remove all todo!()
  • replace .unwrap() with error handling
  • Outcome with statistics
  • tracing spans for perf info
  • rough docs
  • sort out blamed_file and original_file once and for all.
  • CLI improvements
    • make it easier to use a worktree-path through a pathspec (and use that in gix merge-file as well)
    • avoid re-splitting of content, otherwise it's too easy to get the visualization wrong, or using the wrong content (use tokenized input)
  • review TODOs
  • sort out assertion failures

Performance Opportunities

  • Cache tree lookups - this can cut the amount of ODB lookups in half at least, with tree-lookups being the slowest. Maybe such a cache could just be added to gix-object directly so everybody can use it.
    • No, that doesn't help. What's slow is accessing all the objects that need to be accessed.
  1. Implement custom graph walk which won't run down parents that don't have the path in question.
  2. Implement access of trees from commit-graph and fill that information into the traversal info by default.
  3. commit-graph with bloom filter, used to quickly check if a commit has a path.
Commit to track object access
commit 5138be4045149c8023c43f094bab9cb39ca1a3d0
Author: Sebastian Thiel <[email protected]>
Date:   Wed Dec 25 16:00:28 2024 +0100

    Try to make the slowest parts much faster with a custom cache

diff --git a/Cargo.lock b/Cargo.lock
index 365fb927f..c84dff747 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1543,6 +1543,7 @@ dependencies = [
  "gix-filter",
  "gix-fs 0.12.1",
  "gix-hash 0.15.1",
+ "gix-hashtable 0.6.0",
  "gix-index 0.37.0",
  "gix-object 0.46.1",
  "gix-odb",
diff --git a/gix-blame/Cargo.toml b/gix-blame/Cargo.toml
index fc4baf3fe..f2c3478bf 100644
--- a/gix-blame/Cargo.toml
+++ b/gix-blame/Cargo.toml
@@ -20,6 +20,7 @@ gix-object = { version = "^0.46.0", path = "../gix-object" }
 gix-hash = { version = "^0.15.0", path = "../gix-hash" }
 gix-worktree = { version = "^0.38.0", path = "../gix-worktree", default-features = false, features = ["attributes"] }
 gix-traverse = { version = "^0.43.0", path = "../gix-traverse" }
+gix-hashtable = { version = "^0.6.0", path = "../gix-hashtable" }
 
 thiserror = "2.0.0"
 
diff --git a/gix-blame/src/file/function.rs b/gix-blame/src/file/function.rs
index 16384638e..05ed31751 100644
--- a/gix-blame/src/file/function.rs
+++ b/gix-blame/src/file/function.rs
@@ -1,10 +1,12 @@
 use super::{process_changes, Change, UnblamedHunk};
 use crate::{BlameEntry, Error, Outcome, Statistics};
 use gix_diff::blob::intern::TokenSource;
-use gix_hash::ObjectId;
-use gix_object::{bstr::BStr, FindExt};
+use gix_hash::{oid, ObjectId};
+use gix_object::{bstr::BStr, Data, FindExt, Header};
+use std::cell::RefCell;
 use std::num::NonZeroU32;
 use std::ops::Range;
+use std::sync::atomic::{AtomicUsize, Ordering};
 
 /// Produce a list of consecutive [`BlameEntry`] instances to indicate in which commits the ranges of the file
 /// at `traverse[0]:<file_path>` originated in.
@@ -66,6 +68,7 @@ where
         return Err(Error::EmptyTraversal);
     };
     let _span = gix_trace::coarse!("gix_blame::file()", ?file_path, ?suspect);
+    let odb = DebugOdb::new(odb);
 
     let mut stats = Statistics::default();
     let (mut buf, mut buf2, mut buf3) = (Vec::new(), Vec::new(), Vec::new());
@@ -213,6 +216,9 @@ where
     // I don’t know yet whether it would make sense to use a data structure instead that preserves
     // order on insertion.
     out.sort_by(|a, b| a.start_in_blamed_file.cmp(&b.start_in_blamed_file));
+
+    stats.odb_lookups = odb.lookups.load(Ordering::Relaxed);
+    stats.cacheable_odb_lookups = odb.cacheable.load(Ordering::Relaxed);
     Ok(Outcome {
         entries: coalesce_blame_entries(out),
         blob: blamed_file_blob,
@@ -300,7 +306,7 @@ fn coalesce_blame_entries(lines_blamed: Vec<BlameEntry>) -> Vec<BlameEntry> {
 
 #[allow(clippy::too_many_arguments)]
 fn tree_diff_at_file_path(
-    odb: impl gix_object::Find + gix_object::FindHeader,
+    odb: impl gix_object::Find,
     file_path: &BStr,
     id: ObjectId,
     parent_id: ObjectId,
@@ -449,3 +455,47 @@ fn find_path_entry_in_commit(
 pub(crate) fn tokens_for_diffing(data: &[u8]) -> impl TokenSource<Token = &[u8]> {
     gix_diff::blob::sources::byte_lines_with_terminator(data)
 }
+
+struct DebugOdb<Odb> {
+    inner: Odb,
+    seen: RefCell<gix_hashtable::HashSet<gix_hash::ObjectId>>,
+    lookups: AtomicUsize,
+    cacheable: AtomicUsize,
+}
+
+impl<Odb> gix_object::Find for DebugOdb<Odb>
+where
+    Odb: gix_object::Find,
+{
+    fn try_find<'a>(&self, id: &oid, buffer: &'a mut Vec<u8>) -> Result<Option<Data<'a>>, gix_object::find::Error> {
+        let res = self.inner.try_find(id, buffer)?;
+        if res.as_ref().map_or(false, |d| d.kind.is_tree()) {
+            if self.seen.borrow_mut().insert(id.to_owned()) {
+                self.lookups.fetch_add(1, Ordering::Relaxed);
+            } else {
+                self.cacheable.fetch_add(1, Ordering::Relaxed);
+            }
+        }
+        Ok(res)
+    }
+}
+
+impl<Odb> gix_object::FindHeader for DebugOdb<Odb>
+where
+    Odb: gix_object::FindHeader,
+{
+    fn try_header(&self, id: &oid) -> Result<Option<Header>, gix_object::find::Error> {
+        self.inner.try_header(id)
+    }
+}
+
+impl<Odb> DebugOdb<Odb> {
+    fn new(inner: Odb) -> Self {
+        DebugOdb {
+            inner,
+            seen: RefCell::new(Default::default()),
+            lookups: Default::default(),
+            cacheable: Default::default(),
+        }
+    }
+}
diff --git a/gix-blame/src/types.rs b/gix-blame/src/types.rs
index e0c8843b4..b4664ad50 100644
--- a/gix-blame/src/types.rs
+++ b/gix-blame/src/types.rs
@@ -20,7 +20,7 @@ pub struct Outcome {
 }
 
 /// Additional information about the performed operations.
-#[derive(Debug, Default, Copy, Clone)]
+#[derive(Debug, Default, Clone, Copy)]
 pub struct Statistics {
     /// The amount of commits it traversed until the blame was complete.
     pub commits_traversed: usize,
@@ -33,6 +33,11 @@ pub struct Statistics {
     /// The amount of blobs there were compared to each other to learn what changed between commits.
     /// Note that in order to diff a blob, one needs to load both versions from the database.
     pub blobs_diffed: usize,
+
+    /// testing
+    pub odb_lookups: usize,
+    /// testing
+    pub cacheable_odb_lookups: usize,
 }
 
 impl Outcome {

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great to see this PR and I am looking forward to witnessing it evolve over time :)!

gix-blame/tests/blame.rs Outdated Show resolved Hide resolved
git commit -q --allow-empty -m c5
git tag at-c5
git merge branch1 -m m1b1
echo "line 2" >> file.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd do the same - start extremely simple, maybe do the first rough implementation so it passes this test, and then think of some tougher cases to throw at it, validating that they still come out right.

Maybe it's worth investing into a baseline test which parses the output of git blame to get the expected output, which then has to be matched by the algorithm. These are typically the second stage as they are more obscure, but make it easier to get correct expectations for a lot of possibly complex inputs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the input, sounds like a great plan!

@cruessler
Copy link
Contributor Author

@Byron I think I’ve now gotten to a point where I’ve learned enough about gixoxide’s APIs in order to hopefully be able write the first meaningful test. Would you mind looking at the PR’s current state? I’m not sure whether I it would be better to try and find a good API for struct Blame or to continue adding to the already quite large test it_works.

My goal now is to write a test that runs a complete blame for the sample repo that is part of this PR.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, you already dug deep into the most complex APIs gitoxide has to offer, I am pretty sure it can't get much worse than that. Having it, however, helps to know what data is needed which will help tremendously with defining the API later.

I think continuing with the test to try and complete the algorithm would be good, even though I'd probably start with refactoring the 'setup' portion of the test a little more to be able to focus on the blame algorithm.

But that's details, I think you are on the right track and this PR already does everything that's needed to do a blame.

Once the core algorithm is proven, as most naive and simple version of course, I'd probably create a baseline of some samples with git itself, parse the result and use it as expectation for the algorithm created here.

I hope that helps.


use gix_diff::blob::intern::Token;
use gix_hash::ObjectId;
use gix_odb::pack::FindExt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd want to use the equally named gix_object::FindExt, it's easier to use.

@cruessler
Copy link
Contributor Author

Thanks, that’s very helpful already! I think I wanted to mostly make sure I’m still on the right track and not missing anything. I’ll continue working on the test, then, and worry about the public API later.

@Byron
Copy link
Member

Byron commented Aug 11, 2024

Yes, I think it's great to keep up the explorative approach using cargo test and then put that behind an API for abstraction, along with the first simple 'actual' test. And from there, one can iterate to validate more complex cases, while slowly getting to the kind of observable and interruptible API that will eventually be needed for gitui.

@cruessler
Copy link
Contributor Author

@Byron This is a quick status update! The algorithm is now able to generate a blame for the sample repo. I extended the test to compare this blame against git blame’s output. Next, I will review the algorithm in more detail (the loop count, for example, is still hard-coded), then I would start adding more cases, trying to abstract/extract pieces of code on the way.

@cruessler
Copy link
Contributor Author

@Byron Sooner than expected, there seems to have been quite a bit of progress! I extended the test repo so that it now also contains commits that don’t touch the file being blamed. I also added a file that contains multiline hunks. There’s now 2 test cases that cover one file each. Also, most of the logic is now inside the loop, so I was able to delete a couple of lines. I was thinking of maybe trying to tackle merge commits next. In its current state, the algorithm only diffs against a single parent. What do you think? (Also, I’ll address your comment above.)

@Byron
Copy link
Member

Byron commented Aug 23, 2024

That's great news!

In its current state, the algorithm only diffs against a single parent.

That's interesting - I just know that Git usually ignores merge-commits, but I don't know if it includes them in blames nonetheless. I never thought about how that would work, and would probably see what Git would do. It's something I do in general, that is, try to understand what Git can do and ask if this would (eventually) be possible with the gitoxide implementation. When it comes to diffing, I definitely have a lot to learn.

Besides that, I definitely recommend to keep clippy happy and run it locally at least occasionally - ideally CI can stay green even during early development, and I found it worth it as it typically leads to cleaner code. Often I find myself doing a push&run, just to return one more time to quickly fix CI if it breaks so I can continue 'clean' in the morning :D.

@cruessler
Copy link
Contributor Author

@Byron I think I’m now at a point where I’d appreciate a brief sync or a bit of feedback on how to best continue. Overall, I’m happy with the results so far. I’ve tested the implementation on linear commit histories with up to 90 commits, and for the files that I’ve blamed, the results match git’s results. At this point, my next goals are centered around gaining more confidence in the implementation and to adding parts that are still missing compared to git’s implementation. Also, I had a lot of fun working on this PR, so thanks for giving me the space! :-)

These are some observations and remaining issues:

  • Merges still need to be taken into account, in the sense that while git indeed seems to ignore them, my implementation currently only follows one parent instead of all parents.
  • There seem to be differences between imara-diff and git diff. I want to do more research on this topic as it directly affects the result of a blame.
  • Running the current implementation on even slightly larger repositories (I ran some tests inside gitoxide) is slow. I suspect this is due to the part where two trees are diffed to find out whether the file under blame was changed. I might also just have missed an API that does exactly what I want.
  • Some of the tests are quite verbose, there’s probably opportunities for some de-duplication using macros.
  • Most of the recent fixes didn’t add test cases to make_blame_repo.sh, but I instead added very specific test cases like process_change_works_added_hunk that only test a single iteration of the basic building block.
  • The implementation doesn’t follow renames.
  • I even found a difference between git and libgit2 when it comes to coalescing blame entries before returning them to the caller. I’ve documented that fact in a comment.

This is a bit of a braindump and it probably is missing some important context. I hope that is’s useful anyway and am happy to elaborate on anything as needed. :-)

@cruessler
Copy link
Contributor Author

Also, there’s errors in the CI pipeline, but they don’t seem to be related to any change in this PR.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I had a lot of fun working on this PR, so thanks for giving me the space! :-)

I am glad to hear that, and love the progress you made. This is the PoC already, proving that it generally works - awesome!

@Byron I think I’m now at a point where I’d appreciate a brief sync or a bit of feedback on how to best continue. Overall, I’m happy with the results so far. I’ve tested the implementation on linear commit histories with up to 90 commits, and for the files that I’ve blamed, the results match git’s results.

I could also imagine to work towards getting a bare-bones version of gix blame quickly so that the API can already be evolved to make the non-interactive case usable, while also opening it up for more manual and performance tests.

  • Merges still need to be taken into account, in the sense that while git indeed seems to ignore them, my implementation currently only follows one parent instead of all parents.

As gix-blame is on a plumbing level, it wouldn't have to care about this at all. However, upper layers will make decisions about that, and these typically align with Git when feasible.

  • There seem to be differences between imara-diff and git diff. I want to do more research on this topic as it directly affects the result of a blame.

Once you found one and can reproduce it with a simple diff, we would make it reproducible and work on fixing imara-diff. Please note that higher-levels also configure the diff algorithm correctly, so git blame might be different if it picks up a different value for diff.algorithm. But it's very possible that the issue is legitimate and it will be an ultra-high-priority fix for me.

  • Running the current implementation on even slightly larger repositories (I ran some tests inside gitoxide) is slow. I suspect this is due to the part where two trees are diffed to find out whether the file under blame was changed. I might also just have missed an API that does exactly what I want.

Tree-diffs need object-caches setup to be competitive, and also need the best possible zlib backend (zlib-ng). Both are easy to miss.

  • Some of the tests are quite verbose, there’s probably opportunities for some de-duplication using macros.

I did some of that, it looks less noisy now. To me it seems fine, particularly because there only appear to be two hard todo!()s left which I'd think you can lure out with two more of these tests. So it's unlikely these more verbose tests keep growing forever.

  • Most of the recent fixes didn’t add test cases to make_blame_repo.sh, but I instead added very specific test cases like process_change_works_added_hunk that only test a single iteration of the basic building block.

I love that.

  • The implementation doesn’t follow renames.

It's rename tracking that is setup by the higher-level. But probably blame also needs to know about it, and eventually that should probably also be implemented.
However, it can wait until the API is a bit more fleshed out.

  • I even found a difference between git and libgit2 when it comes to coalescing blame entries before returning them to the caller. I’ve documented that fact in a comment.

That's a great catch and the level of diligence that gitoxide requires. Of course, ideally it's not applied as post-process, but we might get there in stage three: make it work fast, now it's stage one: make it work.

Next Steps

Whenever you feel ready, I recommend to address the TODO's I left in Cargo.toml and adjust the API to not do any of the initialization work itself.
From there, my next step would be to make it usable in gix blame (even without wiring it up in gix (the crate) for now), so be able to validate the API for non-interactive cases. What's most important, and something that's not tested at all right now, is that having a gix blame will make clear how it shall be possible for callers to access actual lines of code.

To do that correctly, they'd need access to the original interner and the tokens behind the ranges that are used in the blame. It's really important to keep that in sync, and avoid making the impression that the caller should re-split blobs into lines.

I hope that helps to continue, please feel free to ask follow-up questions.

PS: I am very happy with the level of independence of your work - this review and refactor took me a little more than 2 hours, which is on the low-end for the stage the feature is in (i.e. it's seemingly nearly working correctly).

@cruessler
Copy link
Contributor Author

Thanks for the review and the feedback, that’s very helpful! I’ll have a look and continue working or ask follow-up questions. :-)

@Byron
Copy link
Member

Byron commented Sep 23, 2024

It's going to be good, thanks again for all your work on this! My greatest hope is that you keep having fun :).

To do that correctly, they'd need access to the original interner and the tokens behind the ranges that are used in the blame. It's really important to keep that in sync, and avoid making the impression that the caller should re-split blobs into lines.

Maybe re-using the tokens isn't feasible after all as each diff creates a new set of them. The interner can keep all of the tokens, probably a good thing also for performance, but I wonder if it's feasible to keep all tokens around. In the end, each one is only a u32, so memory isn't an issue. It's more about ease of use… and… thinking about it more… reproducing the blame lines from just token ranges can be done if the caller would re-access all blobs that were used, and re-split the lines exactly like was done for the diff. So ideally, that's controllable from the outside, making the diff more abstract.

Lastly, my feeling here is that it would be best if tokens would be made available on a per-blob basis so that a single interner can keep all of these tokens for later access.

But the above are just thoughts, gix blame will probably make clear quickly what it 'wants' to be.

@cruessler
Copy link
Contributor Author

@Byron I think I was able to identify a case where imara-diff differs from git diff.

Before:

  line 1

  line 2

  line 3

After:

  line 1

  line in between

  line 2

  line in between

  line 3

imara-diff breaks this up as [Unchanged(0..2), Added(2..4, 0), Unchanged(4..5), Added(5..7, 0), Unchanged(7..9)] (ranges are exclusive at the end) while git diff comes to a different result: [Unchanged(0..2), Added(2..4, 0), Unchanged(4..6), Added(6..8, 0), Unchanged(8..9)]. This is with diff.algorithm = histogram and Algorithm::Histogram.

The difference is with respect to the newlines surrounding the second line in between. Interestingly enough, I was not able to observe this behaviour when there was only one block containing line in between added.

@Byron
Copy link
Member

Byron commented Sep 24, 2024

That's great! Could this be reproduced with a baseline test, maybe also producing versions with Meyers and Histogram respectively?
It's notable that the diff takes newlines into account, and it's something that Git might have special behaviour around, I didn't check.

Before reeling in Pascal who can most certainly enlighten us, I'd love to have a failing test along with all the facts that we can collect.

Thanks again for your help with this.

@cruessler
Copy link
Contributor Author

I just added two failing test cases. Since blame_file doesn’t take the algorithm as an argument yet, there’s possibly a mismatch for the Myers case (because the algorithm is hard-coded to Algorithm::Histogram currently), but I wanted to add the test case anyway.

@cruessler
Copy link
Contributor Author

I’ve also moved most of the setup out of blame_file. Is that what you had in mind? There’s still dependencies left on gix-ref and gix-traverse, but they seem to be a bit harder to remove.

@Byron
Copy link
Member

Byron commented Sep 24, 2024

Great, thanks, I will take a closer look later.

I’ve also moved most of the setup out of blame_file. Is that what you had in mind? There’s still dependencies left on gix-ref and gix-traverse, but they seem to be a bit harder to remove.

Yes, that's exactly it. I thought traversal can be reduced to an iterator yielding object-ids, but it's nothing I validated and I am probably missing an important detail.
Generally, I am optimistic that both gix-ref and gix-traverse can go away eventually.

@cruessler
Copy link
Contributor Author

Great, thanks, I will take a closer look later.

I’ve also moved most of the setup out of blame_file. Is that what you had in mind? There’s still dependencies left on gix-ref and gix-traverse, but they seem to be a bit harder to remove.

Yes, that's exactly it. I thought traversal can be reduced to an iterator yielding object-ids, but it's nothing I validated and I am probably missing an important detail. Generally, I am optimistic that both gix-ref and gix-traverse can go away eventually.

The issue might be that we need to take an Iterator that yields Items that themselves have commit_iter() which yields items that have parent_ids(), so there’s a bit of nesting. I imagine it’s certainly possible to abstract that into a trait or something, but it felt less like a low-hanging fruit than the previous changes. :-)

The dependency on gix_ref is only for BStr as far as I can see which can also be imported from gix_object I found out (gix-diff does that, but it also directly imports from bstr).

@Byron
Copy link
Member

Byron commented Sep 24, 2024

I took a closer look and have factored out the remaining bits that can be removed in a straightforward fashion. Maybe that's mostly it already.

The test-case is wonderful by the way as it should be straightforward to narrow down the cause of the issue, or at least get to a point where Pascal can be invited.

@Byron
Copy link
Member

Byron commented Sep 24, 2024

I think cry for help it is, @pascalkuthe 😅.

After applying this patch:

diff --git a/gix-blame/src/lib.rs b/gix-blame/src/lib.rs
index a737965a6..c4ec75cfa 100644
--- a/gix-blame/src/lib.rs
+++ b/gix-blame/src/lib.rs
@@ -815,6 +815,13 @@ pub fn blame_file<E>(
 
         let outcome = resource_cache.prepare_diff().unwrap();
         let input = outcome.interned_input();
+        for token in &input.before {
+            eprintln!("{:?}", input.interner[*token].as_bstr());
+        }
+        dbg!("after");
+        for token in &input.after {
+            eprintln!("{:?}", input.interner[*token].as_bstr());
+        }
         let number_of_lines_in_destination = input.after.len();
         let change_recorder = ChangeRecorder::new(number_of_lines_in_destination.try_into().unwrap());
         let changes = gix_diff::blob::diff(gix_diff::blob::Algorithm::Histogram, &input, change_recorder);
diff --git a/gix-blame/tests/blame.rs b/gix-blame/tests/blame.rs
index 437e3de59..039005e9f 100644
--- a/gix-blame/tests/blame.rs
+++ b/gix-blame/tests/blame.rs
@@ -219,7 +219,7 @@ mktest!(same_line_changed_twice, "same-line-changed-twice", 2);
 mktest!(coalesce_adjacent_hunks, "coalesce-adjacent-hunks", 1);
 
 #[test]
-#[ignore = "TBD: figure out what the problem is"]
+// #[ignore = "TBD: figure out what the problem is"]
 // As of 2024-09-24, these tests are expected to fail.
 //
 // Context: https://github.com/Byron/gitoxide/pull/1453#issuecomment-2371013904

it shows the following:

"  line 1\n"
"\n"
"  line 2\n"
"\n"
"  line 3\n"
[gix-blame/src/lib.rs:821:9] "after" = "after"
"  line 1\n"
"\n"
"  line in between\n"
"\n"
"  line 2\n"
"\n"
"  line in between\n"
"\n"
"  line 3\n"

From this I see that the tokens are exactly what's expected as the basis for a Histogram diff. It's notable that Myers and MyersMinimal both produce the same output.

The baseline for the expected result is the following.

Screenshot 2024-09-24 at 20 40 31

It's well balanced and also quite intuitive.

However, our result seems off:

Diff < left / right > :
 [
     BlameEntry {
         range_in_blamed_file: 0..2,
         range_in_original_file: 0..2,
         commit_id: Sha1(0b56209a29a36af25c845a93f28ce35c1ac3de09),
     },
     BlameEntry {
         range_in_blamed_file: 2..4,
         range_in_original_file: 2..4,
         commit_id: Sha1(6d23453055b4a855092a0be7b3e4741414c1e547),
     },
     BlameEntry {
<        range_in_blamed_file: 4..5,
<        range_in_original_file: 2..3,
>        range_in_blamed_file: 4..6,
>        range_in_original_file: 2..4,
         commit_id: Sha1(0b56209a29a36af25c845a93f28ce35c1ac3de09),
     },
     BlameEntry {
<        range_in_blamed_file: 5..7,
<        range_in_original_file: 5..7,
>        range_in_blamed_file: 6..8,
>        range_in_original_file: 6..8,
         commit_id: Sha1(6d23453055b4a855092a0be7b3e4741414c1e547),
     },
     BlameEntry {
<        range_in_blamed_file: 7..9,
<        range_in_original_file: 3..5,
>        range_in_blamed_file: 8..9,
>        range_in_original_file: 4..5,
         commit_id: Sha1(0b56209a29a36af25c845a93f28ce35c1ac3de09),
     },
 ]

Just to be sure that it's not the hunk-recorder that's the actual foundation for the blame, I also printed the hunks that it finds.

[gix-blame/src/lib.rs:254:9] &before = 2..2
[gix-blame/src/lib.rs:254:9] &after = 2..4
[gix-blame/src/lib.rs:254:9] &before = 3..3
[gix-blame/src/lib.rs:254:9] &after = 5..7

They seem to fit.

I'd be happy to investigate this further, but wanted to ask you, @pascalkuthe , if you have any ideas to guide this. Maybe it's super-obvious to you what's going on.

Something I can imagine is happening here is that blame does something special beyond the normal diffing, I myself didn't have any look at it yet.

Thanks for sharing your thoughts.

@pascalkuthe
Copy link
Contributor

at first glance this looks like the slider problem (https://github.com/mhagger/diff-slider-tools/blob/master/README.md).

Basically the algorithm can say that your diff is:

"Line 1\n"
+"\n"
+"  line in between\n"
\n

or it can say that the diff is:

"Line 1\n"
"\n"
+"  line in between\n"
+ \n

Both are correct (minimal) difss so there is nothing wrong with the diff algorithm/results it's just that there are multiple different possible choices a diff can make. Git has some postprocessing heuristics to make a choice but from an objective point of view one choice isn't wrong just that some choices feel more intuitive to humans so it's not a bug in the diff algorithm just a missing feature.

Git resolves those ambiguities by a heuristic. Basic summary:

  • slide sliders as far down and up as possible to check if multiple hunks can be merged
  • finally slide the slider as far down as possible
  • a fairly complex indentation based heuristic developed and described in that repo I linked (semi-recent addition to git)

I have an implementation of the first two coupled with some (necessary) API change that was aimed to be the first breaking imara-diff change. I started porting helix to the new API but got a bit stuck timewise and didn't get around to finishing that.

The third heuristic is very involved implementation wise and also difficult to add from an API perspective since it only makes sense for (line-segmented) text but imara-diff operates on abstract tokens and allows diffing arbitrary data (segmented in arbitrary ways).

I think we could aim to get the basic heuristic in, the more advanced indentation based code I am not sure I have time to work on at the moment. For this PR I would suggest to just update the fixtures to match whatever imara-diff produces and then update them as imara-diff improves in that regard.

@cruessler
Copy link
Contributor Author

Quick side note to my last commit: I’ve had to add another parameter to blame_file, needed to associate the first UnblamedHunk with an ObjectId. This feels like duplication, though, as that information should, in theory, also be accessible through traverse. Since I did not quickly find a way to get start_id from traverse, I decided to add it as a parameter to blame_file.

@Byron
Copy link
Member

Byron commented Sep 25, 2024

at first glance this looks like the slider problem (https://github.com/mhagger/diff-slider-tools/blob/master/README.md).

Thanks so much for sharing!

I also do remember now that you already shared this with me, probably in person, so it's clear that these heuristics are a tuneable and there isn't really a right answer.

After reading at the link location really makes me think that imara-diff will one day outperform Git (if it offers new heurstics):

An implementation of an alternative slider positioning heuristic that significantly outperforms both git diff and git diff --compaction-heuristic as of Git 2.9.0.

Thus, this is likely going to be a place where gitoxide will differ from Git and hopefully at some point it will differ to the better.

Until then, we go with your suggestion and just keep these examples in our baselines and mark them accordingly. Maybe one day they fit, or one day we mark ours as preferred.

PS: I really can't wait for these heuristics to be implemented, can't wait for ident-new :D.

@cruessler
Copy link
Contributor Author

@Byron This is another quick status update: I’ve now refactored and simplified the implementation a bit, mostly to make the concepts used more easy to discern. I’m also quite far already in my approach to being able to blame commits with more than one parent. So, from my side, the point gets closer where I would call the first implementation done. For context, I just wanted to ask for feedback, in particular on how and when you think this should be merged. I’m mainly asking because there’s now more than one area I’d like to focus on, and I can imagine it might be easier to spread the work into PRs that are more specific to certain aspects, instead of keeping everything in this one. I can still follow it, but it’s already very layered in the sense that there’s quite a few discussions and threads overlapping. :-) What are your thoughts? I’m also fine with continuing in this PR, so whatever works best for you!

@Byron
Copy link
Member

Byron commented Sep 28, 2024

Thanks for the update :).

Generally I am happy to follow your preference when it comes to merging, so what follows is just my preference, which isn't the deciding factor here.

What I'd do before merging is to 'quickly' wire it up into gix blame to be forced to reproduce the lines with the blame information. The last time I checked, this would have been difficult or at least easy to get wrong.

And since you probably have some more areas of work already in your mind, I think it would be good to collect them in a task list or tracking issue, so we both can keep an overview. It's probably best if you'd create a tracking issue for that so you can edit it primarily.

In summary, a tracking issue we should have, and everything else is up to you then - you can ask me to review and merge, and that's what I will do, with or without gix blame.

I hope that helps.

@cruessler
Copy link
Contributor Author

Thanks, that helps a lot! I think I’ll give wiring it up into gix blame a try, and I’ll create a tracking issue.

@cruessler
Copy link
Contributor Author

Quick question: did I put the code for wiring up gix blame in the right locations? (Feel free to change as necessary!)

They are used with inverse meaning compared to the current documentation.
It's easier to adjust the variable names.
@Byron
Copy link
Member

Byron commented Dec 25, 2024

And it's nearly there, all that seems left is a more standardized handling of the input path in the CLI.

When benchmarking/testing more, I came across this gem as it appears to be possible to try to an unblamed hunk into a blame entry without the suspect being present there.

This occurred in the Git repository itself.

git ( master) [?] via 🐍
❯ git rev-parse @
d882f382b3d939d90cfa58d17b17802338f05d66

git ( master) [?] via 🐍
❯ cargo run --manifest-path /Users/byron/dev/github.com/GitoxideLabs/gitoxide/Cargo.toml --bin gix -- --verbose --trace blame -s README.md
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.17s
     Running `/Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/debug/gix --verbose --trace blame -s README.md`
thread 'main' panicked at gix-blame/src/file/mod.rs:475:14:
Private and only called when we now `commit_id` is in the suspect list
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The .expect() was an unwrap() previously.

When 'looking for trouble' with git ls-files | xargs -n1 /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/release/gix --verbose --trace blame -s other files would also show this exact issue.

Your help would be appreciated, as I didn't really get to grasping how suspects really work. But I hope to get there.

@Byron
Copy link
Member

Byron commented Dec 25, 2024

I just 'fixed' it by assuming it's naturally possible that there is no entry for a suspect, so there is no need to fail. Now it runs through and produces the correct result.
In the background I am collecting more data and blame every file in the Git repository. It takes a while, but my hope is that it won't panic anymore.

@Byron
Copy link
Member

Byron commented Dec 25, 2024

I am still on solidifying the implementation, and running it on every file in the Git repository is a good way to run into the last remaining .expect()s.

It's also a great motivation to try improving performance, as it takes a while otherwise.

Just to share early, I added a counter to see if it's worth it to cache information, and by the looks of it, this really is the case.

Statistics {
    commits_traversed: 33439,
    commits_to_tree: 66357,
    trees_decoded: 66357,
    trees_diffed: 29,
    blobs_diffed: 27,
    odb_lookups: 67556,
    cacheable_odb_lookups: 74971,
}

Just need to figure out if it's trees or commits that should be cached additionally, hoping that it is trees mostly.

Edit: it definitely runs into every single assertion :D - the real world knows no mercy.

@Byron
Copy link
Member

Byron commented Dec 25, 2024

And a pretty good performance example (without commit-graph/bloom filter) where Git is 7.5x faster:

git ( master) [?] via 🐍
❯ /usr/bin/time -lp git blame t/t4013/diff.rev-list_--children_HEAD
15 years ago    Thomas Rast     fcbc6efc7c5│  1 │ $ git rev-list --children HEAD
                                           │  2 │ 59d314ad6f356dd08601a4cd5e530381da3e3c64
                                           │  3 │ c7a2ab9e8eac7b117442a607d5a9b3950ae34d5a 59d314ad6f356dd08601a4cd5e530381da3e3c64
                                           │  4 │ 9a6d4949b6b76956d9d5e26f2791ec2ceff5fdc0 59d314ad6f356dd08601a4cd5e530381da3e3c64
                                           │  5 │ 1bde4ae5f36c8d9abe3a0fce0c6aab3c4a12fe44 9a6d4949b6b76956d9d5e26f2791ec2ceff5fdc0
                                           │  6 │ 444ac553ac7612cc88969031b02b3767fb8a353a 1bde4ae5f36c8d9abe3a0fce0c6aab3c4a12fe44 c7a2ab9e8eac7b117442a607d5a9b3950ae34d5a
                                           │  7 │ $
real 0.48
user 0.43
sys 0.04
           335085568  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               22710  page reclaims
                  25  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   5  voluntary context switches
                  50  involuntary context switches
          7667095492  instructions retired
          1859928922  cycles elapsed
           139412224  peak memory footprint

git ( master) [?] via 🐍
❯

git ( master) [?] via 🐍
❯ /usr/bin/time -lp git blame t/t4013/diff.rev-list_--children_HEAD

git ( master) [?] via 🐍
❯ /usr/bin/time -lp /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/release/gix --verbose --trace  blame -s t/t4013/diff.rev-list_--children_HEAD
 16:46:34 tracing INFO     run [ 4.12s | 17.02% / 100.00% ]
 16:46:34 tracing INFO     ┝━ ThreadSafeRepository::discover() [ 656µs | 0.00% / 0.02% ]
 16:46:34 tracing INFO     │  ┕━ open_from_paths() [ 578µs | 0.00% / 0.01% ]
 16:46:34 tracing INFO     │     ┕━ gix_odb::Store::at() [ 381µs | 0.01% ]
 16:46:34 tracing INFO     ┝━ gix_index::File::at() [ 382µs | 0.00% / 0.01% ]
 16:46:34 tracing DEBUG    │  ┝━ gix::open_index::hash_index [ 181µs | 0.00% ] path: "./.git/index"
 16:46:34 tracing DEBUG    │  ┕━ gix_index::State::from_bytes() [ 144µs | 0.00% ] options: Options { thread_limit: Some(0), min_extension_block_in_bytes_for_threading: 0, expected_checksum: None }
 16:46:34 tracing INFO     ┕━ gix_blame::file() [ 3.42s | 82.96% ] file_path: "t/t4013/diff.rev-list_--children_HEAD" | suspect: Sha1(d882f382b3d939d90cfa58d17b17802338f05d66)
fcbc6efc 1 1 $ git rev-list --children HEAD
fcbc6efc 2 2 59d314ad6f356dd08601a4cd5e530381da3e3c64
fcbc6efc 3 3 c7a2ab9e8eac7b117442a607d5a9b3950ae34d5a 59d314ad6f356dd08601a4cd5e530381da3e3c64
fcbc6efc 4 4 9a6d4949b6b76956d9d5e26f2791ec2ceff5fdc0 59d314ad6f356dd08601a4cd5e530381da3e3c64
fcbc6efc 5 5 1bde4ae5f36c8d9abe3a0fce0c6aab3c4a12fe44 9a6d4949b6b76956d9d5e26f2791ec2ceff5fdc0
fcbc6efc 6 6 444ac553ac7612cc88969031b02b3767fb8a353a 1bde4ae5f36c8d9abe3a0fce0c6aab3c4a12fe44 c7a2ab9e8eac7b117442a607d5a9b3950ae34d5a
fcbc6efc 7 7 $
Statistics {
    commits_traversed: 57973,
    commits_to_tree: 114610,
    trees_decoded: 342268,
    trees_diffed: 1,
    blobs_diffed: 0,
}
real 4.13
user 4.09
sys 0.03
           206684160  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               13023  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                 118  involuntary context switches
         48247484655  instructions retired
         16734900095  cycles elapsed
            76841968  peak memory footprint

However, I think the problem is that we have tremendous uncached amounts of work to find the one commit that adds this file 55k commits ago. Git clearly manages to not redo much work.

@Byron
Copy link
Member

Byron commented Dec 25, 2024

Based on this worst case…

gix --verbose --trace  blame -s t/t4013/diff.rev-list_--children_HEAD

…I kept tinkering and got it to this point.

❯ /usr/bin/time -lp /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/release/gix --verbose --trace  blame -s t/t4013/diff.rev-list_--children_HEAD
 19:07:39 tracing INFO     run [ 2.97s | 1.09% / 100.00% ]
 19:07:39 tracing INFO     ┝━ ThreadSafeRepository::discover() [ 475µs | 0.00% / 0.02% ]
 19:07:39 tracing INFO     │  ┕━ open_from_paths() [ 395µs | 0.01% / 0.01% ]
 19:07:39 tracing INFO     │     ┕━ gix_odb::Store::at() [ 230µs | 0.01% ]
 19:07:39 tracing INFO     ┝━ gix_index::File::at() [ 372µs | 0.00% / 0.01% ]
 19:07:39 tracing DEBUG    │  ┝━ gix::open_index::hash_index [ 201µs | 0.01% ] path: "./.git/index"
 19:07:39 tracing DEBUG    │  ┕━ gix_index::State::from_bytes() [ 141µs | 0.00% ] options: Options { thread_limit: Some(0), min_extension_block_in_bytes_for_threading: 0, expected_checksum: None }
 19:07:39 tracing INFO     ┕━ gix_blame::file() [ 2.94s | 98.88% ] file_path: "t/t4013/diff.rev-list_--children_HEAD" | suspect: Sha1(d882f382b3d939d90cfa58d17b17802338f05d66)
fcbc6efc 1 1 $ git rev-list --children HEAD
fcbc6efc 2 2 59d314ad6f356dd08601a4cd5e530381da3e3c64
fcbc6efc 3 3 c7a2ab9e8eac7b117442a607d5a9b3950ae34d5a 59d314ad6f356dd08601a4cd5e530381da3e3c64
fcbc6efc 4 4 9a6d4949b6b76956d9d5e26f2791ec2ceff5fdc0 59d314ad6f356dd08601a4cd5e530381da3e3c64
fcbc6efc 5 5 1bde4ae5f36c8d9abe3a0fce0c6aab3c4a12fe44 9a6d4949b6b76956d9d5e26f2791ec2ceff5fdc0
fcbc6efc 6 6 444ac553ac7612cc88969031b02b3767fb8a353a 1bde4ae5f36c8d9abe3a0fce0c6aab3c4a12fe44 c7a2ab9e8eac7b117442a607d5a9b3950ae34d5a
fcbc6efc 7 7 $
Statistics {
    commits_traversed: 57973,
    commits_to_tree: 76186,
    trees_decoded: 226996,
    trees_diffed: 1,
    blobs_diffed: 0,
}
real 2.98
user 2.95
sys 0.03

I also know for sure that it takes 2.45s to decode 78494 trees for the first time and there is no way around that.

Statistics {
    commits_traversed: 57973,
    commits_to_tree: 76186,
    trees_decoded: 78496,
    trees_diffed: 1,
    blobs_diffed: 0,
    tree_cache_hits: 148500,
    tree_cache_misses: 78494,
    tree_cache_size: 78494,
}
real 5.57
user 4.78

The above is a version that aggressively cached each tree - it was slower overall (despite unoptimized) but the cache isn't much faster than what's already implemented, and the slow thing is accessing the 58k root trees, along with their unique subtrees to get to the file in question. The ever-growing tree-cache was about 3GB in size as well (with everything, unoptimzed) so absolutely not worth it.

Once again I am at a point where pack-access is slow, and object-caching isn't effective as it needs new objects. There is pack-entry caching which cuts delta-chains short. By default, the cache isn't very effective but so fast that it's still worth it in all the cases.

❯ /usr/bin/time -lp /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/release/gix --verbose --trace  blame -s t/t4013/diff.rev-list_--children_HEAD
StaticLinkedList<64>[12f023218]: 60985 / 395782 (hits/misses) = 15.41%, puts = 80244
MemoryCappedObjectHashmap(4796186B)[600002730508]: 15547 / 139260 (hits/misses) = 11.16%, puts = 139260

It's possible to make the cache much larger…

❯ GIX_PACK_CACHE_MEMORY=500m /usr/bin/time -lp /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/release/gix --verbose --trace  blame -s t/t4013/diff.rev-list_--children_HEAD
MemoryCappedHashmap(524288000B)[600000eb8238]: 81654 / 116634 (hits/misses) = 70.01%, puts = 80221
MemoryCappedObjectHashmap(4796186B)[6000000b8388]: 157668 / 145639 (hits/misses) = 108.26%, puts = 145639
[..]

…which brings the time down to 1.73s down from 3.04s, so it's pretty good. However, it's still a far cry from Git which finishes this in 0.43s, 4x (!!).

With the improved pack-cache, blame::file is completely dominated by looking up file-paths in the trees of all commits.

Screenshot 2024-12-25 at 19 26 30

And within that function, about 60% go into tree decoding, 20% each go into finding the commit and then the root tree of the commit.

Screenshot 2024-12-25 at 19 36 54

Finally, let's compare this with Git:

Screenshot 2024-12-25 at 19 35 53

It even does full diffs all the time, but manages to skip over most. This is the moment where I realise that it picked up the commit-graph and bloom filters… .

Without the commit-graph, it looks like this:

Screenshot 2024-12-25 at 19 40 14

So… never mind, it's still ludicrously fast and I don't know how this is even possible? It has to access all these commits and trees and can't skip, especially without bloom filters. But it's still this fast?

The commit that will never be
commit 95569ac06cb53dabb675b9eae377e2e5b24b973d
Author: Sebastian Thiel <[email protected]>
Date:   Wed Dec 25 16:55:32 2024 +0100

    use a custom tree lookup implementation with our own cache.
    
    Maybe one day, this can be placed into `gix-object` as well.

diff --git a/Cargo.lock b/Cargo.lock
index 365fb927f..c84dff747 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1543,6 +1543,7 @@ dependencies = [
  "gix-filter",
  "gix-fs 0.12.1",
  "gix-hash 0.15.1",
+ "gix-hashtable 0.6.0",
  "gix-index 0.37.0",
  "gix-object 0.46.1",
  "gix-odb",
diff --git a/gix-blame/Cargo.toml b/gix-blame/Cargo.toml
index fc4baf3fe..f2c3478bf 100644
--- a/gix-blame/Cargo.toml
+++ b/gix-blame/Cargo.toml
@@ -20,6 +20,7 @@ gix-object = { version = "^0.46.0", path = "../gix-object" }
 gix-hash = { version = "^0.15.0", path = "../gix-hash" }
 gix-worktree = { version = "^0.38.0", path = "../gix-worktree", default-features = false, features = ["attributes"] }
 gix-traverse = { version = "^0.43.0", path = "../gix-traverse" }
+gix-hashtable = { version = "^0.6.0", path = "../gix-hashtable" }
 
 thiserror = "2.0.0"
 
diff --git a/gix-blame/src/file/function.rs b/gix-blame/src/file/function.rs
index f34749a41..bb9d48943 100644
--- a/gix-blame/src/file/function.rs
+++ b/gix-blame/src/file/function.rs
@@ -69,8 +69,11 @@ where
 
     let mut stats = Statistics::default();
     let (mut buf, mut buf2, mut buf3) = (Vec::new(), Vec::new(), Vec::new());
-    let blamed_file_entry_id = find_path_entry_in_commit(&odb, &suspect, file_path, &mut buf, &mut buf2, &mut stats)?
-        .ok_or_else(|| Error::FileMissing {
+    let mut trees = gix_hashtable::HashMap::default();
+    let blamed_file_entry_id = find_path_entry_in_commit(
+        &odb, &suspect, file_path, &mut buf, &mut buf2, &mut stats, &mut trees,
+    )?
+    .ok_or_else(|| Error::FileMissing {
         file_path: file_path.to_owned(),
         commit_id: suspect,
     })?;
@@ -129,7 +132,7 @@ where
             .filter(|(id, _)| *id == suspect)
             .map(|(_, entry)| entry);
         if entry.is_none() {
-            entry = find_path_entry_in_commit(&odb, &suspect, file_path, &mut buf, &mut buf2, &mut stats)?;
+            entry = find_path_entry_in_commit(&odb, &suspect, file_path, &mut buf, &mut buf2, &mut stats, &mut trees)?;
         }
 
         let Some(entry_id) = entry else {
@@ -138,7 +141,7 @@ where
 
         for (pid, parent_id) in parent_ids.iter().enumerate() {
             if let Some(parent_entry_id) =
-                find_path_entry_in_commit(&odb, parent_id, file_path, &mut buf, &mut buf2, &mut stats)?
+                find_path_entry_in_commit(&odb, parent_id, file_path, &mut buf, &mut buf2, &mut stats, &mut trees)?
             {
                 let no_change_in_entry = entry_id == parent_entry_id;
                 if pid == 0 {
@@ -215,6 +218,7 @@ where
     // I don’t know yet whether it would make sense to use a data structure instead that preserves
     // order on insertion.
     out.sort_by(|a, b| a.start_in_blamed_file.cmp(&b.start_in_blamed_file));
+    stats.tree_cache_size = trees.len();
     Ok(Outcome {
         entries: coalesce_blame_entries(out),
         blob: blamed_file_blob,
@@ -432,19 +436,62 @@ fn find_path_entry_in_commit(
     buf: &mut Vec<u8>,
     buf2: &mut Vec<u8>,
     stats: &mut Statistics,
+    cache: &mut gix_hashtable::HashMap<ObjectId, gix_object::Tree>,
 ) -> Result<Option<ObjectId>, Error> {
-    let commit_id = odb.find_commit(commit, buf)?.tree();
+    let tree_id = odb.find_commit(commit, buf)?.tree();
     stats.commits_to_tree += 1;
-    let tree_iter = odb.find_tree_iter(&commit_id, buf)?;
-    stats.trees_decoded += 1;
 
-    let res = tree_iter.lookup_entry(
-        odb,
-        buf2,
-        file_path.split(|b| *b == b'/').inspect(|_| stats.trees_decoded += 1),
-    )?;
-    stats.trees_decoded -= 1;
-    Ok(res.map(|e| e.oid))
+    Ok(lookup_entry_id_in_tree(file_path, tree_id, odb, cache, buf2, stats)?)
+}
+
+pub fn lookup_entry_id_in_tree(
+    path: &BStr,
+    mut tree_id: gix_hash::ObjectId,
+    odb: impl gix_object::Find,
+    cache: &mut gix_hashtable::HashMap<ObjectId, gix_object::Tree>,
+    buffer: &mut Vec<u8>,
+    stats: &mut Statistics,
+) -> Result<Option<ObjectId>, gix_object::find::existing_object::Error> {
+    let mut path = path.split(|b| *b == b'/').peekable();
+    while let Some(component) = path.next() {
+        let tree = match cache.get(&tree_id) {
+            Some(tree) => {
+                stats.tree_cache_hits += 1;
+                tree
+            }
+            None => {
+                // TODO(borrowchk): use entry-API once it's possible to mutate and return.
+                let mut tree = odb.find_tree(&tree_id, buffer)?.to_owned();
+                tree.entries.sort_by(|a, b| a.filename.cmp(&b.filename));
+
+                cache.insert(tree_id, tree);
+                stats.trees_decoded += 1;
+                stats.tree_cache_misses += 1;
+                cache.get(&tree_id).expect("just put")
+            }
+        };
+
+        use gix_object::bstr::ByteSlice;
+        match tree
+            .entries
+            .binary_search_by(|e| e.filename.as_bytes().cmp(&component))
+            .ok()
+        {
+            Some(entry_id) => {
+                let entry = &tree.entries[entry_id];
+                if path.peek().is_none() {
+                    return Ok(Some(entry.oid));
+                } else {
+                    if !entry.mode.is_tree() {
+                        return Ok(None);
+                    }
+                    tree_id = entry.oid;
+                }
+            }
+            None => return Ok(None),
+        }
+    }
+    Ok(None)
 }
 
 /// Return an iterator over tokens for use in diffing. These usually lines, but iit's important to unify them
diff --git a/gix-blame/src/types.rs b/gix-blame/src/types.rs
index e0c8843b4..4bfc3540d 100644
--- a/gix-blame/src/types.rs
+++ b/gix-blame/src/types.rs
@@ -33,6 +33,12 @@ pub struct Statistics {
     /// The amount of blobs there were compared to each other to learn what changed between commits.
     /// Note that in order to diff a blob, one needs to load both versions from the database.
     pub blobs_diffed: usize,
+    /// TODO
+    pub tree_cache_hits: usize,
+    /// TODO
+    pub tree_cache_misses: usize,
+    /// TODO
+    pub tree_cache_size: usize,
 }
 
 impl Outcome {

Comment on lines 211 to 212
/// Convert each of the unblamed hunk in `hunks_to_blame` into a [`BlameEntry`], consuming them in the process.
/// `suspect` is expected to be present in the suspect-map in each [`UnblamedHunk`].
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change it appears the comment needs to be updated to reflect the fact that we don’t expect suspect to be in suspects any longer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good catch - it will be fixed next time I push.

Please take a look at the updated comment above, it's my journey of trying to make it faster. But Git still has pixie dust that I simply can't replicate. Somehow it avoids a lot of work probably while defying physics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, sounds like it’s time to turn to my annotated version of git blame again. Maybe, at some point, it will tell its secrets. :-)

Copy link
Member

@Byron Byron Dec 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be really interesting if there were statistics about the object database. How many objects did it actually read? How long were typical pack delta chains?

The latter information is actually available in gitoxide, I just never wired it up

Either way, experiments with the pack-delta cache showed that the cache efficiency isn't getting better after a little more than 100MB, showing that everything that could be cached and reused was cached. One can go up to 2.1g pack delta cache, and one will still get the same amount of hits as before. So for this workload, one will have to read 116558 pack entries, and there is no way around it apparently.

update_tree_entry in Git takes 115ms, it's parsing tree entries. The respective version in gitoxide runs for 554ms. The key here seems to be skip_uninteresting, which consumes most of the time. I think a key is that it knows about the sort-order, so when it's past a particular entry it knows that a match cannot be found anymore.
gitoxide always has to search all entries in the tree, so spends a lot more time decoding everything in a tree.

The physics-defying thing still is that apparently it doesn't spend much time in the object database at all, something that costs gitoxide tremendous amounts of time.

Screenshot 2024-12-25 at 20 23 08

Edit: read_object_file_extended is what gets data from the object database. It just needs 158ms for everything! How is that possible? It's supposed to have read more than 100k objects. How many does it actually read?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell, git seems to lazily decode tree entries (and thus maybe trees?). skip_uninteresting sequentially calls update_tree_entry which calls decode_tree_entry under the hood until a blob that matches the given path (and file mode as far as I can tell) is found. skip_uninteresting seems to go through the available entries roughly in alphabetical order. This is what I observed running a printf-instrumented version of git blame STABILITY.md. Now I’m going to have a look at a larger repository. :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And decode_tree_entry appears to do almost nothing unless I’m missing something: https://github.com/git/git/blob/6258f68c3c1092c901337895c864073dcdea9213/tree-walk.c#L16-L47

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have some additional results, which I think explain the pixie-dust with science🧬 😁.

It's notable that this is not necessarily the total amount of objects that are fetched, but it's the objects that have to be read once from a pack.

…e-entry

This can apparently happen, and now we handle this case and keep looking
for the remaining blame entries.

Also, exit early when no work is left to be done.
@cruessler
Copy link
Contributor Author

I just 'fixed' it by assuming it's naturally possible that there is no entry for a suspect, so there is no need to fail. Now it runs through and produces the correct result. In the background I am collecting more data and blame every file in the Git repository. It takes a while, but my hope is that it won't panic anymore.

From what I remember that can happen when we pass blame to a single parent only out of more than one. traverse will still yield the commits that we know can’t contain a relevant change, and those commits don’t have an entry in suspects.

Also, set a fixed and higher pack-cache to double typical pack-decode
performance in mid-sized repositories.
Additionally, normalize the input path.
@Byron
Copy link
Member

Byron commented Dec 25, 2024

I think I have got it, the proof that Git is just doing significantly less, magically… .

GIT_TRACE=true \
GIT_TRACE_PACK_ACCESS=true \
GIT_TRACE_PACKET=true \
GIT_TRACE_PACKFILE=true \
GIT_TRACE_PERFORMANCE=true \
GIT_TRACE_SHALLOW=true \
GIT_TRACE_SETUP=true \
/usr/bin/time -lp git blame t/t4013/diff.rev-list_--children_HEAD 2>&1 | rg packfile.c | wc -l
   27963

It reads only 28k objects in total, whereas we read a whopping 78553 unique objects, and fetch a total of 227k objects! This must be largely due to the commit graph traversal, which definitely has to be done internally to get the same performance.

That's 3x more objects fetched, which costs time to do but also costs extra time to parse and handle. So I am sure that if we woudl read not more objects than Git, we'd be as fast or faster.

@cruessler
Copy link
Contributor Author

One reason we do extra work is in the case where a commit has multiple parents and we pass blame to just one of them. Even if we didn’t have to follow the other ancestors, this is a fact that traverse does not know about, so it still yields commits that we know won’t contain relevant changes. It would be interesting to verify that hypothesis in a repository that has a strictly linear commit history.

@Byron
Copy link
Member

Byron commented Dec 25, 2024

Now it's clear that Git asks for 41813 objects total, and 20864 of these are unique.

❯ GIT_TRACE=true \
GIT_TRACE_PACK_ACCESS=true \
/usr/bin/time -lp ./git blame t/t4013/diff.rev-list_--children_HEAD 2>&1 | rg object-file.c | awk '{ print $3 }' | wc -l
   41813

git ( master) +205 -237 [!?] via 🐍
❯ GIT_TRACE=true \
GIT_TRACE_PACK_ACCESS=true \
/usr/bin/time -lp ./git blame t/t4013/diff.rev-list_--children_HEAD 2>&1 | rg object-file.c | awk '{ print $3 }' | sort -u | wc -l
   20864

I am now running a script to count the occurrences of the type of each objects - that should make clear how many commits it sees, for example, and from there the trees are derived. Avoid full diffs by doing a path-based lookup first is exactly what Git does, even though Git has integrated that into the diff itself.

And here is the result:

   1 blob
   2 commit
20861 tree

Erm, probably I am missing commits, let's retry.

@Byron
Copy link
Member

Byron commented Dec 25, 2024

And I have another reason for the performance: Git can get the tree of a commit from the commit-graph itself. This is where we have to read each commit even though the traversal might get the data from the commit-graph.

Now that I have deleted the commit-graph, it also reads all commits like it should.

Now there are 35k unique objects:

❯ wc -l ./objects.list
   35664 ./objects.list

So I'd guess it's 15k commits overall that are part of the traversal. Quite some less compared to the 58k that we have to go through.

Once we can do what Git can, I think we will be just as fast, at least as long bloom filters aren't playing a role :D.

And here are the results:

❯ xargs -n1 git cat-file -t < objects.list | sort | uniq -c
   1 blob
14802 commit
20861 tree

Ok, so it's clear that it also very efficiently only reads the one blob that adds that file 15k commits ago. Very good, and something that we'd have to match to be competitive. Probably that's the top priority before implementing any additional features.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Congratulations! This is great work, and I really love how the core of the algorithm is basically perfect. All I could do here was to refactor what surrounds it, but the key to blaming is well-tested and well understood, implemented in a way that doesn't even show up in the profiler.

Thanks again for all your time - I sure hope there is more so gix blame can get as fast as Git, and faster :).

@Byron Byron enabled auto-merge December 25, 2024 20:37
@Byron Byron merged commit 6ed9976 into GitoxideLabs:main Dec 25, 2024
20 checks passed
@cruessler
Copy link
Contributor Author

Congratulations! This is great work, and I really love how the core of the algorithm is basically perfect. All I could do here was to refactor what surrounds it, but the key to blaming is well-tested and well understood, implemented in a way that doesn't even show up in the profiler.

Thanks again for all your time - I sure hope there is more so gix blame can get as fast as Git, and faster :).

That is amazing! I’m really happy it turned out this way! Thanks a lot for the feedback, looking forward to my next PR! 😀

@izuzak
Copy link

izuzak commented Dec 25, 2024

Woohoo! 🎉 I've been following this work from the sidelines and excited to see it land. Great collaboration @cruessler and @Byron! ✨

@Byron
Copy link
Member

Byron commented Dec 27, 2024

Session Notes

  • gix-commitgraph already has ` access to the root-tree-id
  • this method in gix-traverse

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants