Explore gix APIs, experiment with gix-blame API #1453
Conversation
It's great to see this PR and I am looking forward to witnessing it evolve over time :)!
```shell
git commit -q --allow-empty -m c5
git tag at-c5
git merge branch1 -m m1b1
echo "line 2" >> file.txt
```
I'd do the same - start extremely simple, maybe do the first rough implementation so it passes this test, and then think of some tougher cases to throw at it, validating that they still come out right.
Maybe it's worth investing in a baseline test which parses the output of `git blame` to get the expected output, which then has to be matched by the algorithm. These are typically the second stage as they are more obscure, but they make it easier to get correct expectations for a lot of possibly complex inputs.
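One shape such a baseline test could take (the function name and output shape here are illustrative, not this PR's code) is to run `git blame --porcelain` and extract only the hunk headers, each of which carries a commit id, the starting line, and the line count:

```rust
use std::ops::Range;

// Sketch: parse the hunk headers of `git blame --porcelain` output into
// (commit, range) pairs that can serve as a test baseline. In the porcelain
// format, each hunk starts with a line `<sha> <orig-line> <final-line> <num-lines>`;
// later lines of the same hunk omit `<num-lines>`, and content lines start with a tab.
// `BaselineEntry` and `parse_porcelain` are made-up names for this sketch.

#[derive(Debug, PartialEq)]
pub struct BaselineEntry {
    pub commit: String,
    /// 0-based, exclusive range of lines in the blamed file.
    pub range_in_blamed_file: Range<u32>,
}

pub fn parse_porcelain(output: &str) -> Vec<BaselineEntry> {
    let mut entries = Vec::new();
    for line in output.lines() {
        let mut fields = line.split(' ');
        let (Some(sha), Some(_orig_line), Some(final_line), Some(num_lines)) =
            (fields.next(), fields.next(), fields.next(), fields.next())
        else {
            continue; // header fields like `author`, or tab-prefixed content lines
        };
        // Hunk headers are the only 4-field lines starting with 40 hex characters.
        if sha.len() == 40 && sha.bytes().all(|b| b.is_ascii_hexdigit()) {
            if let (Ok(start), Ok(len)) = (final_line.parse::<u32>(), num_lines.parse::<u32>()) {
                // Porcelain line numbers are 1-based, so `start` is at least 1.
                entries.push(BaselineEntry {
                    commit: sha.to_owned(),
                    range_in_blamed_file: start - 1..start - 1 + len,
                });
            }
        }
    }
    entries
}
```

A test could then run the blame algorithm on the same fixture and compare its entries against the parsed baseline.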
Thanks for the input, sounds like a great plan!
@Byron I think I’ve now gotten to a point where I’ve learned enough about `gix`’s APIs. My goal now is to write a test that runs a complete blame for the sample repo that is part of this PR.
Indeed, you already dug deep into the most complex APIs `gitoxide` has to offer, and I am pretty sure it can't get much worse than that. Having it, however, helps to know what data is needed, which will help tremendously with defining the API later.
I think continuing with the test to try and complete the algorithm would be good, even though I'd probably start with refactoring the 'setup' portion of the test a little more to be able to focus on the blame algorithm.
But that's details, I think you are on the right track and this PR already does everything that's needed to do a blame.
Once the core algorithm is proven, as the most naive and simple version of course, I'd probably create a baseline of some samples with `git` itself, parse the result and use it as the expectation for the algorithm created here.
I hope that helps.
gix-blame/tests/blame.rs
Outdated
```rust
use gix_diff::blob::intern::Token;
use gix_hash::ObjectId;
use gix_odb::pack::FindExt;
```
You'd want to use the equally named `gix_object::FindExt`, it's easier to use.
Thanks, that’s very helpful already! I think I wanted to mostly make sure I’m still on the right track and not missing anything. I’ll continue working on the test, then, and worry about the public API later.

Yes, I think it's great to keep up the explorative approach.

@Byron This is a quick status update! The algorithm is now able to generate a blame for the sample repo. I extended the test to compare this blame against `git`’s.

@Byron Sooner than expected, there seems to have been quite a bit of progress! I extended the test repo so that it now also contains commits that don’t touch the file being blamed. I also added a file that contains multiline hunks. There are now two test cases that cover one file each. Also, most of the logic is now inside the `gix-blame` crate.
That's great news!
That's interesting - I just know that Git usually ignores merge-commits, but I don't know if it includes them in blames nonetheless. I never thought about how that would work, and would probably see what Git would do. It's something I do in general, that is, try to understand what Git can do and ask if this would (eventually) be possible with the implementation here.

Besides that, I definitely recommend keeping clippy happy and running it locally at least occasionally - ideally CI can stay green even during early development, and I found it worth it as it typically leads to cleaner code. Often I find myself doing a push&run, just to return one more time to quickly fix CI if it breaks so I can continue 'clean' in the morning :D.
@Byron I think I’m now at a point where I’d appreciate a brief sync or a bit of feedback on how to best continue. Overall, I’m happy with the results so far. I’ve tested the implementation on linear commit histories with up to 90 commits, and for the files that I’ve blamed, the results match `git`’s. These are some observations and remaining issues:
This is a bit of a braindump and it probably is missing some important context. I hope that it’s useful anyway, and I am happy to elaborate on anything as needed. :-)
Also, there are errors in the CI pipeline, but they don’t seem to be related to any change in this PR.
Also, I had a lot of fun working on this PR, so thanks for giving me the space! :-)
I am glad to hear that, and love the progress you made. This is the PoC already, proving that it generally works - awesome!
@Byron I think I’m now at a point where I’d appreciate a brief sync or a bit of feedback on how to best continue. Overall, I’m happy with the results so far. I’ve tested the implementation on linear commit histories with up to 90 commits, and for the files that I’ve blamed, the results match `git`’s results.
I could also imagine working towards getting a bare-bones version of `gix blame` quickly, so that the API can already be evolved to make the non-interactive case usable, while also opening it up for more manual and performance tests.
- Merges still need to be taken into account, in the sense that while `git` indeed seems to ignore them, my implementation currently only follows one parent instead of all parents.
As `gix-blame` is on a plumbing level, it wouldn't have to care about this at all. However, upper layers will make decisions about that, and these typically align with Git when feasible.
- There seem to be differences between `imara-diff` and `git diff`. I want to do more research on this topic as it directly affects the result of a blame.
Once you have found one and can reproduce it with a simple diff, we would make it reproducible and work on fixing `imara-diff`. Please note that higher levels also configure the diff algorithm correctly, so `git blame` might be different if it picks up a different value for `diff.algorithm`. But it's very possible that the issue is legitimate, and it will be an ultra-high-priority fix for me.
- Running the current implementation on even slightly larger repositories (I ran some tests inside `gitoxide`) is slow. I suspect this is due to the part where two trees are diffed to find out whether the file under blame was changed. I might also just have missed an API that does exactly what I want.
Tree-diffs need object-caches set up to be competitive, and also need the best possible zlib backend (`zlib-ng`). Both are easy to miss.
- Some of the tests are quite verbose; there are probably opportunities for some de-duplication using macros.
I did some of that, and it looks less noisy now. To me it seems fine, particularly because there only appear to be two hard `todo!()`s left, which I'd think you can lure out with two more of these tests. So it's unlikely these more verbose tests keep growing forever.
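For reference, a `mktest!`-style macro for such de-duplication could look roughly like this. The helper function and its return value are stand-ins for this sketch; the real tests compare full blame results against fixture repositories:

```rust
// Sketch of a `mktest!`-style macro that stamps out one test function per
// fixture. `run_blame_on_fixture` is a hypothetical helper standing in for
// running the blame algorithm against a fixture and counting its entries.

macro_rules! mktest {
    ($name:ident, $fixture:literal, $expected_entries:literal) => {
        #[test]
        fn $name() {
            let entries = run_blame_on_fixture($fixture);
            assert_eq!(entries, $expected_entries);
        }
    };
}

// Placeholder logic so the sketch is self-contained: the real helper would
// set up the fixture repo and run the blame.
fn run_blame_on_fixture(fixture: &str) -> usize {
    fixture.split('-').count()
}

mktest!(coalesce_adjacent_hunks, "coalesce-adjacent-hunks", 3);
```

Each invocation expands to a complete `#[test]` function, so adding a case is a one-liner.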
- Most of the recent fixes didn’t add test cases to `make_blame_repo.sh`; I instead added very specific test cases like `process_change_works_added_hunk` that only test a single iteration of the basic building block.
I love that.
- The implementation doesn’t follow renames.
It's rename tracking that is set up by the higher level. But probably blame also needs to know about it, and eventually that should probably also be implemented.
However, it can wait until the API is a bit more fleshed out.
- I even found a difference between `git` and `libgit2` when it comes to coalescing blame entries before returning them to the caller. I’ve documented that fact in a comment.
That's a great catch and the level of diligence that `gitoxide` requires. Of course, ideally it's not applied as a post-process, but we might get there in stage three: make it work fast. Right now it's stage one: make it work.
Next Steps
Whenever you feel ready, I recommend addressing the TODOs I left in `Cargo.toml` and adjusting the API to not do any of the initialization work itself.
From there, my next step would be to make it usable in `gix blame` (even without wiring it up in `gix` (the crate) for now), to be able to validate the API for non-interactive cases. What's most important, and something that's not tested at all right now, is that having a `gix blame` will make clear how it shall be possible for callers to access actual lines of code.

To do that correctly, they'd need access to the original interner and the tokens behind the ranges that are used in the blame. It's really important to keep that in sync, and to avoid making the impression that the caller should re-split blobs into lines.
I hope that helps to continue, please feel free to ask follow-up questions.
PS: I am very happy with the level of independence of your work - this review and refactor took me a little more than 2 hours, which is on the low-end for the stage the feature is in (i.e. it's seemingly nearly working correctly).
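To illustrate the interner point with a toy model (simplified home-grown types, not imara-diff's actual API): if blame results carry token ranges, the single interner that produced those tokens can resolve them back to line text, so callers never need to re-split blobs into lines themselves:

```rust
use std::collections::HashMap;

// Toy interner: identical lines share one token (a small integer), and the
// interner retains the line text so tokens can be resolved later. The real
// setup would use imara-diff's interner; this sketch only shows the principle.

#[derive(Default)]
struct Interner {
    lookup: HashMap<String, u32>,
    lines: Vec<String>, // token -> line text
}

impl Interner {
    fn intern(&mut self, line: &str) -> u32 {
        if let Some(&token) = self.lookup.get(line) {
            return token;
        }
        let token = self.lines.len() as u32;
        self.lines.push(line.to_owned());
        self.lookup.insert(line.to_owned(), token);
        token
    }

    fn resolve(&self, token: u32) -> &str {
        &self.lines[token as usize]
    }
}

// Tokenize a blob into the interner; the returned tokens are what blame
// ranges would index into.
fn tokenize(interner: &mut Interner, blob: &str) -> Vec<u32> {
    blob.lines().map(|line| interner.intern(line)).collect()
}
```

A caller holding a blame range `start..end` into the token list can then map each token back through `resolve` to print the blamed lines, with no second line-splitting pass.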
Thanks for the review and the feedback, that’s very helpful! I’ll have a look and continue working, or ask follow-up questions. :-)
It's going to be good, thanks again for all your work on this! My greatest hope is that you keep having fun :).
Maybe re-using the tokens isn't feasible after all, as each diff creates a new set of them. The interner can keep all of the tokens, probably a good thing also for performance, but I wonder if it's feasible to keep all tokens around. In the end, each one is only a `u32`.

Lastly, my feeling here is that it would be best if tokens would be made available on a per-blob basis so that a single interner can keep all of these tokens for later access. But the above are just thoughts.
@Byron I think I was able to identify a case where `imara-diff` and `git diff` produce different results. Before:
After:
The difference is with respect to the newlines surrounding the second hunk.
That's great! Could this be reproduced with a baseline test, maybe also producing versions with Myers and Histogram respectively? Before reeling in Pascal, who can most certainly enlighten us, I'd love to have a failing test along with all the facts that we can collect. Thanks again for your help with this.
I just added two failing test cases.
I’ve also moved most of the setup out of the tests.
Great, thanks, I will take a closer look later.
Yes, that's exactly it. I thought traversal could be reduced to an iterator yielding object-ids, but it's nothing I validated and I am probably missing an important detail.
The issue might be that we need to take an additional argument; the dependency on it might be the detail I was missing.
I took a closer look and have factored out the remaining bits that can be removed in a straightforward fashion. Maybe that's mostly it already. The test-case is wonderful by the way, as it should be straightforward to narrow down the cause of the issue, or at least get to a point where Pascal can be invited.
I think cry for help it is, @pascalkuthe 😅. After applying this patch:

```diff
diff --git a/gix-blame/src/lib.rs b/gix-blame/src/lib.rs
index a737965a6..c4ec75cfa 100644
--- a/gix-blame/src/lib.rs
+++ b/gix-blame/src/lib.rs
@@ -815,6 +815,13 @@ pub fn blame_file<E>(
     let outcome = resource_cache.prepare_diff().unwrap();
     let input = outcome.interned_input();
+    for token in &input.before {
+        eprintln!("{:?}", input.interner[*token].as_bstr());
+    }
+    dbg!("after");
+    for token in &input.after {
+        eprintln!("{:?}", input.interner[*token].as_bstr());
+    }
     let number_of_lines_in_destination = input.after.len();
     let change_recorder = ChangeRecorder::new(number_of_lines_in_destination.try_into().unwrap());
     let changes = gix_diff::blob::diff(gix_diff::blob::Algorithm::Histogram, &input, change_recorder);
diff --git a/gix-blame/tests/blame.rs b/gix-blame/tests/blame.rs
index 437e3de59..039005e9f 100644
--- a/gix-blame/tests/blame.rs
+++ b/gix-blame/tests/blame.rs
@@ -219,7 +219,7 @@ mktest!(same_line_changed_twice, "same-line-changed-twice", 2);
 mktest!(coalesce_adjacent_hunks, "coalesce-adjacent-hunks", 1);
 #[test]
-#[ignore = "TBD: figure out what the problem is"]
+// #[ignore = "TBD: figure out what the problem is"]
 // As of 2024-09-24, these tests are expected to fail.
 //
 // Context: https://github.com/Byron/gitoxide/pull/1453#issuecomment-2371013904
```

it shows the following:
From this I see that the tokens are exactly what's expected as the basis for a diff.

The baseline for the expected result is the following. It's well balanced and also quite intuitive. However, our result seems off:

```
Diff < left / right > :
[
    BlameEntry {
        range_in_blamed_file: 0..2,
        range_in_original_file: 0..2,
        commit_id: Sha1(0b56209a29a36af25c845a93f28ce35c1ac3de09),
    },
    BlameEntry {
        range_in_blamed_file: 2..4,
        range_in_original_file: 2..4,
        commit_id: Sha1(6d23453055b4a855092a0be7b3e4741414c1e547),
    },
    BlameEntry {
<       range_in_blamed_file: 4..5,
<       range_in_original_file: 2..3,
>       range_in_blamed_file: 4..6,
>       range_in_original_file: 2..4,
        commit_id: Sha1(0b56209a29a36af25c845a93f28ce35c1ac3de09),
    },
    BlameEntry {
<       range_in_blamed_file: 5..7,
<       range_in_original_file: 5..7,
>       range_in_blamed_file: 6..8,
>       range_in_original_file: 6..8,
        commit_id: Sha1(6d23453055b4a855092a0be7b3e4741414c1e547),
    },
    BlameEntry {
<       range_in_blamed_file: 7..9,
<       range_in_original_file: 3..5,
>       range_in_blamed_file: 8..9,
>       range_in_original_file: 4..5,
        commit_id: Sha1(0b56209a29a36af25c845a93f28ce35c1ac3de09),
    },
]
```

Just to be sure that it's not the hunk-recorder, which is the actual foundation for the blame, I also printed the hunks that it finds.
They seem to fit. I'd be happy to investigate this further, but wanted to ask you, @pascalkuthe, if you have any ideas to guide this. Maybe it's super-obvious to you what's going on. Something I can imagine is happening here is that blame does something special beyond the normal diffing; I myself haven't had a look at it yet. Thanks for sharing your thoughts.
At first glance this looks like the slider problem (https://github.com/mhagger/diff-slider-tools/blob/master/README.md). Basically the algorithm can say that your diff is:
or it can say that the diff is:
Both are correct (minimal) diffs, so there is nothing wrong with the diff algorithm or its results; it's just that there are multiple possible choices a diff can make. Git has some postprocessing heuristics to make a choice, but from an objective point of view one choice isn't wrong - some choices just feel more intuitive to humans. So it's not a bug in the diff algorithm, just a missing feature. Git resolves those ambiguities by a heuristic. Basic summary:
I have an implementation of the first two, coupled with some (necessary) API changes that were aimed to be the first breaking imara-diff change. I started porting helix to the new API but got a bit stuck timewise and didn't get around to finishing that. The third heuristic is very involved implementation-wise and also difficult to add from an API perspective, since it only makes sense for (line-segmented) text, while imara-diff operates on abstract tokens and allows diffing arbitrary data (segmented in arbitrary ways). I think we could aim to get the basic heuristic in; the more advanced indentation-based code I am not sure I have time to work on at the moment. For this PR I would suggest to just update the fixtures to match whatever imara-diff produces and then update them as imara-diff improves in that regard.
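The most basic of these heuristics can be sketched in a few lines. This is an illustrative stand-alone version, not imara-diff or Git code: an added (or removed) hunk `start..end` in a line sequence is ambiguous whenever the line it would give up at one end equals the line it would take on at the other, and the simplest rule shifts such hunks as far down as possible:

```rust
// Sketch of the basic "slide hunks down" rule for resolving slider ambiguity.
// `lines` is the file's line sequence; `start..end` marks an inserted hunk.
// The hunk can shift down by one position exactly when its first line equals
// the line immediately following it, because both segmentations describe the
// same minimal diff.

fn slide_down(lines: &[&str], mut start: usize, mut end: usize) -> (usize, usize) {
    while end < lines.len() && lines[start] == lines[end] {
        start += 1;
        end += 1;
    }
    (start, end)
}
```

Running this over an ambiguous hunk canonicalizes its position, which is often enough to make two diff implementations agree on simple cases; Git's full heuristic additionally scores positions by indentation and blank lines.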
Quick side note on my last commit: I’ve had to add another parameter to one of the internal functions.
Thanks so much for sharing! I also do remember now that you already shared this with me, probably in person, so it's clear that these heuristics are tuneable and there isn't really a right answer. Reading the linked material really drives that home.

Thus, this is likely going to be a place where implementations differ. Until then, we go with your suggestion and just keep these examples in our baselines and mark them accordingly. Maybe one day they fit, or one day we mark ours as preferred.

PS: I really can't wait for these heuristics to be implemented!
@Byron This is another quick status update: I’ve now refactored and simplified the implementation a bit, mostly to make the concepts used easier to discern. I’m also already quite far in my approach to being able to blame commits with more than one parent. So, from my side, the point is getting closer where I would call the first implementation done. For context, I just wanted to ask for feedback, in particular on how and when you think this should be merged. I’m mainly asking because there’s now more than one area I’d like to focus on, and I can imagine it might be easier to spread the work into PRs that are more specific to certain aspects, instead of keeping everything in this one. I can still follow it, but it’s already very layered in the sense that there are quite a few discussions and threads overlapping. :-) What are your thoughts? I’m also fine with continuing in this PR, so whatever works best for you!
Thanks for the update :). Generally I am happy to follow your preference when it comes to merging, so what follows is just my preference, which isn't the deciding factor here. What I'd do before merging is to 'quickly' wire it up into `gix blame`. And since you probably have some more areas of work already in your mind, I think it would be good to collect them in a task list or tracking issue, so we both can keep an overview. It's probably best if you'd create a tracking issue for that so you can edit it primarily. In summary, a tracking issue we should have, and everything else is up to you then - you can ask me to review and merge, and that's what I will do. I hope that helps.
Thanks, that helps a lot! I think I’ll give wiring it up into `gix blame` a try, then.
Quick question: did I put the code for wiring up `gix blame` in the right place?
They are used with inverse meaning compared to the current documentation. It's easier to adjust the variable names.
And it's nearly there; all that seems left is a more standardized handling of the input path in the CLI. When benchmarking/testing more, I came across this gem, as it appears to be possible to try to convert an unblamed hunk into a blame entry without the suspect being present there. This occurred in the Git repository itself.
When 'looking for trouble' like this, such cases surface quickly. Your help would be appreciated, as I didn't really get to grasp how suspects really work. But I hope to get there.
I just 'fixed' it by assuming it's naturally possible that there is no entry for a suspect, so there is no need to fail. Now it runs through and produces the correct result.
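In toy form (simplified stand-ins for this PR's types, not its actual code), the fix amounts to treating the suspect lookup as optional rather than unwrapping it:

```rust
use std::collections::BTreeMap;
use std::ops::Range;

// Toy illustration: an unblamed hunk keeps one range per suspect. After blame
// was passed to only one of several parents, a given suspect may legitimately
// be absent, so a missing entry is not an error - the hunk's blame simply
// lives with another suspect. `CommitId` and the types are made up here.

type CommitId = u32;

struct UnblamedHunk {
    suspects: BTreeMap<CommitId, Range<u32>>,
}

fn range_for_suspect(hunk: &UnblamedHunk, suspect: CommitId) -> Option<&Range<u32>> {
    // Returning `None` lets the caller keep looking instead of panicking.
    hunk.suspects.get(&suspect)
}
```

The earlier `.expect()` turned this legitimate absence into a panic; with `Option`, the caller skips the hunk and continues searching the remaining suspects.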
I am still solidifying the implementation, and running it on every file in the Git repository is a good way to run into the last remaining `.expect()`s. It's also a great motivation to try improving performance, as it takes a while otherwise. Just to share early, I added a counter to see if it's worth it to cache information, and by the looks of it, this really is the case.
I just need to figure out if it's trees or commits that should be cached additionally, hoping that it is mostly trees. Edit: it definitely runs into every single assertion :D - the real world knows no mercy.
And a pretty good performance example (without commit-graph/bloom filter) where Git is 7.5x faster:
However, I think the problem is that we have tremendous amounts of uncached work to find the one commit that added this file 55k commits ago. Git clearly manages to not redo much work.
Based on this worst case…
…I kept tinkering and got it to this point.
I also know for sure that it takes 2.45s to decode 78494 trees for the first time and there is no way around that.
The above is a version that aggressively cached each tree - it was slower overall (despite being unoptimized), and the cache isn't much faster than what's already implemented; the slow part is accessing the 58k root trees, along with their unique subtrees, to get to the file in question. The ever-growing tree-cache was about 3GB in size as well (with everything, unoptimized), so absolutely not worth it. Once again I am at a point where pack-access is slow, and object-caching isn't effective as it needs new objects. There is pack-entry caching, which cuts delta-chains short. By default, that cache isn't very effective, but it is so fast that it's still worth it in all cases.
It's possible to make the cache much larger…
…which brings the time down to 1.73s from 3.04s, so it's pretty good. However, it's still a far cry from Git, which finishes this in 0.43s, 4x faster (!!). With the improved pack-cache, most of the time is now spent in a single function. And within that function, about 60% goes into tree decoding, and 20% each into finding the commit and then the root tree of the commit.

Finally, let's compare this with Git: it even does full diffs all the time, but manages to skip over most. This is the moment where I realise that it picked up the commit-graph and bloom filters… . Without the commit-graph, it looks like this:

So… never mind, it's still ludicrously fast and I don't know how this is even possible? It has to access all these commits and trees and can't skip, especially without bloom filters. But it's still this fast?
gix-blame/src/file/function.rs
Outdated
```rust
/// Convert each of the unblamed hunk in `hunks_to_blame` into a [`BlameEntry`], consuming them in the process.
/// `suspect` is expected to be present in the suspect-map in each [`UnblamedHunk`].
```
With this change it appears the comment needs to be updated to reflect the fact that we don’t expect `suspect` to be in `suspects` any longer.
Thanks, good catch - it will be fixed next time I push.
Please take a look at the updated comment above, it's my journey of trying to make it faster. But Git still has pixie dust that I simply can't replicate. Somehow it avoids a lot of work probably while defying physics.
Well, sounds like it’s time to turn to my annotated version of `git blame` again. Maybe, at some point, it will tell its secrets. :-)
It would be really interesting if there were statistics about the object database. How many objects did it actually read? How long were typical pack delta chains? The latter information is actually available in `gitoxide`, I just never wired it up.

Either way, experiments with the pack-delta cache showed that the cache efficiency isn't getting better after a little more than 100MB, showing that everything that could be cached and reused was cached. One can go up to a 2.1GB pack delta cache, and one will still get the same amount of hits as before. So for this workload, one will have to read 116558 pack entries, and there is apparently no way around it.
`update_tree_entry` in Git takes 115ms; it's parsing tree entries. The respective version in `gitoxide` runs for 554ms. The key here seems to be `skip_uninteresting`, which consumes most of the time. I think a key is that it knows about the sort order, so when it's past a particular entry it knows that a match cannot be found anymore. `gitoxide` always has to search all entries in the tree, so it spends a lot more time decoding everything in a tree.
The physics-defying thing still is that apparently it doesn't spend much time in the object database at all, something that costs `gitoxide` tremendous amounts of time.

Edit: `read_object_file_extended` is what gets data from the object database. It just needs 158ms for everything! How is that possible? It's supposed to have read more than 100k objects. How many does it actually read?
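The early-exit trick that sorted tree entries enable can be shown with a toy lookup. This is neither Git's nor gitoxide's code, and it ignores Git's special sort order where directory names compare as if they had a trailing `/`:

```rust
// Sketch of the idea behind Git's `skip_uninteresting`: tree entries arrive
// sorted by name, so a linear scan can stop as soon as it has walked past
// the position where the target would have to be, instead of decoding the
// whole tree. Entries are modeled as (name, object-id) pairs for the sketch.

fn find_entry(sorted_entries: &[(&str, u32)], target: &str) -> Option<u32> {
    for &(name, oid) in sorted_entries {
        if name == target {
            return Some(oid);
        }
        if name > target {
            // Sorted input: the target can no longer appear. Stopping here is
            // what saves decoding the remaining entries of the tree.
            return None;
        }
    }
    None
}
```

On deep directory hierarchies this early exit is applied at every level, which is presumably part of why Git's tree-entry parsing stays so cheap.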
From what I can tell, `git` seems to lazily decode tree entries (and thus maybe trees?). `skip_uninteresting` sequentially calls `update_tree_entry`, which calls `decode_tree_entry` under the hood, until a blob that matches the given path (and file mode, as far as I can tell) is found. `skip_uninteresting` seems to go through the available entries roughly in alphabetical order. This is what I observed running a `printf`-instrumented version of `git blame STABILITY.md`. Now I’m going to have a look at a larger repository. :-)
And `decode_tree_entry` appears to do almost nothing unless I’m missing something: https://github.com/git/git/blob/6258f68c3c1092c901337895c864073dcdea9213/tree-walk.c#L16-L47
I also have some additional results, which I think explain the pixie-dust with science🧬 😁.
It's notable that this is not necessarily the total amount of objects that are fetched, but it's the objects that have to be read once from a pack.
…e-entry This can apparently happen, and now we handle this case and keep looking for the remaining blame entries. Also, exit early when no work is left to be done.
From what I remember, that can happen when we pass blame to a single parent only out of more than one.
Also, set a fixed and higher pack-cache to double typical pack-decode performance in mid-sized repositories. Additionally, normalize the input path.
I think I have got it: the proof that Git is just doing significantly less, magically.
It reads only 28k objects in total, whereas we read a whopping 78553 unique objects and fetch a total of 227k objects! This must be largely due to the commit-graph traversal, which definitely has to be done internally to get the same performance. That's 3x more objects fetched, which costs time to do but also costs extra time to parse and handle. So I am sure that if we would read no more objects than Git, we'd be as fast or faster.
One reason we do extra work is the case where a commit has multiple parents and we pass blame to just one of them. Even if we didn’t have to follow the other ancestors, this is a fact that the traversal cannot know.
Now it's clear that Git asks for 41813 objects total, and 20864 of these are unique.
I am now running a script to count the occurrences of the type of each object - that should make clear how many commits it sees, for example, and from there the trees are derived. Avoiding full diffs by doing a path-based lookup first is exactly what Git does, even though Git has integrated that into the diff itself. And here is the result:
Erm, probably I am missing commits, let's retry.
And I have another reason for the performance: Git can get the tree of a commit from the commit-graph itself. This is where we have to read each commit, even though the traversal might get the data from the commit-graph. Now that I have deleted the commit-graph, it also reads all commits like it should. Now there are 35k unique objects:
So I'd guess it's 15k commits overall that are part of the traversal - quite a bit fewer compared to the 58k that we have to go through. Once we can do what Git can, I think we will be just as fast, at least as long as bloom filters aren't playing a role :D. And here are the results:
Ok, so it's clear that it also very efficiently reads only the one blob that adds that file 15k commits ago. Very good, and something that we'd have to match to be competitive. Probably that's the top priority before implementing any additional features.
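The skip described here boils down to a cheap per-path comparison before any full tree diff. With trees modeled as flat path-to-blob-id maps (a toy model, not an object database), the test looks like this:

```rust
use std::collections::HashMap;

// Toy model of the path-based lookup: before running a full tree diff,
// compare only the blob id recorded at the blamed path in a commit's tree
// and its parent's tree. If they are equal (or both absent), the commit
// cannot have changed the file and can be skipped entirely.
// Real trees are nested; a flat map is enough to show the idea.

type Tree = HashMap<&'static str, u32>;

fn file_changed(tree: &Tree, parent_tree: &Tree, path: &str) -> bool {
    // `None != None` is false, so a path absent from both trees counts
    // as unchanged, too.
    tree.get(path) != parent_tree.get(path)
}
```

Only commits for which this returns `true` would need the expensive blob diff; everything else is skipped with two map lookups, which is the essence of doing significantly less work per commit.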
Congratulations! This is great work, and I really love how the core of the algorithm is basically perfect. All I could do here was to refactor what surrounds it, but the key to blaming is well-tested and well understood, implemented in a way that doesn't even show up in the profiler.
Thanks again for all your time - I sure hope there is more, so `gix blame` can get as fast as Git, and faster :).
That is amazing! I’m really happy it turned out this way! Thanks a lot for the feedback, looking forward to my next PR! 😀
Woohoo! 🎉 I've been following this work from the sidelines and excited to see it land. Great collaboration @cruessler and @Byron! ✨
Session Notes
This is a draft PR that is supposed to help me get familiar with `gix`’s APIs as well as to experiment with possible APIs for `gix-blame`. Any feedback that helps me not go down the wrong path is much appreciated! :-)

Review Tasks

- try `gix blame` for fun
- `not yet implemented: passing blame to more than one parent is not implemented yet` on `README.md`
- remove `todo!()`
- replace `.unwrap()` with error handling
- `Outcome` with statistics
- define `blamed_file` and `original_file` once and for all (`gix merge-file` as well)

Performance Opportunities

- Cache tree lookups - this can cut the amount of ODB lookups in half at least, with tree-lookups being the slowest. Maybe such a cache could just be added to `gix-object` directly so everybody can use it.
- Commit to track object access