Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gix log #1643

Merged
merged 2 commits into from
Dec 21, 2024
Merged

Add gix log #1643

merged 2 commits into from
Dec 21, 2024

Conversation

cruessler
Copy link
Contributor

This PR adds gix log with functionality similar to git log --oneline. It seems to already work in the majority of cases, but there are still issues to flesh out with respect to how renames are handled. Nevertheless, I wanted to open the PR early, in order to get any feedback on the general approach you might already have at this point.

I noticed that there is quite an overlap conceptually with how commit walks are done for blames, so this might be an area to share some code at some point.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your contribution and for getting this started.

For fun, I did try it and it worked :)!

❯ cargo run --bin gix -- log src/gix.rs
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.46s
     Running `target/debug/gix log src/gix.rs`
0b5dc744 Fix compilation errors
202f3e48 Refactor: Make it easier to compile gitoxide as dynlib
4ef9a323 use a multi-call binary ('uni') to have only one build step (#987)
518159da make top-level `gix` docs available when using `cargo doc`
73372d0b fix: re-enable local-time support for all binaries
3ddbd2de feat: Make `reqwest` TLS backend configuration easy.
b1c40b03 feat: use docsrs feature in code to show what is feature-gated automatically on docs.rs
f7f136db chore: uniformize deny attributes
6da82507 Merge branch 'main' into rev-parse-delegate
faaf791c make obvious what plumbing and porcelain really are (#331)
c3d319f2 Initial commit - based on standard project template

I thought I could leave some comments despite this PR being in its early stages, hoping that's alright.

gitoxide-core/src/repository/log.rs Outdated Show resolved Hide resolved
gitoxide-core/src/repository/log.rs Outdated Show resolved Hide resolved
gitoxide-core/src/repository/log.rs Outdated Show resolved Hide resolved
gitoxide-core/src/repository/log.rs Outdated Show resolved Hide resolved
gitoxide-core/src/repository/log.rs Outdated Show resolved Hide resolved
gitoxide-core/src/repository/log.rs Outdated Show resolved Hide resolved
src/plumbing/options/mod.rs Outdated Show resolved Hide resolved
src/plumbing/options/mod.rs Outdated Show resolved Hide resolved
@cruessler
Copy link
Contributor Author

I discovered a behavior in bisect_entry that I found unexpected, but I also might have misunderstood the function. Calling bisect_entry(".github/dependabot.yml", false) inside the gitoxide repository returned an entry that had filename: ".github", so it seemed to be referencing the directory instead of the file. You can try yourself by adding dbg!(&info, entry); before or after calls to write_info.

Apart from that, I have incorporated your suggestions, so thanks for all of them!

@cruessler
Copy link
Contributor Author

This is the script I use to compare the output of gix log and git log --oneline: https://gist.github.com/cruessler/12ed15763fad3754364eaa0659a00ba2.

@Byron
Copy link
Member

Byron commented Oct 26, 2024

I discovered a behavior in bisect_entry that I found unexpected, but I also might have misunderstood the function.

Actually, I noticed the misuse too, but when I tried it, it (probably accidentally) worked so I thought that maybe you are just exploiting some property of it very cleverly.

What's happening here is that bisect_entry works with names only, not paths. In order to check if a path exists in a tree, recursively, lookup_entry() must be used. And this method will be much more expensive as it has to drill down into the object tree, once for each commit. It's an ouch for performance, but I also see no other way.

@cruessler
Copy link
Contributor Author

Quick update: I hope this PR is mostly done. Comparing the output of gix log and git log --oneline revealed a few cases, though, where gix log show more commits that touched a given file than git log --oneline. I don’t have a good theory yet on why this is the case, but once I have a couple of hours uninterrupted at my hand I’m going to have a closer look.

@Byron
Copy link
Member

Byron commented Nov 1, 2024

Thanks for the update!

It's interesting that gix log shows more commits, instead of less, as I don't think it performs rename tracking here. In theory, one would have to run a real diff on the trees to notice renames, while constraining the diff to only look at interesting paths (via pathspec). I think I have seen such capabilities in Git, it's quite astonishing, and very well optimized.

@cruessler
Copy link
Contributor Author

@Byron Is there any code in gitoxide that I can look at for inspiration when it comes to rename tracking? Otherwise, I would peek into my local git clone. :-)

Also, I assume it makes little sense to land this feature without rename tracking. What do you think?

And another also: since there seem to be more cases to consider than I initially thought, I would like to add a couple of tests. Would this require moving the code I’ve written so far elsewhere?

@Byron
Copy link
Member

Byron commented Nov 5, 2024

Thanks for asking, happy to help.

In order to get rename-tracking one would have to use diff_tree_to_tree and diff entire trees. It's something I thought would be used in the blame implementation already, but probably it's optional there while it's OK for the blame to stop at the rename-boundary (i.e. when a file gets added into existence).

And another also: since there seem to be more cases to consider than I initially thought, I would like to add a couple of tests. Would this require moving the code I’ve written so far elsewhere?

This is a harder question for me. The only way to test the CLI is just journey-tests, but it's something I stopped using as I stopped considering the CLI a deliverable entirely. Declaring it a dev-tool means it works if it works for the developer in that moment, and there is nothing to pin down.

In order to implement this correctly, one would have to add some sort of pathspec filter to the existing commit-walk via gix::Id::ancestors() or gix::Repository::rev_walk(). git2 doesn't seem to have this feature or I couldn't find it, so only Git knows how to implement that so it's fast. The naieve implementation would be something like "for each commit that is about to be returned, get a diff with the first (?) parent to see if the pathspec matches any of the changed paths. It's quite a huge feature on its own to make it fast, but inherently the same problem that blame has.

Probably something to exhange more thoughts on, not knowing git blame nor git log probably makes me throw up a lot of questions on my own rather than answering yours 😅.

@cruessler
Copy link
Contributor Author

cruessler commented Nov 8, 2024

Thanks for the hint! I’ve now done a bit of research and can provide another quick update: git log --oneline does not follow renames by default. See e. g. the output of git log --oneline --patch .github/dependabot.yml which has the file added, removed, then re-added and changed.

Also, git does something that is called “History simplification” in the docs. This a process by which some commits are potentially removed from the output of git log, depending on the flags given. I think it makes sense to try and replicate what git does by default, and am going to try and implement that as a next step.

At this point, I have learned that gix blame still needs rename-tracking, but that it also seems to use a slightly different algorithm for its rev-walk, so I don’t have to wait for the gix log-PR to be complete before finishing the gix blame one.

Also, your comment was really helpful as it made me dig deeper. So thanks for that! 😄

@Byron
Copy link
Member

Byron commented Nov 9, 2024

Thanks a lot for sharing - I had no clue!

But wouldn't that make gix log useless at least when it was supposed to support debugging/understanding gix blame? In my mind, history simplification or not, it was always just a rev-walk with a certain display of the commits it sees. (Of course, turning that into a graph for displaying lanes is something else once again…).

So my recommendation, assuming the assessment is correct, would certainly be to drop gix log and instead put the focus on rename-tracking + quality improvements of gix blame.
Lastly, of course if gix log is fun, by all means, go for it :).

PS: I will have to keep working on gix merge for a while, but when that is done I plan to take time to review gix blame - I expect it to be very time-consuming as my goal always is to understand it to the point where I could have written it, and to scrutinise every line so I can maintain it properly. So I hope it all happens this year :).

PS2: When rename tracking is done, it's going to be even more expensive and I was thinking that one could even record all the paths that are touched in memory, similar to ein tool query, but in memory. With that one would do the hard work once and the blame could then be done much more swiftly as it degenerates to : here is a list of blobs that are changes to the file we care about, along with the commits they were changed in, so go and do your thing. And if that 'here is the commit-blob associations' would be streamed into the algorithm, it could eventually complete its work whlie the graph traversal is going. This would have the benefit that it can trivially be split into 1 + N threads, where 1 is doing the traversal + tree-diffing, the other N are doing blames on paths of interest. And if one of these N threads would record the data, then future blames would basically be instant as the set of blob/commit associations is already known.

@Byron Byron marked this pull request as ready for review December 21, 2024 19:13
cruessler and others added 2 commits December 21, 2024 20:14
It's primarily meant to better understand `gix blame`.
For now, let's ignore file-based logs as it would take a moment to make it fast enough.
Git can diff with pathspecs, which is probably something that would be needed to be fast here.
In any case, more research would be needed to be competitive in performance.

Maybe one day this will be re-added even in its current form to help with `gix blame` debugging,
but I'd hope that there will be better ways to validate it.
@Byron
Copy link
Member

Byron commented Dec 21, 2024

Thanks a lot for getting this started!

In order to be able to merge it, I had to remove the pathspec based version as it isn't really doing what I think it would have to do to match Git behaviour - it's probably not easy to implement but will be great to have one day.

@Byron Byron merged commit 29cb775 into GitoxideLabs:main Dec 21, 2024
20 checks passed
@cruessler
Copy link
Contributor Author

Am I right in assuming that single file history would also greatly benefit from a Bloom filter? This was one of the reasons I started working on gix log: learning how history traversal works. Now I’d also like to use it as an opportunity for getting to know how to make it fast unless gix blame would be a better testing ground. What do you think?

@Byron
Copy link
Member

Byron commented Dec 24, 2024

I took another look and have a strong feeling that blame is doing its very own, manual traversal. They have a prio-queue that keeps commits sorted. The normal commit traversal in gix-traverse also has something similar, but it might be that the implementation in blame is slightly different.
In other words, it seems that implementing the traversal within blame will be able to cut a lot of the time. After all, more than 50% of each query is spent in the strange indegree computation of the topo-traversal, and that would go away entirely then. Shouldn't be too hard either.

And then I think one can compare both algorithms based on without commit-graph available, as that disables the bloom filter as well. In theory, they would be just as fast.

Please note that I will also be adding some statistics, just like git blame --show-stats so it's easier to learn more where time is spent. Also I will add some spans so gix --trace --verbose will output interesting performance information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants