Extreme memory usage when running it on the linux kernel repo with cliff.toml from this project #1

Closed
Byron opened this issue Aug 12, 2021 · 6 comments

Byron commented Aug 12, 2021

Describe the bug
When running it on https://github.com/torvalds/linux with the cliff.toml from this repository, the git-cliff process will take a lot of time and consume more and more memory. I had to stop it at 12GB.

To Reproduce
Steps to reproduce the behavior:

git clone https://github.com/torvalds/linux
cp cliff.toml ./linux/
cd linux
git cliff

Expected behavior
A log is produced in reasonable time.

System (please complete the following information):

Ran f1b495d on macOS on an M1 machine with 8GB of RAM

Byron added the bug (Something isn't working) label on Aug 12, 2021
orhun (Owner) commented Aug 13, 2021

Thanks for reporting this.

It turns out this high memory usage happens at the following line:

let commits = repository.commits(commit_range)?;

Which calls this function:

pub fn commits(&self, range: Option<String>) -> Result<Vec<Commit>> {
    let mut revwalk = self.inner.revwalk()?;
    revwalk.set_sorting(Sort::TIME | Sort::TOPOLOGICAL)?;
    if let Some(range) = range {
        revwalk.push_range(&range)?;
    } else {
        revwalk.push_head()?;
    }
    Ok(revwalk
        .filter_map(|id| id.ok())
        .filter_map(|id| self.inner.find_commit(id).ok())
        .collect())
}

In conclusion I'd say this is most likely caused by git2.

So this issue basically boils down to:

use git2::{Commit, Repository, Sort};
use std::env;

fn main() {
    let repo_path = env::var("LINUX_KERNEL_REPO").expect("repo path is not specified");
    let repo = Repository::open(repo_path).expect("cannot open repo");
    let mut revwalk = repo.revwalk().unwrap();
    revwalk.set_sorting(Sort::TIME | Sort::TOPOLOGICAL).unwrap();
    revwalk.push_head().unwrap();
    let commits: Vec<Commit> = revwalk
        .filter_map(|id| id.ok())
        .filter_map(|id| repo.find_commit(id).ok())
        .collect();
    println!("{}", commits.len());
}

To reproduce:

cargo new --bin repro && cd repro/
# add `git2 = "0.13.21"` to [dependencies] in Cargo.toml
# save the code above as src/main.rs
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux
LINUX_KERNEL_REPO="$(pwd)/linux" cargo run

I think you should also report this to git2; there is not much I can do here.

Byron (Author) commented Aug 13, 2021

Thanks for investigating this.

It's interesting that running the above I see this:

➜  git-cliff-core git:(main) LINUX_KERNEL_REPO=~/dev/github.com/torvalds/linux/.git /usr/bin/time -lp cargo run --release --example reproduce
    Finished release [optimized] target(s) in 0.09s
     Running `/Users/byron/dev/github.com/orhun/git-cliff/target/release/examples/reproduce`
1015172
real 18.35
user 17.55
sys 0.61
          2273345536  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              199192  page reclaims
                 645  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   1  signals received
                 589  voluntary context switches
                1596  involuntary context switches
        176305556133  instructions retired
         57509393673  cycles elapsed
          1513012480  peak memory footprint

Maybe the real memory explosion happens elsewhere when processing more than a million commits.
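
To put the `time -lp` figures above in more familiar units, here is a minimal sketch that converts them to GiB. It assumes that macOS reports "maximum resident set size" and "peak memory footprint" in bytes; the helper name `bytes_to_gib` is made up for this illustration and is not part of git-cliff or git2.

```rust
/// Convert a byte count to GiB (1 GiB = 1024^3 bytes).
/// Hypothetical helper, named only for this example.
fn bytes_to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    // Values copied from the `time -lp` output above (assumed to be bytes).
    let max_rss: u64 = 2_273_345_536;
    let peak_footprint: u64 = 1_513_012_480;
    // Number of commits printed by the reproduction program.
    let commits: u64 = 1_015_172;

    println!("max RSS:        {:.2} GiB", bytes_to_gib(max_rss));
    println!("peak footprint: {:.2} GiB", bytes_to_gib(peak_footprint));
    // Rough average only; it includes everything else the process holds.
    println!(
        "RSS per commit: {:.0} bytes",
        max_rss as f64 / commits as f64
    );
}
```

Under that assumption the reproduction peaks at a little over 2 GiB of resident memory, well short of the 12GB seen with the full git-cliff run, which is consistent with the growth happening somewhere after the commits are collected.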

orhun (Owner) commented Aug 13, 2021

> It's interesting that running the above I see this:

Ah, I just got a similar result, though it took longer, probably because of my lower-spec machine.

> Maybe the real memory explosion happens elsewhere when processing more than a million commits.

I'm re-investigating this issue. 👍🏼

alerque (Contributor) commented Aug 13, 2021

Sadly libgit2 is missing some significant optimizations that the git CLI tooling has. I've run into resource issues like this on much smaller repos than the Linux kernel where the CLI tooling flies right along and the equivalent calls to the library sink the ship.

orhun (Owner) commented Aug 13, 2021

I pushed f859747 and it should improve the performance dramatically. In fact, I was able to generate a changelog from the Linux kernel repository this time:

$ cargo run --release -- -r ~/gh/linux/ -c cliff.toml -o LINUX_CHANGELOG

results in:

# Changelog
All notable changes to this project will be documented in this file.

## [unreleased]

### ALSA

- Pcm: Fix mmap breakage without explicit buffer setup
- Hda/realtek: fix mute/micmute LEDs for HP ProBook 650 G8 Notebook PC

### MAINTAINERS

- Update Vineet's email address
- Fix Microchip CAN BUS Analyzer Tool entry typo
- Switch to my OMP email for Renesas Ethernet drivers

### Security

- Igmp: fix data-race in igmp_ifc_timer_expire()

[...]

Can you try it out to see if it's any better?

Byron (Author) commented Aug 14, 2021

Fantastic, the fix is probably one of the most effective one-line changes I have ever seen!

Here is the tail of my cliff run on the Linux kernel:

real 31.97
user 25.32
sys 3.68
          2934489088  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             1154355  page reclaims
                  59  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
               25376  voluntary context switches
               16458  involuntary context switches
            32094006  instructions retired
            20068389  cycles elapsed
             2786048  peak memory footprint

I think that's quite alright :).

In case you are interested in being even faster, here is another tool to estimate the hours it would take to implement the commits of a repository.

➜  linux git:(master) ✗ /usr/bin/time -lp gix tools estimate-hours
 9:49:55 Traverse commit graph done 1.0M commits in 7.55s (134.5k commits/s)
total hours: 979612.44
total 8h days: 122451.55
total commits = 1015172
total authors: 28234
total unique authors: 21359 (24.35% duplication)
 9:49:56                  find Extracted and organized data from 1015172 commits in 807.375125ms (1257373 commits/s)
real 8.45
user 8.21
sys 1.13
          1743454208  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              117714  page reclaims
               11193  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   3  voluntary context switches
                9337  involuntary context switches
         54347066220  instructions retired
         28594401548  cycles elapsed
           976183360  peak memory footprint
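
For anyone cross-checking the summary numbers in that output, here is a small sketch that recomputes them from the raw counts. The formulas are my assumption about how the figures relate (throughput as commits divided by seconds, duplication as the share of non-unique authors, workdays as hours divided by 8), not something taken from the tool's source, and the function names are hypothetical.

```rust
/// Commits processed per second of wall time.
fn commits_per_second(commits: u64, seconds: f64) -> f64 {
    commits as f64 / seconds
}

/// Share of author entries that are duplicates, in percent.
/// Assumed formula: (total - unique) / total * 100.
fn duplication_percent(total: u64, unique: u64) -> f64 {
    (total - unique) as f64 / total as f64 * 100.0
}

/// Hours expressed as 8-hour workdays.
fn workdays(hours: f64) -> f64 {
    hours / 8.0
}

fn main() {
    // Raw values copied from the output above.
    let rate = commits_per_second(1_015_172, 7.55);
    let dup = duplication_percent(28_234, 21_359);
    let days = workdays(979_612.44);

    println!("{:.1}k commits/s", rate / 1000.0);
    println!("{:.2}% duplication", dup);
    println!("{:.2} workdays", days);
}
```

With these assumed formulas the results agree with the reported 134.5k commits/s and 24.35% duplication, and with the 8h-day total to within rounding.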
