Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplicate crates in crate graph #14476

Closed
Veykril opened this issue Apr 3, 2023 · 4 comments · Fixed by #14659
Closed

Deduplicate crates in crate graph #14476

Veykril opened this issue Apr 3, 2023 · 4 comments · Fixed by #14659
Labels
A-perf performance issues Broken Window Bugs / technical debt to be addressed immediately C-bug Category: bug

Comments

@Veykril
Copy link
Member

Veykril commented Apr 3, 2023

It's incredible that we aren't actually doing this already. When we load a project that contains multiple workspaces we do not deduplicate equal crates in the crategraph! That means just for the sysroot alone that it is loaded multiple times into memory and queries executed on the sysroot of one workspace are not reused for another. It wouldn't surprise me if some of the high memory usage reports we get comes from projects that have a lot of cargo workspaces in them.

Here all we do is append crates, but we are not deduplicating them. Nor are we doing so after having built the final crategraph.

/// Extends this crate graph by adding a complete disjoint second crate
/// graph and adjust the ids in the [`ProcMacroPaths`] accordingly.
///
/// The ids of the crates in the `other` graph are shifted by the return
/// amount.
pub fn extend(&mut self, other: CrateGraph, proc_macros: &mut ProcMacroPaths) -> u32 {
let start = self.arena.len() as u32;
self.arena.extend(other.arena.into_iter().map(|(id, mut data)| {
let new_id = id.shift(start);
for dep in &mut data.dependencies {
dep.crate_id = dep.crate_id.shift(start);
}
(new_id, data)
}));
*proc_macros = mem::take(proc_macros)
.into_iter()
.map(|(id, macros)| (id.shift(start), macros))
.collect();
start
}

@Veykril Veykril added A-perf performance issues C-bug Category: bug labels Apr 3, 2023
@Veykril Veykril changed the title Deduplicate crates Deduplicate crates in crate graph Apr 3, 2023
@Veykril Veykril added the Broken Window Bugs / technical debt to be addressed immediately label Apr 3, 2023
@bruno-ortiz
Copy link
Contributor

@Veykril what would be the best aproach on this? I'm thinking in two ways of doing it.

1- Something like this:

pub fn extend(&mut self, other: CrateGraph, proc_macros: &mut ProcMacroPaths) -> u32 {
        let start = self.arena.len() as u32;
        let other: Vec<_> = other
            .arena
            .into_iter()
            .filter(|(_, data)| {
                !self.arena.iter().any(|(_, cd)| cd.root_file_id == data.root_file_id)
            })
            .map(|(_, mut data)| {
                for dep in &mut data.dependencies {
                    let new_crate_id = RawIdx::from(u32::from(dep.crate_id.into_raw()) + start);
                    dep.crate_id = CrateId::from_raw(new_crate_id);
                }
                data
            })
            .collect();

        self.arena.extend(other);
        *proc_macros = mem::take(proc_macros)
            .into_iter()
            .map(|(id, macros)| {
                (CrateId::from_raw(RawIdx::from(u32::from(id.into_raw()) + start)), macros)
            })
            .collect();
        start
    }

This applies depuplication only to the extend methods, dunno if we should worry of other points of insertion.

2- Making some sort of an ArenaSet, using an IndexSet instead of a Vec, this would implie that every 'T' should implement 'Hash', and I'm not sure if this would be worse performance wise, due to the hashing.

@Veykril
Copy link
Member Author

Veykril commented Apr 11, 2023

CrateData can't implement Hash due to our usage of hashmaps inside of it, so that's not an option. Your proposed extend method doesn't work, we need to check whether the two CrateDatas are equal (which may in turns require checking that the dependencies field is the same even if they point to different crate ids (which are the same ids when deduplicated).

@bruno-ortiz
Copy link
Contributor

bruno-ortiz commented Apr 11, 2023

Isn't the root_file_id always the same if pointed to the same path? And if it points to the same path doen't mean that it points to the same crate?
Maybe we should compare the crate's name and version to? Comparing the dependencies seems overkill if 2 crates have the same path, version, and name.
Maybe I'm wrong, I'm still learning my way through RA codebase.

@Veykril
Copy link
Member Author

Veykril commented Apr 11, 2023

root_file_ids aren't unique, we could have crates pointing to the same root file with differing cfg's (this will become relevant once we duplicate crates for dev-dependency edges to resolve cycles #14475. But you are right that there are probably some things in the CrateData that we can ignore in terms of deduplication (basically we'd need to check what makes up the "key/identity" of a crate given a CrateData)

That aside the dependency crate id fix up in your code is wrong though, once we start deduplicating shifting ideas naively won't work, we'd need to keep a mapping from duplicate IDs to non-duplicate ID I believe, which in turns might make the order we process these in relevant.

bors added a commit that referenced this issue Apr 26, 2023
Deduplicate crates when extending crate graphs

This is quadratic in runtime per deduplication attempt, but I don't think that'll be a problem for the workload here. Don't be scared of the diff, the actual diff is +42 -22, the rest is tests and test data.
Fixes #14476
@bors bors closed this as completed in 797c2f1 Apr 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-perf performance issues Broken Window Bugs / technical debt to be addressed immediately C-bug Category: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants