
[LAYOUTS] Cache LinearLayout creation #5542

Merged: 4 commits merged into main from caching, Jan 8, 2025
Conversation

@lezcano (Contributor) commented Jan 6, 2025

It was reported that Triton compilation times have increased heavily as of late. The cause is that we very often create the associated LinearLayout (LL) to check properties of a given Layout. We do this thousands of times, and it gets very expensive.

In this PR, we implement a thread-safe cache for LinearLayouts. We clear this cache once we are done with the TTGIR -> LLVM conversion.
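
A minimal sketch of the shape such a cache can take (illustrative only: LinearLayoutCache, CacheKey, and the method names are stand-ins, assuming CacheKey is some hashable key type; the PR's actual code may differ):

#include <mutex>          // std::unique_lock
#include <optional>
#include <shared_mutex>   // std::shared_lock
#include <unordered_map>
#include "llvm/Support/RWMutex.h"

class LinearLayoutCache {
public:
  // Reader path: a shared lock lets concurrent lookups proceed in parallel.
  std::optional<LinearLayout> get(const CacheKey &key) {
    std::shared_lock lock(mutex);
    auto it = cache.find(key);
    if (it != cache.end())
      return it->second;
    return std::nullopt;
  }

  // Writer path: an exclusive lock serializes insertions.
  void set(CacheKey key, LinearLayout layout) {
    std::unique_lock lock(mutex);
    cache.emplace(std::move(key), std::move(layout));
  }

  // Called once the TTGIR -> LLVM conversion is done.
  void clear() {
    std::unique_lock lock(mutex);
    cache.clear();
  }

private:
  std::unordered_map<CacheKey, LinearLayout> cache;
  llvm::sys::SmartRWMutex<true> mutex;
};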

In the future, we will make DistributedEncoding inherit from
LinearLayoutEncoding, which will mean that DistributedEncodings
always have access to their associated LinearLayout. Even in that
scenario, I still think caching will be worthwhile, as there is no real
1-to-1 correspondence between DistributedEncodings and LinearLayouts
due to broadcasting, where we tile a layout along the tensor or make
it smaller. As such, I think this cache may also be useful in the
future.

@lezcano requested a review from ptillet as a code owner, January 6, 2025 22:57
@lezcano requested reviews from Jokeren and peterbell10, January 6, 2025 22:57
@lezcano changed the title from "Cache LinearLayout creation" to "[LAYOUTS] Cache LinearLayout creation", Jan 6, 2025
@peterbell10 (Contributor):

Once CI is passing, it would be good to measure the compile time difference before and after for a few different kernels.

@lezcano (Author) commented Jan 6, 2025

Agreed. In the kernel I was benchmarking, compilation time went from 5 min to 2:30 min. This is in line with the issue reported by Meta, where they had seen a 2x increase in compilation times over the last few months. Will benchmark others.

@lezcano (Author) commented Jan 7, 2025

In the internal test, compilation went from 9:11 min to 4:11 min.
Running test_core.py with 32 cores only went down from 1:38 min to 1:31 min, though, which makes me believe the internal test was a particularly tricky example for the compiler.

std::shared_lock lock(mutex); // shared (reader) lock: concurrent lookups can proceed
auto it = cache.find(key);
if (it != cache.end()) {
  return it->second;
}
Contributor:
Just curious, why not add the entry to the cache on a successful find, but leave it up to the user of the API?

@lezcano (Author):
I just wanted to keep the cache reasonably general, rather than having it depend on toLinearLayout, to split the responsibilities.

Contributor:
Is it for Peter's question instead of Pawel's? :)

Contributor:
IMO the get and set methods are separated because third-party backends may not have defined a corresponding linear layout conversion for some of their legacy layouts.
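
Under that design, a call site might look like the following sketch (getOrCreate and toLinearLayout's signature here are illustrative assumptions; the point is that the cache itself never calls the conversion):

// Hypothetical call site: the caller owns the miss policy, not the cache.
std::optional<LinearLayout> getOrCreate(LinearLayoutCache &llCache,
                                        const CacheKey &key) {
  if (std::optional<LinearLayout> ll = llCache.get(key))
    return ll;
  // toLinearLayout is a stand-in for the backend's conversion; a backend
  // with no conversion for a legacy layout just returns nullopt here and
  // never populates the cache.
  std::optional<LinearLayout> ll = toLinearLayout(key);
  if (ll)
    llCache.set(key, *ll);
  return ll;
}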


private:
  std::unordered_map<CacheKey, LinearLayout> cache;
  llvm::sys::SmartRWMutex<true> mutex;
Contributor:
NIT: Since you're not actually calling lock_shared this could just be a plain mutex.

Suggested change:
- llvm::sys::SmartRWMutex<true> mutex;
+ llvm::sys::SmartMutex<true> mutex;

@lezcano (Author):
I'm calling shared_lock in the get method.

Contributor:
Ah didn't see that the guard variable had changed type.
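
For reference, the distinction the thread settles on can be sketched with the standard guards (assuming the SmartRWMutex member shown above; names are illustrative):

#include <mutex>        // std::unique_lock
#include <shared_mutex> // std::shared_lock
#include "llvm/Support/RWMutex.h"

llvm::sys::SmartRWMutex<true> rwMutex;

void readerPath() {
  // Shared guard: several get() calls may hold the lock at once.
  std::shared_lock lock(rwMutex); // lock_shared() / unlock_shared()
}

void writerPath() {
  // Exclusive guard: excludes all readers and other writers.
  std::unique_lock lock(rwMutex); // lock() / unlock()
}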

@lezcano merged commit 67f5707 into main, Jan 8, 2025 (7 checks passed)
@lezcano deleted the caching branch, January 8, 2025 13:22