[LAYOUTS] Cache LinearLayout creation #5542
Conversation
third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/TritonGPUToLLVM.cpp
Once CI is passing, it would be good to measure the compile time difference before and after for a few different kernels.
Agreed. In the kernel I was benchmarking, compilation time went from 5 min to 2:30 min. This is in line with the issue reported by Meta, where they had seen a 2x increase in compilation times in the last few months. Will benchmark others.
It was reported that Triton compilation times have increased heavily lately. The cause is that we very often create the associated LinearLayout to check properties of a given layout. We do this thousands of times, and it gets very expensive. In this PR, we implement a thread-safe cache for LinearLayouts. We clear this cache after we are done with the TTGIR -> LLVM conversion.

In the future, we will make `DistributedEncoding` inherit from `LinearLayoutEncoding`, which will mean that `DistributedEncoding`s always have access to their associated LinearLayout. Even in that scenario, I still think caching will be useful, as there is no true 1-to-1 correspondence between `DistributedEncoding`s and `LinearLayout`s due to broadcasting, where we tile a layout along the tensor or make it smaller. As such, this cache may also be useful in the future.
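For context, the pieces quoted in the review below fit together roughly as follows. This is a minimal sketch, not the PR's exact code: it assumes the `CacheKey` and `LinearLayout` types from the PR, assumes `CacheKey` has a `std::hash` specialization, and omits the cache-clearing hook mentioned above.

```cpp
#include "llvm/Support/RWMutex.h"
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <unordered_map>

class LinearLayoutCache {
public:
  // Lookup takes only a shared (reader) lock, so concurrent lookups
  // from parallel compilation threads do not serialize.
  std::optional<LinearLayout> get(const CacheKey &key) {
    std::shared_lock lock(mutex);
    auto it = cache.find(key);
    if (it != cache.end())
      return it->second;
    return std::nullopt;
  }

  // Insertion takes the exclusive (writer) lock. Populating the cache
  // is left to the caller, keeping the cache independent of
  // toLinearLayout (see the discussion below).
  void set(CacheKey key, LinearLayout layout) {
    std::unique_lock lock(mutex);
    cache.emplace(std::move(key), std::move(layout));
  }

private:
  std::unordered_map<CacheKey, LinearLayout> cache;
  llvm::sys::SmartRWMutex<true> mutex;
};
```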
In the internal test, it went from 9:11 min to 4:11 min.
```cpp
std::shared_lock lock(mutex);
auto it = cache.find(key);
if (it != cache.end()) {
  return it->second;
```
Just curious: why not add the entry to the cache on an unsuccessful find, rather than leaving it up to the user of the API?
I just wanted to keep the cache reasonably general, rather than have it depend on `toLinearLayout`, to split the responsibilities.
Is it for Peter's question instead of Pawel's? :)
IMO the `get` and `set` methods are separated because third-party backends may not have defined a corresponding linear layout conversion for some of their legacy layouts.
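So the intended call pattern is roughly the following. This is a hypothetical caller-side sketch: the `getCache()` helper, the `CacheKey` construction, and the exact `toLinearLayout(layout, shape)` signature are stand-ins, not the PR's actual code.

```cpp
// Hypothetical wrapper showing why get() and set() are separate calls:
// a backend that has no linear-layout conversion for a legacy layout
// can consult the cache without ever populating it.
LinearLayout toLinearLayoutCached(mlir::Attribute layout,
                                  llvm::ArrayRef<int64_t> shape) {
  CacheKey key{layout, shape};              // stand-in key construction
  if (auto cached = getCache().get(key))    // fast path: shared lock only
    return *cached;
  LinearLayout result = toLinearLayout(layout, shape); // expensive build
  getCache().set(key, result);              // slow path: exclusive lock
  return result;
}
```

Note that two threads can race past the `get` and both compute the same layout; with `emplace`, the first insertion wins and the duplicate work is wasted but harmless.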
```cpp
private:
  std::unordered_map<CacheKey, LinearLayout> cache;
  llvm::sys::SmartRWMutex<true> mutex;
```
NIT: Since you're not actually calling `lock_shared`, this could just be a plain mutex.
```diff
- llvm::sys::SmartRWMutex<true> mutex;
+ llvm::sys::SmartMutex<true> mutex;
```
I'm calling `shared_lock` in the `get` method.
Ah didn't see that the guard variable had changed type.
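For reference, the distinction this thread settled on (a sketch, not the PR's code): `std::shared_lock` acquires the reader side of the `SmartRWMutex`, which a plain `SmartMutex` would not provide.

```cpp
#include "llvm/Support/RWMutex.h"
#include <mutex>        // std::unique_lock
#include <shared_mutex> // std::shared_lock

static llvm::sys::SmartRWMutex<true> rwMutex;

void readerPath() {
  // Calls rwMutex.lock_shared(): many readers may hold this at once.
  std::shared_lock lock(rwMutex);
  // ... cache lookup ...
}

void writerPath() {
  // Calls rwMutex.lock(): exclusive; blocks readers and other writers.
  std::unique_lock lock(rwMutex);
  // ... cache insertion ...
}
```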