-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Opt] let LRU mode use device clock #137
Conversation
- Remove `cur_score` from `Bucket` - Switch benchmark to LRU strategy
Documentation previewhttps://nvidia-merlin.github.io/HierarchicalKV/review/pr-137 |
/blossom-ci |
@@ -177,7 +177,7 @@ float test_one_api(const API_Select api, const size_t dim, | |||
options.dim = dim; | |||
options.max_hbm_for_vectors = nv::merlin::GB(hbm4values); | |||
options.io_by_cpu = io_by_cpu; | |||
options.evict_strategy = EvictStrategy::kCustomized; | |||
options.evict_strategy = EvictStrategy::kLru; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a question whether use kLru or KCustimized as the benchmark is depend on what.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Lru
would be more popular, according to recent feedback from end-users. So I changed it, but we can provide both when the performance becomes stable.
template <class S> | ||
static __forceinline__ __device__ S device_nano() { | ||
S mclk; | ||
asm volatile("mov.u64 %0,%%globaltimer;" : "=l"(mclk)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a test to compare the timestamp from timer on gpu and cpu. And make sure that the deviation would be in limited range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're a little different, but I provide a host_nano
for the host code to get the device timer in the test and benchmark util
S cur_score = | ||
bucket->cur_score.fetch_add(1, cuda::std::memory_order_relaxed) + 1; | ||
bucket->scores(key_pos)->store(cur_score, cuda::std::memory_order_relaxed); | ||
bucket->scores(key_pos)->store(device_nano<S>(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kLru
is faster than kCustomized
in benchmark. Is it because that the src loaded is changed from L2-ram to register?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so. Plus, the fetch_add
on the score
for one key should not be enough to fill up the cache line, so the result also benefits from canceling it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
cur_score
fromBucket