Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Opt] let LRU mode use device clock #137

Merged
merged 1 commit into from
Jun 7, 2023

Conversation

rhdong
Copy link
Member

@rhdong rhdong commented Jun 7, 2023

  • Remove cur_score from Bucket
  • Switch benchmark to LRU strategy

- Remove `cur_score` from `Bucket`
- Switch benchmark to LRU strategy
@rhdong rhdong requested review from Lifann and jiashuy June 7, 2023 01:58
@github-actions
Copy link

github-actions bot commented Jun 7, 2023

@rhdong
Copy link
Member Author

rhdong commented Jun 7, 2023

/blossom-ci

@@ -177,7 +177,7 @@ float test_one_api(const API_Select api, const size_t dim,
options.dim = dim;
options.max_hbm_for_vectors = nv::merlin::GB(hbm4values);
options.io_by_cpu = io_by_cpu;
options.evict_strategy = EvictStrategy::kCustomized;
options.evict_strategy = EvictStrategy::kLru;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question whether use kLru or KCustimized as the benchmark is depend on what.

Copy link
Member Author

@rhdong rhdong Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Lru would be more popular, according to recent feedback from end-users. So I changed it, but we can provide both when the performance becomes stable.

template <class S>
static __forceinline__ __device__ S device_nano() {
S mclk;
asm volatile("mov.u64 %0,%%globaltimer;" : "=l"(mclk));
Copy link
Collaborator

@Lifann Lifann Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a test to compare the timestamp from timer on gpu and cpu. And make sure that the deviation would be in limited range.

Copy link
Member Author

@rhdong rhdong Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're a little different, but I provide a host_nano for the host code to get the device timer in the test and benchmark util

S cur_score =
bucket->cur_score.fetch_add(1, cuda::std::memory_order_relaxed) + 1;
bucket->scores(key_pos)->store(cur_score, cuda::std::memory_order_relaxed);
bucket->scores(key_pos)->store(device_nano<S>(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kLru is faster than kCustomized in benchmark. Is it because that the src loaded is changed from L2-ram to register?

Copy link
Member Author

@rhdong rhdong Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. Plus, the fetch_add on the score for one key should not be enough to fill up the cache line, so the result also benefits from canceling it.

Copy link
Collaborator

@Lifann Lifann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rhdong rhdong merged commit 460db25 into NVIDIA-Merlin:master Jun 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants