[Opt] let LRU mode use device clock #137

rhdong · 2023-06-07T01:58:27Z

- Remove `cur_score` from `Bucket`
- Switch benchmark to LRU strategy

github-actions · 2023-06-07T02:00:06Z

Documentation preview

https://nvidia-merlin.github.io/HierarchicalKV/review/pr-137

rhdong · 2023-06-07T02:02:00Z

/blossom-ci

jiashuy · 2023-06-07T02:19:51Z

benchmark/merlin_hashtable_benchmark.cc.cu

@@ -177,7 +177,7 @@ float test_one_api(const API_Select api, const size_t dim,
  options.dim = dim;
  options.max_hbm_for_vectors = nv::merlin::GB(hbm4values);
  options.io_by_cpu = io_by_cpu;
-  options.evict_strategy = EvictStrategy::kCustomized;
+  options.evict_strategy = EvictStrategy::kLru;


I have a question: what determines whether the benchmark should use kLru or kCustomized?

According to recent feedback from end users, LRU is the more popular strategy, so I switched the benchmark to it. We can provide both once the performance becomes stable.

Lifann · 2023-06-07T05:10:07Z

include/merlin/utils.cuh

+template <class S>
+static __forceinline__ __device__ S device_nano() {
+  S mclk;
+  asm volatile("mov.u64 %0,%%globaltimer;" : "=l"(mclk));


Maybe add a test that compares the timestamps from the GPU timer and the CPU clock, and makes sure the deviation stays within a limited range.

They're a little different, but I provide a `host_nano` in the test and benchmark utilities so that host code can read the device timer.
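The deviation check suggested above could look roughly like the following host-side sketch. It is not taken from the PR: `abs_diff_ns`, `within_tolerance`, and `host_wall_nano` are hypothetical helper names, and the device timestamp is assumed to have been copied back from a kernel that calls `device_nano`. Whether the two clocks share an epoch is also an assumption that a real test would need to confirm first.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: absolute difference of two unsigned nanosecond
// timestamps. Comparing first avoids the wrap-around that a plain
// `a - b` would produce when b > a.
inline std::uint64_t abs_diff_ns(std::uint64_t a, std::uint64_t b) {
  return a > b ? a - b : b - a;
}

// Hypothetical helper: true when the host and device timestamps differ by
// at most tol_ns nanoseconds. device_ns is assumed to come from a kernel
// that read the device timer; here it is just a plain value.
inline bool within_tolerance(std::uint64_t host_ns, std::uint64_t device_ns,
                             std::uint64_t tol_ns) {
  return abs_diff_ns(host_ns, device_ns) <= tol_ns;
}

// Host wall-clock time in nanoseconds since the system_clock epoch.
inline std::uint64_t host_wall_nano() {
  using namespace std::chrono;
  return static_cast<std::uint64_t>(
      duration_cast<nanoseconds>(system_clock::now().time_since_epoch())
          .count());
}
```

A test built on this would read `host_wall_nano()` immediately before and after fetching the device timestamp and pick a tolerance that covers the launch and copy latency.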

Lifann · 2023-06-07T05:22:19Z

include/merlin/core_kernels.cuh

-    S cur_score =
-        bucket->cur_score.fetch_add(1, cuda::std::memory_order_relaxed) + 1;
-    bucket->scores(key_pos)->store(cur_score, cuda::std::memory_order_relaxed);
+    bucket->scores(key_pos)->store(device_nano<S>(),


kLru is faster than kCustomized in the benchmark. Is that because the score is now sourced from a register instead of being loaded from L2/RAM?

Yes, I think so. In addition, the fetch_add on the score for a single key does not touch enough data to fill a cache line, so removing it also helps the result.
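To make the difference between the two scoring schemes concrete, here is a minimal host-side analogy in plain C++. It is a sketch, not the library's code: `ToyBucket`, `touch_with_counter`, `touch_with_timestamp`, and `lru_victim` are hypothetical names, and the timestamp is passed in explicitly where the kernel would call `device_nano<S>()`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a bucket: one score per slot, plus the shared
// counter that only the old scheme needs.
struct ToyBucket {
  std::vector<std::atomic<std::uint64_t>> scores;
  std::atomic<std::uint64_t> cur_score{0};

  explicit ToyBucket(std::size_t slots) : scores(slots) {
    for (auto& s : scores) s.store(0, std::memory_order_relaxed);
  }
};

// Old scheme: every touch performs an atomic read-modify-write on the
// shared per-bucket counter, then stores the result as the slot's score.
inline void touch_with_counter(ToyBucket& b, std::size_t pos) {
  std::uint64_t s = b.cur_score.fetch_add(1, std::memory_order_relaxed) + 1;
  b.scores[pos].store(s, std::memory_order_relaxed);
}

// New scheme: the score is a timestamp supplied by the caller, so no shared
// counter is touched. `now` stands in for the device clock reading.
inline void touch_with_timestamp(ToyBucket& b, std::size_t pos,
                                 std::uint64_t now) {
  b.scores[pos].store(now, std::memory_order_relaxed);
}

// The least recently used slot is the one with the smallest score.
inline std::size_t lru_victim(const ToyBucket& b) {
  assert(!b.scores.empty());
  std::size_t victim = 0;
  std::uint64_t best = b.scores[0].load(std::memory_order_relaxed);
  for (std::size_t i = 1; i < b.scores.size(); ++i) {
    std::uint64_t s = b.scores[i].load(std::memory_order_relaxed);
    if (s < best) {
      best = s;
      victim = i;
    }
  }
  return victim;
}
```

Both schemes produce the same recency order as long as the timestamps increase between touches. The timestamp variant just avoids the extra atomic on shared bucket state.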

Lifann

LGTM

[Opt] let LRU mode use device clock

8d1edab

rhdong requested review from Lifann and jiashuy June 7, 2023 01:58

jiashuy reviewed Jun 7, 2023

jiashuy approved these changes Jun 7, 2023

Lifann reviewed Jun 7, 2023

Lifann approved these changes Jun 7, 2023

rhdong merged commit 460db25 into NVIDIA-Merlin:master Jun 7, 2023
