Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cpu-usec metric support under CLUSTER SLOT-STATS command (#20). #712

Merged
merged 9 commits into from
Jul 23, 2024

Conversation

kyle-yh-kim
Copy link
Contributor

The metric tracks cpu time in micro-seconds, sharing the same value as INFO COMMANDSTATS, aggregated under per-slot context.

…io#20).

The metric tracks cpu time in micro-seconds, sharing the same
value as INFO COMMANDSTATS, aggregated under per-slot context.

Signed-off-by: Kyle Kim <[email protected]>
@kyle-yh-kim
Copy link
Contributor Author

First revision only holds implementation changes for initial feedback purposes, with pending perf testing.
Integration tests are not up-to-date, and thus failing. This will soon be followed-up.

src/cluster_slot_stats.c Outdated Show resolved Hide resolved
src/cluster_slot_stats.c Outdated Show resolved Hide resolved
src/cluster_legacy.c Outdated Show resolved Hide resolved
src/cluster_slot_stats.c Outdated Show resolved Hide resolved
src/cluster_slot_stats.h Outdated Show resolved Hide resolved
src/cluster_slot_stats.c Outdated Show resolved Hide resolved
@zvi-code
Copy link
Contributor

zvi-code commented Jul 3, 2024

@kyle-yh-kim

I totally get the value in per-slot information! But wondering about the how part.

Have you considered alternatives? For example:

  1. extend server.duration_stats[<type>] to server.duration_stats[<type>][max_slots +1] ?
  2. extend the real command to have slot dimension?
  3. keep the info on the client until the slot of the client is reset...

The reason I raise this is we have many slot array in many places in the code and this seems to be arbitrary addition to that and on the other hand we have several duration tracking mechanisms and this adds another one. I am wondering if we can merge/reuse these. In addition I can see we would want additional information like command count and maybe histogram.. this will not extend natively and we should strive to reuse what we have or enhance it, IMO.

@kyle-yh-kim
Copy link
Contributor Author

Thanks for your comments. To answer your concerns;

Have you considered alternatives? For example:

  1. extend server.duration_stats[] to server.duration_stats[][max_slots +1] ?
  2. extend the real command to have slot dimension?
  3. keep the info on the client until the slot of the client is reset...

The reason I raise this is we have many slot array in many places in the code and this seems to be arbitrary addition to that and on the other hand we have several duration tracking mechanisms and this adds another one.

For cpu-usec explicitly, I agree that this metric alone could be squeezed into other adjacent storages mentioned above, with added refactoring efforts.

However, when reviewing the per-slot metrics holistically, with the additions of per-slot 1) network-bytes-in [PR link], 2) network-bytes-out [PR-link] and 3) memory-bytes [PR-pending], this sparse disconnection between cpu-usec and the remaining per-slot metrics no longer makes sense.

Plus, we obtain cache locality benefits from keeping all metrics memory contiguous, as the CLUSTER SLOT-STATS command accesses all tracked metrics (cpu, network-bytes-in, network-bytes-out, memory-bytes) under the specified range of slots.

Given considerations of Valkey 8.0 rc1 code-cutoff timeline, refactoring efforts, remaining per-slot metrics storage cohesiveness, and cache locality benefits, my preference is to keep the slot_stats[] location as is. The only remaining change would be to follow the latest proposal from @zuiderkwast, which recommends to relocate the slot statistics array under clusterState such that it only consumes this memory in cluster mode;

typedef struct slotStat {
    uint64_t cpu_usec;
    uint64_t network_bytes_in;
    uint64_t network_bytes_out;
    uint64_t memory_bytes;
} slotStat;

struct clusterState {
    ...
    slotStat slot_stats[CLUSTER_SLOTS];
};

- Added more integration tests.
- Updated canAddCpuDuration() condition to avoid duplicate accounting
  upon exec, eval and fcall commands.

Signed-off-by: Kyle Kim <[email protected]>
src/cluster_slot_stats.h Outdated Show resolved Hide resolved
@madolson
Copy link
Member

madolson commented Jul 14, 2024

The reason I raise this is we have many slot array in many places in the code and this seems to be arbitrary addition to that and on the other hand we have several duration tracking mechanisms and this adds another one. I am wondering if we can merge/reuse these. In addition I can see we would want additional information like command count and maybe histogram.. this will not extend natively and we should strive to reuse what we have or enhance it, IMO.

Command count might make, I'm not sure we really want a histogram on a per-slot basis seems to be missing the point of statistics though. The point of these metrics is for identify hot shards for migration for auto-balancing, it's not for trying to introspect too much into a specific slot.

@kyle-yh-kim kyle-yh-kim force-pushed the 20-cpu branch 2 times, most recently from 9659a40 to d3d71c1 Compare July 15, 2024 03:39
- Added a modifiable config, "cluster-slot-stats-enabled", with default value false.
- Added comments for bypassing EXEC, EVAL and FCALL calculations.
- Added integration test cases for functions.
- Fixed existing integration test cases.
- Moved slot_stats array under clusterState.

Signed-off-by: Kyle Kim <[email protected]>
src/cluster_slot_stats.c Outdated Show resolved Hide resolved
src/cluster_slot_stats.c Outdated Show resolved Hide resolved
src/cluster_slot_stats.c Outdated Show resolved Hide resolved
src/cluster_slot_stats.c Outdated Show resolved Hide resolved
src/config.c Show resolved Hide resolved
tests/unit/cluster/slot-stats.tcl Outdated Show resolved Hide resolved
src/cluster_slot_stats.c Outdated Show resolved Hide resolved
- Updated the aggregation to bypass nested commands, and instead
  account for wrapper commands (EXEC, EVAL, FCALL).
- Added slot invalidation for cross-slot lua scripting.
- Added resetClusterStats().
- Added documentation to valkey.conf

Signed-off-by: Kyle Kim <[email protected]>
Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, just some minor comments, please merge and we can address the comments and merge.

src/cluster_slot_stats.c Outdated Show resolved Hide resolved
src/cluster_slot_stats.c Show resolved Hide resolved
valkey.conf Outdated Show resolved Hide resolved
@madolson madolson added the release-notes This issue should get a line item in the release notes label Jul 22, 2024
Copy link

codecov bot commented Jul 22, 2024

Codecov Report

Attention: Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.

Project coverage is 70.36%. Comparing base (14e09e9) to head (de6c75e).
Report is 1 commits behind head on unstable.

Additional details and impacted files
@@            Coverage Diff            @@
##           unstable     #712   +/-   ##
=========================================
  Coverage     70.36%   70.36%           
=========================================
  Files           112      112           
  Lines         61275    61308   +33     
=========================================
+ Hits          43115    43141   +26     
- Misses        18160    18167    +7     
Files Coverage Δ
src/blocked.c 91.48% <100.00%> (+0.03%) ⬆️
src/cluster.c 88.35% <100.00%> (+0.05%) ⬆️
src/cluster_legacy.c 85.94% <100.00%> (+0.21%) ⬆️
src/config.c 78.69% <100.00%> (+0.01%) ⬆️
src/script.c 87.50% <100.00%> (+0.09%) ⬆️
src/server.c 88.56% <100.00%> (+<0.01%) ⬆️
src/server.h 100.00% <ø> (ø)
src/cluster_slot_stats.c 90.32% <92.00%> (+1.10%) ⬆️

... and 7 files with indirect coverage changes

@kyle-yh-kim kyle-yh-kim force-pushed the 20-cpu branch 3 times, most recently from 1a1ce40 to 930fe95 Compare July 22, 2024 20:26
- Updated integration test to assert against expected slots existence.
- Updated valkey.conf

Signed-off-by: Kyle Kim <[email protected]>
src/cluster_slot_stats.c Outdated Show resolved Hide resolved
src/cluster_slot_stats.c Outdated Show resolved Hide resolved
Signed-off-by: Madelyn Olson <[email protected]>
- Updated ORDERBY handling to raise error appropriately.
- Updated integration tests to include the above test case.

Signed-off-by: Kyle Kim <[email protected]>
} else {
addReplyError(c, "Unrecognized sort metric for ORDER BY. The supported metrics are: key-count.");
addReplyError(c, "Unrecognized sort metric for ORDERBY.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this error, it would be nice to be more useful. We can follow up on this though.

src/server.h Outdated Show resolved Hide resolved
Signed-off-by: Madelyn Olson <[email protected]>
@madolson
Copy link
Member

Reply schema looked good: https://github.com/valkey-io/valkey/actions/runs/10050729040

@madolson madolson merged commit 5000c05 into valkey-io:unstable Jul 23, 2024
21 checks passed
hwware pushed a commit to hwware/valkey that referenced this pull request Jul 25, 2024
…io#20). (valkey-io#712)

The metric tracks cpu time in micro-seconds, sharing the same value as
`INFO COMMANDSTATS`, aggregated under per-slot context.

---------

Signed-off-by: Kyle Kim <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
zuiderkwast added a commit to valkey-io/valkey-doc that referenced this pull request Nov 18, 2024
Docs for CLUSTER SLOT-STATS, with key-count, cpu-usec, network-bytes-in,
and network-bytes-out metrics.

- valkey-io/valkey#351
- valkey-io/valkey#712
- valkey-io/valkey#720

---------

Signed-off-by: Kyle Kim <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes This issue should get a line item in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants