Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(components/memory): use common db + dmesg poller for events, move out of "dmesg" component #324

Merged
merged 2 commits into from
Jan 26, 2025

Conversation

gyuho
Copy link
Collaborator

@gyuho gyuho commented Jan 22, 2025

c.f., #322

Tested:

Screenshot 2025-01-25 at 10 38 26 AM
    "component": "memory",
    "startTime": "2024-11-15T04:48:56Z",
    "endTime": "2025-01-25T02:39:01.610443725Z",
    "events": [
      {
        "time": "2025-01-25T02:38:04Z",
        "name": "memory_oom_kill_constraint",
        "type": "Warning",
        "message": "oom kill constraint detected",
        "extra_info": {
          "log_line": "oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-3fc28fa1c647ceede9f6b340d0b16c9f1f663698972d22a52e296f291638e014.scope,mems_allowed=0-1,oom_memcg=/lxc.payload.ny2g2r5hh3-lxc/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda8697f49_441d_4d4f_90d2_6d8e1fa3bbe7.slice,task_memcg=/lxc.payload.ny2g2r5hh3-lxc/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda8697f49_441d_4d4f_90d2_6d8e1fa3bbe7.slice/cri-containerd-3fc28fa1c647ceede9f6b340d0b16c9f1f663698972d22a52e296f291638e014.scope,task=node,pid=863987,uid=0"
        },
        "suggested_actions": {
          "descriptions": null,
          "repair_actions": null
        }
      },
      {
        "time": "2025-01-25T02:36:53Z",
        "name": "memory_oom_kill_constraint",
        "type": "Warning",
        "message": "oom kill constraint detected",
        "extra_info": {
          "log_line": "oom-kill:constraint=OTHER_CONSTRAINT"
        },
        "suggested_actions": {
          "descriptions": null,
          "repair_actions": null
        }
      },
      {
        "time": "2025-01-25T02:36:50Z",
        "name": "memory_oom_kill_constraint",
        "type": "Warning",
        "message": "oom kill constraint detected",
        "extra_info": {
          "log_line": "oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-3fc28fa1c647ceede9f6b340d0b16c9f1f663698972d22a52e296f291638e014.scope,mems_allowed=0-1,oom_memcg=/lxc.payload.ny2g2r5hh3-lxc/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda8697f49_441d_4d4f_90d2_6d8e1fa3bbe7.slice,task_memcg=/lxc.payload.ny2g2r5hh3-lxc/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda8697f49_441d_4d4f_90d2_6d8e1fa3bbe7.slice/cri-containerd-3fc28fa1c647ceede9f6b340d0b16c9f1f663698972d22a52e296f291638e014.scope,task=node,pid=863987,uid=0"
        },
        "suggested_actions": {
          "descriptions": null,
          "repair_actions": null
        }
      },
      {
        "time": "2025-01-25T02:36:46Z",
        "name": "memory_oom",
        "type": "Warning",
        "message": "oom detected",
        "extra_info": {
          "log_line": "Out of memory: Kill process 456 (python) score 50 or sacrifice child"
        },
        "suggested_actions": {
          "descriptions": null,
          "repair_actions": null
        }
      },
      {
        "time": "2025-01-25T02:36:42Z",
        "name": "memory_oom",
        "type": "Warning",
        "message": "oom detected",
        "extra_info": {
          "log_line": "Out of memory: Killed process 123, UID 48, (httpd)"
        },
        "suggested_actions": {
          "descriptions": null,
          "repair_actions": null
        }
      }
    ]

@gyuho gyuho added this to the v0.4.0 milestone Jan 22, 2025
@gyuho gyuho self-assigned this Jan 22, 2025
@codecov-commenter
Copy link

codecov-commenter commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 45.04505% with 61 lines in your changes missing coverage. Please review.

Project coverage is 21.39%. Comparing base (02d1814) to head (13cc281).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
components/memory/component.go 0.00% 27 Missing ⚠️
components/memory/watcher.go 58.33% 20 Missing and 5 partials ⚠️
internal/server/server.go 0.00% 5 Missing ⚠️
pkg/dmesg/watcher.go 25.00% 3 Missing ⚠️
components/dmesg/filters.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #324      +/-   ##
==========================================
+ Coverage   21.21%   21.39%   +0.18%     
==========================================
  Files         300      300              
  Lines       27089    27119      +30     
==========================================
+ Hits         5747     5803      +56     
+ Misses      20697    20667      -30     
- Partials      645      649       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gyuho gyuho force-pushed the oom-events branch 2 times, most recently from 78ce5c9 to 8767ed5 Compare January 22, 2025 08:28
pkg/dmesg/watcher.go Outdated Show resolved Hide resolved
components/memory/db/events.go Outdated Show resolved Hide resolved
@eahydra
Copy link
Collaborator

eahydra commented Jan 22, 2025

About the test result:
image
IMO we only need to get an aggregated result? like OOM happened, killed processA(pid: 123)?

@gyuho
Copy link
Collaborator Author

gyuho commented Jan 22, 2025

@eahydra This is the current behavior. Needs confirmation from @cardyok if we need to aggregate.

@cardyok
Copy link
Collaborator

cardyok commented Jan 22, 2025

i think for dmesg errors that contains multiple lines to determine actually issue (like oom), its the best if we can aggregate it, but might make gpud logic too complicated

@gyuho
Copy link
Collaborator Author

gyuho commented Jan 22, 2025

@eahydra @cardyok

can aggregate

Oh actually, these are all different events, different log lines (just happens to be at the same time, since I injected them at the same time)

We can aggregate the event "name" to OOM or something, but those do not seem to be used anyways

@cardyok
Copy link
Collaborator

cardyok commented Jan 22, 2025

oh if they actually belong to different OOM event,s then i think its totally fine not aggregating them

@xiang90
Copy link
Contributor

xiang90 commented Jan 22, 2025

The name should not be dmesg_matched... Let us fix this. Name should be OOMed or something more descriptive.

@gyuho gyuho added the wip - do not merge working in progress label Jan 24, 2025
@gyuho gyuho force-pushed the oom-events branch 2 times, most recently from 4977fca to 9dff6f3 Compare January 24, 2025 15:57
@gyuho gyuho removed the wip - do not merge working in progress label Jan 24, 2025
3*24*time.Hour,
)
if err != nil {
ccancel()
Copy link

@ccding ccding Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a bit strange to cancel on error. Do we really need to create a cctx, ccancel := context.WithCancel(ctx)? Can we just use ctx for getDefaultPoller().Start and startWatch(cctx, eventsStore)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

creating event store first, so we don't need call ccancel

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cardyok do you have any idea how to make the ctx usage look better?

@gyuho gyuho changed the title feat(components/memory): use db + dmesg poller for events feat(components/memory): use common db + dmesg poller for events, move out of "dmesg" component Jan 25, 2025
components/memory/watcher.go Outdated Show resolved Hide resolved
components/memory/watcher.go Outdated Show resolved Hide resolved
pkg/dmesg/testdata/dmesg.decode.iso.log.1 Outdated Show resolved Hide resolved
@gyuho
Copy link
Collaborator Author

gyuho commented Jan 25, 2025

@ccding All fixed, thx

components/memory/component.go Outdated Show resolved Hide resolved
Signed-off-by: Gyuho Lee <[email protected]>
@gyuho gyuho merged commit be52aef into main Jan 26, 2025
6 checks passed
@gyuho gyuho deleted the oom-events branch January 26, 2025 02:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants