feat(nvidia/peermem): explicitly skip "invalid context" errors
as the latest driver fixes the issue

ref. Mellanox/nv_peer_memory#120

Signed-off-by: Gyuho Lee <[email protected]>
gyuho committed Jan 6, 2025
1 parent 35fe95a commit 814b42a
Showing 1 changed file with 10 additions and 1 deletion.
11 changes: 10 additions & 1 deletion components/accelerator/nvidia/peermem/component.go
@@ -126,7 +126,16 @@ func (c *component) Events(ctx context.Context, since time.Time) ([]components.E
 	if logItem.Matched == nil {
 		continue
 	}
-	if logItem.Matched.Name != dmesg.EventNvidiaPeermemInvalidContext {
+
+	// skip this for now as the latest driver https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-560-35-03/index.html#abstract fixes this issue
+	// "4272659 – A design defect has been identified and mitigated in the GPU kernel-mode driver, related to the GPUDirect RDMA support
+	// in MLNX_OFED and some Ubuntu kernels, commonly referred to as the PeerDirect technology, i.e. the one using the peer-memory kernel
+	// patch. In specific scenarios, for example involving the cleanup after killing of a multi-process application, this issue may lead to
+	// use-after-free and potentially to kernel memory corruption."
+	//
+	// ref. https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-129-03/index.html
+	// ref. https://github.com/Mellanox/nv_peer_memory/issues/120
+	if logItem.Matched.Name == dmesg.EventNvidiaPeermemInvalidContext {
 		continue
 	}
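
The effect of flipping `!=` to `==` can be seen in isolation with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: `match`, `logItem`, `keepEvents`, and the string value of `eventInvalidContext` are simplified stand-ins, not the repository's actual `dmesg` types or constants. The code after this hunk is not shown in the diff, so the sketch simply collects the items that pass both checks.

```go
package main

import "fmt"

// eventInvalidContext is a placeholder for dmesg.EventNvidiaPeermemInvalidContext.
// The real constant's value is not shown in the diff, so this string is assumed.
const eventInvalidContext = "nvidia_peermem_invalid_context"

// match and logItem are hypothetical, simplified stand-ins for the dmesg log types.
type match struct{ Name string }

type logItem struct {
	Line    string
	Matched *match
}

// keepEvents applies the two checks from the hunk as they read after the commit:
// unmatched lines are skipped, and lines matched as "invalid context" are skipped.
// Collecting the remaining items is an assumption; the diff does not show what
// the real loop does with them.
func keepEvents(items []logItem) []logItem {
	kept := make([]logItem, 0, len(items))
	for _, it := range items {
		if it.Matched == nil {
			continue
		}
		if it.Matched.Name == eventInvalidContext {
			continue
		}
		kept = append(kept, it)
	}
	return kept
}

func main() {
	items := []logItem{
		{Line: "unmatched line"},
		{Line: "invalid context line", Matched: &match{Name: eventInvalidContext}},
		{Line: "other matched line", Matched: &match{Name: "some_other_event"}},
	}
	for _, it := range keepEvents(items) {
		fmt.Println(it.Line)
	}
}
```

Before the commit, the second check used `!=`, so only "invalid context" matches got past it. With `==`, those matches are the ones dropped.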
