[DOC] RMM hypervisor incompatibility advisory for managed pools? #652

Closed
lmeyerov opened this issue Dec 10, 2020 · 6 comments · Fixed by #656
Labels
doc Documentation

Comments


Report incorrect documentation

Location of incorrect documentation

Potentially anywhere above or below RMM: the rapids.ai setup pages, the rmm module docs, the NVIDIA hypervisor advisories, ...

Describe the problems or issues found in the documentation

In a discussion with @kkraus14, it sounded like RMM managed memory is expected to fail on hypervisor / VMware setups. I didn't see public docs on this on the RAPIDS/cuDF/RMM side, nor in the hypervisor advisories. I'm simultaneously trying to figure out what the issue is, including its scope and workarounds, and helping a big enterprise that is trying to adopt the RAPIDS stack in a tough environment plan around it. Hopefully the issue rings a bell and we can save stress for other devs and users as well.

Steps taken to verify documentation is incorrect

Filed this issue. We don't have a VMware test lab, so this came as a surprise and is currently tricky to understand on our end.

Suggested fix for documentation



@harrism (Member) commented Dec 11, 2020

@lmeyerov this is documented here: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#cuda-open-cl-support-vgpu

#656 adds a note to the RMM readme that managed_memory_resource does not work on vGPU.
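
A minimal sketch of what that portable configuration can look like in the RMM Python API, assuming a plain CUDA pool is an acceptable substitute for managed memory on vGPU (the pool size below is just an illustrative value):

```python
import rmm

# On vGPU / vSphere guests, avoid managed (unified) memory and fall back to a
# plain CUDA memory pool; managed_memory=False keeps allocations in ordinary
# device memory, which the vGPU docs linked above say is supported.
rmm.reinitialize(
    pool_allocator=True,        # suballocate from a pool for performance
    managed_memory=False,       # do NOT use cudaMallocManaged under vGPU
    initial_pool_size=2 << 30,  # e.g. 2 GiB; tune for your GPU and workload
)

# Sanity check: allocate a small buffer through the configured resource.
buf = rmm.DeviceBuffer(size=1 << 20)
print("allocated", buf.size, "bytes via",
      type(rmm.mr.get_current_device_resource()).__name__)
```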

harrism removed the "? - Needs Triage (Need team to review and classify)" label on Dec 11, 2020
@lmeyerov (Author) commented:

Thank you for confirming and documenting so quickly; this gives us time over the weekend to get ahead of it for a difficult deployment group!

@lmeyerov (Author) commented Dec 18, 2020

@harrism FYI, we pushed a patch and tried ='default' for a vSphere env, but no luck. Screenshot thread at https://rapids-goai.slack.com/archives/C5E06F4DC/p1608324320035700 -- any tips for diagnostics to run as we prep a test kit for the environment?
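
One possible diagnostic for such a test kit, sketched here on the assumption that the rmm Python package imports inside the guest: attempt a managed allocation in isolation, to see whether managed memory specifically is the piece that fails or whether CUDA as a whole is unusable.

```python
import rmm

# Try a managed (cudaMallocManaged-backed) allocation by itself, so the test
# kit can distinguish "managed memory rejected on this vGPU" from "CUDA broken".
try:
    rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
    buf = rmm.DeviceBuffer(size=1 << 20)  # 1 MiB
    print("managed allocation OK:", buf.size, "bytes")
except Exception as exc:
    print("managed allocation failed:", exc)
finally:
    # Restore a plain device-memory resource afterwards.
    rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())
```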

@lmeyerov (Author) commented Dec 18, 2020

@kkraus14 Meant to check: should I start an issue around cudf/rmm on vGPUs? We personally got it working on Nutanix last year (and will be doing more in Q1). This is my first time trying vSphere, though I'm not sure what some of our more interesting users are doing.

@harrism (Member) commented Jan 5, 2021

From the thread, it seems the above is not an RMM issue.

@lmeyerov (Author) commented Jan 5, 2021

Yes, thanks @harrism. The docs fix seems fine for guiding folks toward writing portable RMM code.

We hit other failure modes around vGPUs. For example, CUDA is disabled when the NVIDIA license is type = 0 (dev / unlicensed mode), which threw our users for 1-2 weeks. In theory, Python CUDA libraries could try to give better error messages, but that's probably not worth it and is more of a job for lower levels.
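
A rough sketch of the kind of up-front check a deployment test kit (rather than the libraries themselves) could run, assuming Numba is installed in the environment; pointing at `nvidia-smi -q` for license status is a suggestion here, not something from this thread:

```python
from numba import cuda

# Fail fast with a readable message if CUDA is unusable in the guest
# (e.g. an unlicensed vGPU where the driver disables CUDA entirely).
if not cuda.is_available():
    raise RuntimeError(
        "No usable CUDA device detected. On vGPU guests, check the NVIDIA "
        "vGPU license status (e.g. via `nvidia-smi -q`) before debugging RAPIDS."
    )

cuda.detect()  # prints the detected devices for the test-kit log
```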
