[DOC] RMM hypervisor incompatibility advisory for managed pools? #652

Closed
lmeyerov opened this issue Dec 10, 2020 · 6 comments · Fixed by #656
Labels
doc Documentation

Comments


Report incorrect documentation

Location of incorrect documentation

Potentially anywhere above or below RMM: the rapids.ai setup pages, the rmm module docs, the NVIDIA hypervisor advisories, ...

Describe the problems or issues found in the documentation

In a discussion with @kkraus14, it sounded like RMM managed memory is expected to fail on hypervisor / VMware setups. I didn't see public docs on this on the RAPIDS/cuDF/RMM side, nor in the hypervisor advisories. I'm simultaneously trying to figure out what the issue is, including its scope and workarounds, and helping a big enterprise that is trying to adopt the RAPIDS stack in a tough environment plan around it. Hopefully the issue rings a bell and we can save stress for other devs and users as well.

Steps taken to verify documentation is incorrect

Filed this issue. We don't have a VMware test lab, so this came as a surprise and is currently tricky to understand on our end.

Suggested fix for documentation



@harrism (Member) commented Dec 11, 2020

@lmeyerov this is documented here: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#cuda-open-cl-support-vgpu

#656 adds a note to the RMM readme that managed_memory_resource does not work on vGPU.
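
A minimal sketch of what that portable configuration can look like in the RMM Python API, assuming a plain CUDA pool is an acceptable substitute for managed memory on vGPU (the pool size below is just an illustrative value):

```python
import rmm

# On vGPU / vSphere guests, avoid managed (unified) memory and fall back to a
# plain CUDA memory pool; managed_memory=False keeps allocations in ordinary
# device memory, which the vGPU docs linked above say is supported.
rmm.reinitialize(
    pool_allocator=True,        # suballocate from a pool for performance
    managed_memory=False,       # do NOT use cudaMallocManaged under vGPU
    initial_pool_size=2 << 30,  # e.g. 2 GiB; tune for your GPU and workload
)

# Sanity check: allocate a small buffer through the configured resource.
buf = rmm.DeviceBuffer(size=1 << 20)
print("allocated", buf.size, "bytes via",
      type(rmm.mr.get_current_device_resource()).__name__)
```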

harrism removed the "? - Needs Triage (Need team to review and classify)" label on Dec 11, 2020
@lmeyerov (Author) commented:

Thank you for confirming and documenting so quickly; this gives us time over the weekend to get ahead of it for a difficult deployment group!

@lmeyerov (Author) commented Dec 18, 2020

@harrism FYI, we pushed a patch and tried ='default' for a vSphere env, but no luck. Screenshot thread at https://rapids-goai.slack.com/archives/C5E06F4DC/p1608324320035700 -- any tips for diagnostics to run as we prep a test kit for the environment?
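
One possible diagnostic for such a test kit, sketched here on the assumption that the rmm Python package imports inside the guest: attempt a managed allocation in isolation, to see whether managed memory specifically is the piece that fails or whether CUDA as a whole is unusable.

```python
import rmm

# Try a managed (cudaMallocManaged-backed) allocation by itself, so the test
# kit can distinguish "managed memory rejected on this vGPU" from "CUDA broken".
try:
    rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
    buf = rmm.DeviceBuffer(size=1 << 20)  # 1 MiB
    print("managed allocation OK:", buf.size, "bytes")
except Exception as exc:
    print("managed allocation failed:", exc)
finally:
    # Restore a plain device-memory resource afterwards.
    rmm.mr.set_current_device_resource(rmm.mr.CudaMemoryResource())
```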

@lmeyerov (Author) commented Dec 18, 2020

@kkraus14 Meant to check: should I start an issue around cudf/rmm on vGPUs? We personally got it working on Nutanix last year (and will be doing more in Q1). This is my first time trying vSphere, though I'm not sure what some of our more interesting users are doing.

@harrism (Member) commented Jan 5, 2021

From the thread, it seems the above is not an RMM issue.

@lmeyerov (Author) commented Jan 5, 2021

Yes, thanks @harrism. The docs fix seems fine for guiding folks toward writing portable RMM code.

We hit other failure modes around vGPUs. For example, CUDA is disabled when the NVIDIA license is type = 0 (dev / unlicensed mode), which threw our users for 1-2 weeks. In theory, Python CUDA libraries could try to give better error messages, but that's probably not worth it and is more of a job for lower levels.
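
A rough sketch of the kind of up-front check a deployment test kit (rather than the libraries themselves) could run, assuming Numba is installed in the environment; pointing at `nvidia-smi -q` for license status is a suggestion here, not something from this thread:

```python
from numba import cuda

# Fail fast with a readable message if CUDA is unusable in the guest
# (e.g. an unlicensed vGPU where the driver disables CUDA entirely).
if not cuda.is_available():
    raise RuntimeError(
        "No usable CUDA device detected. On vGPU guests, check the NVIDIA "
        "vGPU license status (e.g. via `nvidia-smi -q`) before debugging RAPIDS."
    )

cuda.detect()  # prints the detected devices for the test-kit log
```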
