
Implement -nvidia=all #1205

Closed
iameli opened this issue Nov 19, 2019 · 14 comments · Fixed by #1840
iameli commented Nov 19, 2019

"Please run this transcoder on all nvidia GPUs available" I think is going to be a pretty common case — it'd be nice if Kubernetes manifests and whatnot could contain a line like `-nvidia=all` so that I can have the same command work across a variety of hardware.

iameli commented Nov 21, 2019

(I kind of think that doing just -nvidia=0,1,2,3,4,5,6,7,8,9,10,11,12 up to some arbitrarily high limit works just fine though)

EDIT: It does not. It tries to transcode on the card, then comes back with an "invalid ordinal" error.

@AbAb1l AbAb1l self-assigned this Feb 3, 2020
iameli commented Feb 5, 2020

Thinking about this more, I think I'm more in favor of -nvidia=all rather than -nvidia=* to avoid folks accidentally globbing with the *.

@iameli iameli changed the title Implement -nvidia='*' Implement -nvidia=all Feb 5, 2020
iameli commented Feb 18, 2020

@j0sh @AbAb1l Any update on this? It's not impossible to work around, but currently we're maintaining three different sorts of deployments with different command line parameters to account for different sorts of boxes; this would make things cleaner.

@iameli iameli assigned ya7ya and unassigned AbAb1l Feb 9, 2021
iameli commented Feb 11, 2021

@ya7ya came up with this, which works quite well; perhaps we could do something like this:

nvidia-smi --query-gpu=index --format=csv,noheader | sed -z 's/\\n/,/g;s/,$/\\n/'

yondonfu commented Apr 1, 2021

nvidia-smi --query-gpu=index --format=csv,noheader | sed -z 's/\\n/,/g;s/,$/\\n/' seems to return a new line delimited string:

0
1

nvidia-smi --query-gpu=index --format=csv,noheader | tr "\n" "," | sed 's/,$//' seems to work though:

0,1

That being said, device enumeration seems to be implemented directly using the CUDA API in other programs like ethminer and ffmpeg. I think nvidia-smi should be available on any machine that has a driver installed, but it could be nice to have the device enumeration be baked directly into LPMS to avoid an explicit dependency on an external binary.

I've also noticed that the default behavior for other programs that use Nvidia GPUs like ethminer and t-rex is to enumerate all devices by default if no device IDs are specified. Instead of requiring -nvidia=all to enumerate all device IDs, the behavior could be:

  • If -nvidia is not specified, Nvidia transcoding is disabled
  • If -nvidia is specified without any arguments, Nvidia transcoding is enabled and all devices are enumerated
  • If -nvidia is specified with a comma delimited string, Nvidia transcoding is enabled and the devices specified are used

jailuthra commented:

Directly using the CUDA API like ethminer and ffmpeg will be tricky for us. We do not have a direct dependency on Nvidia's libs and headers. FFmpeg loads them internally, but does not provide a way to access them externally. We could try loading them directly in LPMS, similar to what FFmpeg does via ffnvcodec/dynlink_loader.h - but it will be a pain to set up and test.

it could be nice to have the device enumeration be baked directly into LPMS to avoid an explicit dependency on an external binary.

Good news: the Nvidia Management Library (NVML), the underlying lib for nvidia-smi, is relatively straightforward to set up. It might also come in handy later to query utilization or other metrics.

A Cgo wrapper for NVML can be used directly to iterate over available devices. The wrapper links against libnvidia-ml.so.1 - which is shipped with the Nvidia drivers on Ubuntu and Arch Linux. This wrapper worked out-of-the-box during my tests on Linux; not sure about Windows yet.

If -nvidia is specified without any arguments, Nvidia transcoding is enabled and all devices are enumerated

👍

yondonfu commented Apr 12, 2021

There also seems to be official NVML Go bindings from Nvidia, but that is Linux only and Windows support has not been added yet (sounds like it was supported in an older version of the bindings based on the comments).

darkdarkdragon commented:

I think that Windows machines with more than one GPU will be rare and certainly Windows will not be used on mining farms - so I think we can support -nvidia=all only on Linux.

jailuthra commented:

Summarizing the above discussion around NVML. There are 3 Go wrappers:
(1) https://github.com/mindprince/gonvml - Community run; directly loads libnvidia-ml.so, so I presume no Windows support
(2) https://github.com/NVIDIA/go-nvml - Official NVIDIA bindings, dedicated to NVML, but lacks Windows support right now (it seems to be possible to add it later, from the open issue's discussion)
(3) https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/bindings/go/nvml - Official NVIDIA monitoring toolkit for docker/k8s environments - it supports NVML on Windows too, but it isn't usable as a standalone binding

For now I'm implementing this with (2), sticking to Linux support only. If we see demand for this on Windows (or if I have spare time) we can probably add it ourselves and send a PR on the upstream issue.

iameli commented Apr 14, 2021

@jailuthra What about this? Looks like it can list GPUs in a cross-platform kind of way. https://github.com/jaypipes/ghw#gpu

Edit: Tested it on Windows (mingw64) and this example script seemed to work:

package main

import (
	"fmt"

	"github.com/jaypipes/ghw"
)

func main() {
	gpu, err := ghw.GPU()
	if err != nil {
		fmt.Printf("Error getting GPU info: %v\n", err)
		return
	}

	fmt.Printf("%v\n", gpu)

	for _, card := range gpu.GraphicsCards {
		fmt.Printf(" %v\n", card)
	}
}

Produced this output:

$ ./gpu-detection.exe
gpu (1 graphics card)
 card #0 @PCI\\VEN_10DE&DEV_1B06&SUBSYS_374C1458&REV_A1\\4&1FC990D7&0&0019 -> class: 'unknown' vendor: 'NVIDIA' product: 'NVIDIA GeForce GTX 1080 Ti'

Testing on the same machine booted into Linux now...

jailuthra commented Apr 14, 2021

@jailuthra What about this? Looks like it can list GPUs in a cross-platform kind of way. https://github.com/jaypipes/ghw#gpu

Neat find! I had already implemented a fix using go-nvml and it works on my Linux machine - but importing the library is making our Windows CI build fail 😞

If an easy fix for that isn't possible I'll switch to ghw just to keep the build process sane - although it won't really work on Windows for multiple chipsets, as it hardcodes the GPU device ID as 0

edit: ahh but we could use the number of chipsets returned on Windows ^ and create our own array of IDs like we're already doing with go-nvml.

iameli commented Apr 14, 2021

Right on, that'd probably work. My Ubuntu installation on that machine seems to be broken, but here's the same script on an 8-GPU rig in BER:

./gpu-detection
gpu (8 graphics cards)
 card #0 @0000:01:00.0 -> class: 'Display controller' vendor: 'NVIDIA Corporation' product: 'TU116 [GeForce GTX 1660 SUPER]'
 card #1 @0000:02:00.0 -> class: 'Display controller' vendor: 'NVIDIA Corporation' product: 'TU116 [GeForce GTX 1660]'
 card #2 @0000:03:00.0 -> class: 'Display controller' vendor: 'NVIDIA Corporation' product: 'TU116 [GeForce GTX 1660]'
 card #3 @0000:04:00.0 -> class: 'Display controller' vendor: 'NVIDIA Corporation' product: 'TU116 [GeForce GTX 1660 SUPER]'
 card #4 @0000:05:00.0 -> class: 'Display controller' vendor: 'NVIDIA Corporation' product: 'TU116 [GeForce GTX 1660 SUPER]'
 card #5 @0000:06:00.0 -> class: 'Display controller' vendor: 'NVIDIA Corporation' product: 'TU116 [GeForce GTX 1660]'
 card #6 @0000:07:00.0 -> class: 'Display controller' vendor: 'NVIDIA Corporation' product: 'TU116 [GeForce GTX 1660]'
 card #7 @0000:08:00.0 -> class: 'Display controller' vendor: 'NVIDIA Corporation' product: 'TU116 [GeForce GTX 1660]'

Windows says vendor: 'NVIDIA' and Linux says vendor: 'NVIDIA Corporation' - I suppose we just allow any that contain the word NVIDIA?

jailuthra commented Apr 14, 2021

@iameli Perfect, I've switched to ghw and it's working great on my linux machine too!

Windows says vendor: 'NVIDIA' and Linux says vendor: 'NVIDIA Corporation' - I suppose we just allow any that contain the word NVIDIA?

Yeah somehow I did exactly that without having that info :P

if strings.EqualFold(card.DeviceInfo.Vendor.Name[:6], "nvidia") {

jailuthra commented:

cc @yondonfu

If -nvidia is specified without any arguments, Nvidia transcoding is enabled and all devices are enumerated

Golang's flag library does not support empty values for string flags (it only works for boolean flags like -testTranscoder)

Something like -nvidia= would have worked, but IMO that would have been confusing, and it differs from ethminer/t-rex anyway. For now I've stuck with -nvidia=all unless a cleaner solution is feasible.
