Unable to change power limit with nvidia-smi #483

Open
machinedgod opened this issue Apr 1, 2023 · 163 comments
Labels
bug Something isn't working

Comments

@machinedgod

NVIDIA Open GPU Kernel Modules Version

530.41.03-1

Does this happen with the proprietary driver (of the same version) as well?

Yes

Operating System and Version

Arch Linux

Kernel Release

Linux 6.2.8-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 22 Mar 2023 22:52:35 +0000 x86_64 GNU/Linux

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 2060 (UUID: GPU-2f685ce6-33f4-db75-ce05-81d1723a6ddb)

Describe the bug

Before a recent update, I was able to execute:
~$ sudo nvidia-smi --power-limit 60

and have it work as expected.

After update, this is the output:

Changing power management limit is not supported for GPU: 00000000:01:00.0.
Treating as warning and moving on.
All done.
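
Because nvidia-smi treats the refusal as a warning and still prints "All done.", a script that sets the limit cannot rely on that last line alone. A minimal sketch, assuming the warning text matches the output quoted above (the helper name `unsupported_gpus` is hypothetical), scans the captured output for the affected bus IDs:

```python
import re

# Matches the warning line quoted in this report. The exact wording is an
# assumption taken from that output and may differ between driver versions.
_UNSUPPORTED = re.compile(
    r"Changing power management limit is not supported for GPU:\s*"
    r"([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])"
)


def unsupported_gpus(smi_output: str) -> list[str]:
    """Return the PCI bus IDs for which the power limit change was refused."""
    return _UNSUPPORTED.findall(smi_output)


# Sample text copied from the output above.
sample = (
    "Changing power management limit is not supported for GPU: 00000000:01:00.0.\n"
    "Treating as warning and moving on.\n"
    "All done.\n"
)
print(unsupported_gpus(sample))
```

An empty list means no such warning was found in the text that was passed in.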

Changing power limits caused observable changes both in temperature and in performance, so I am pretty sure my GPU supports it.

For context:
I found that the default power limit of 80W tends to heat up the GPU enough that it starts throttling itself and causes stuttering. 60W seemed to work perfectly and kept it under 63C during everything, without boosting my fan. The computer itself is a laptop (which explains the issues with heat dissipation), a Lenovo Legion Y740, and I have a pretty good cooling pad to help out.

To Reproduce

~# nvidia-smi --power-limit 60

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

Just a bit more context:
you may notice that in my Xorg config I use the option to skip the EDID check for the HDMI-0 output. The reason is that either the driver doesn't recognize my (about 7-year-old) monitor, or the checksum the monitor outputs is invalid, and then it wouldn't let me use the Full HD resolution on that monitor.

No idea if this is in any way related (I presume it's not), but this setup worked for 5-6 months, so I doubt it's related.

@machinedgod added the bug (Something isn't working) label Apr 1, 2023
@ghost

ghost commented Apr 2, 2023

I also cannot change the power limit with nvidia-smi on this driver, while I can on the 525 series (I have an RTX 3050 laptop).

On 530 it says "Changing power limit is not supported for this GPU".

So which is the intended behavior? Make up your mind, NVIDIA.

I really hope it's not the latter, because of this fun issue which also happens to me: #435

@dylif

dylif commented Apr 5, 2023

Relevant discussion here

@machinedgod
Author

Oh, so it's a wide-reaching problem, not just for the 2060 and 3050.
I suppose it has enough visibility now, so it should probably be fixed soonish.

@msmafra

msmafra commented Apr 10, 2023

With 530.30.02 (beta) the power limits were working fine for the first time on my laptop with a RTX 2060 Mobile 6GB. It broke after 530.42.03 update.

@xiguayuyichao

With the 525 version of the driver, my RTX 3060 was able to limit power for the first time, but since updating to the 530 version of the driver, it has been impossible to limit power.

@ghost

ghost commented Apr 15, 2023

If anybody here is using a lenovo gaming laptop with the 530 drivers
Can anyone confirm this other power limit issue #492

@youmukonpaku1337

got same issue

@youmukonpaku1337

on a 3060 AMD legion 5

@kraskden

kraskden commented Jun 3, 2023

Legion S7 15ACH6 / AMD Ryzen 7 5800H / NVIDIA GeForce RTX 3060

Same issue

@youmukonpaku1337

driver version?

> Legion S7 15ACH6 / AMD Ryzen 7 5800H / NVIDIA GeForce RTX 3060
>
> Same issue

@kraskden

kraskden commented Jun 3, 2023

> driver version?

530.41.03-15

@youmukonpaku1337

> > driver version?
>
> 530.41.03-15

try 535 beta

@youmukonpaku1337

unless it doesn't exist for okm then test proprietary

@ghost

ghost commented Jun 3, 2023

Still have the issue on 535 BETA proprietary which is the only one with the "fix" i think

@youmukonpaku1337

hmmm

@youmukonpaku1337

test with 525 branch (prop, or okm if exists)

@arutar

arutar commented Jun 4, 2023

Ryzen 5 5600H / NVIDIA GeForce RTX 3060 Laptop
Same issue
on version 525.78.01 everything works fine

@youmukonpaku1337

> Ryzen 5 5600H / NVIDIA GeForce RTX 3060 Laptop
> Same issue
> on version 525.78.01 everything works fine

did you test 535 (prop)

@arutar

arutar commented Jun 6, 2023

@barokvanzieks Yesterday I tested the power limit on several driver versions.
(Ryzen 5 5600H / NVIDIA GeForce RTX 3060 Laptop / Ubuntu 20.04)
(command: nvidia-smi -pl 50)

  • 525.78.01 power limit works fine
  • 525.85.05 power limit works fine
  • 525.89.02 power limit NOT CHECKED
  • 525.105.17 power limit works fine
  • 530.41.03 power limit is NOT WORKING
  • 535.43.02 power limit is NOT WORKING

So the power limit has not worked since the 53x.xx.xx versions.

The power limit works on all the 525.xx.xx drivers I tested.
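
The results above can be written down as a small lookup, which is handy when scripting a check before a driver upgrade. This is only a sketch of what was reported here for one RTX 3060 laptop, not a statement about how the driver decides; the names `REPORTED`, `driver_branch`, `power_limit_reported_working` and `branch_summary` are hypothetical.

```python
# Outcomes reported in this comment for `nvidia-smi -pl 50` on one
# RTX 3060 laptop. True = limit applied, False = not working.
# 525.89.02 was not checked, so it is deliberately left out.
REPORTED = {
    "525.78.01": True,
    "525.85.05": True,
    "525.105.17": True,
    "530.41.03": False,
    "535.43.02": False,
}


def driver_branch(version: str) -> int:
    """Return the major branch of a driver version string, e.g. 525."""
    return int(version.split(".", 1)[0])


def power_limit_reported_working(version: str):
    """Return True/False if this exact version was tested, else None."""
    return REPORTED.get(version)


def branch_summary() -> dict[int, bool]:
    """Collapse the per-version results into one verdict per branch."""
    summary: dict[int, bool] = {}
    for version, ok in REPORTED.items():
        branch = driver_branch(version)
        # A branch counts as working only if every tested version worked.
        summary[branch] = summary.get(branch, True) and ok
    return summary


print(branch_summary())
```

Versions that were not tested return `None` from `power_limit_reported_working`, so an untested release is never mistaken for a working one.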

@youmukonpaku1337

hmmmmm ok i havent tested 535 myself, guess ill stay on 525 for now

@Diafwl

Diafwl commented Jun 15, 2023

i checked latest version of nvidia driver "535.54.03" with dkms and still NOT WORKING

(i5-12500H / NVIDIA GeForce RTX 3060 Laptop / EndeavourOS)

@ghost

ghost commented Jun 15, 2023

I think this is intended at this point

They also unlocked it on Windows around driver 528 and locked it again shortly after on laptops, so maybe this is only for desktop GPUs.

@arutar

arutar commented Jun 16, 2023

@kleidiss This is a very important capability. Laptops get very hot and drain the battery quickly. This feature is required!

@machinedgod
Author

I agree. It would be worthwhile checking if anyone diffed the sources and figured out how to make a patch to enable it back. If not, I will attempt that this weekend, because right now - my laptop goes to 80-81C (with a REALLY good cooling pad)... previously I kept the framerate stable at 55C

@notnotme

notnotme commented Jun 17, 2023

I just noticed it no longer worked on my system. After reverting to 525.105.17, everything is good. Thank you.

The laptop is a TongFang GM5MG0O (BIOS N.1.09A08) // i7 10875H // RTX 3070 (BIOS 94.04.3F.00.83)
The OS is Linux Mint 21.1 // kernel 6.3.8-x64v3-xanmod1

@youmukonpaku1337

> I agree. It would be worthwhile checking if anyone diffed the sources and figured out how to make a patch to enable it back. If not, I will attempt that this weekend, because right now - my laptop goes to 80-81C (with a REALLY good cooling pad)... previously I kept the framerate stable at 55C

thatd be super cool of you

@youmukonpaku1337

but yes, 535 isn't working for me either and it's a pain in the arse, since now i have to fuck with drivers to get OpenCL working

@youmukonpaku1337

this feature is essential for me since i can up wattage by like 50 watts from stock 80

@youmukonpaku1337

> I agree. It would be worthwhile checking if anyone diffed the sources and figured out how to make a patch to enable it back. If not, I will attempt that this weekend, because right now - my laptop goes to 80-81C (with a REALLY good cooling pad)... previously I kept the framerate stable at 55C

question is which file would it be in...

@NetUserGet

Same issue for me on a Lenovo T15 gen 2
sudo nvidia-smi -pl 72
Changing power management limit is not supported for GPU: 00000000:01:00.0.
Treating as warning and moving on.
All done.

@Qervas

Qervas commented Aug 25, 2024

Lenovo y9000p, nvidia 4060 mobile. The max of power is 140w, but it gets enforced at 55w. nvidia-smi doesn't work at all for -pl command.

@K1D77A

K1D77A commented Aug 26, 2024

> Lenovo y9000p, nvidia 4060 mobile. The max of power is 140w, but it gets enforced at 55w. nvidia-smi doesn't work at all for -pl command.

Try driver version 525.

@Mistakesos

This is really frustrating, especially since RTX2060 is a Turing architecture. nvidia-powerd.service is not supported for Turing architecture. I can't believe they really did this, it's making it impossible for the barely RTX 20 series to perform.

@xepost

xepost commented Sep 4, 2024

> > Lenovo y9000p, nvidia 4060 mobile. The max of power is 140w, but it gets enforced at 55w. nvidia-smi doesn't work at all for -pl command.
>
> Try driver version 525.

There is no 525 in the repos anymore. Any other suggestion? Installing the official 525 from Nvidia probably won't work either.

Image

@polytect

polytect commented Sep 4, 2024

I wonder if there is any correlation between high temperatures and the GPU performance lock, due to suboptimal automatic power control.

I am using netdata, and I found that one cluster of GPU voltage regulators was at an extreme 95C while the GPU itself was at 70C. This seemed to contribute to the unusual power locking.
When I pointed a custom fan at that area, dropping the voltage regulator temperatures to 70C, it mostly stayed in normal operation.

I can't verify this for sure until I get a proper setup, but netdata helped me find an anomaly which potentially contributed to the GPU being locked in an underperforming state.

EDIT: netdata, not netstat

@K1D77A

K1D77A commented Sep 4, 2024

> > > Lenovo y9000p, nvidia 4060 mobile. The max of power is 140w, but it gets enforced at 55w. nvidia-smi doesn't work at all for -pl command.
> >
> > Try driver version 525.
>
> There is no 525 anymore in the repos, any other suggestion ? Installing the official 525 from Nvidia will prob not gonna work either.
>
> Image

Guess it's time to swap to Gentoo https://packages.gentoo.org/packages/x11-drivers/nvidia-drivers

@Ramen-LadyHKG

However going to 525 means giving up all those wayland patches in the new 555 build

@polytect

polytect commented Sep 6, 2024

What does 525 have that 555 does not?

This is beyond my expertise, but it would be interesting to know whether future versions of the Nvidia driver could allow adjusting power modes on 10, 16, and 20 series GPUs.

Currently my system struggles to decide when to switch to which power mode: sometimes it runs at Level 1 for no reason, and then at Level 3. Temperatures look alright, and so does the power supply.

Maybe it is not proper to share this image, but this is how it works on my 2070 Super MaxQ, using the 555 driver:

Image

@K1D77A

K1D77A commented Sep 7, 2024

> What does 525 have what 555 does not?
>
> For my expertise this is too deep knowledge, but would be interesting know if it is possible to make future versions of Nvidia being able to adjust power modes for 10, 16, 20 Gpus.
>
> Currently my system struggles to understand when to switch what power mode, sometimes it works Level 1, for no reason, and then Level 3. Temperatures looks alright, so as Power supply.
>
> Maybe it is not proper to share this image, but this is how it works on my 2070 Super MaxQ, using 555 driver:
>
> Image

Nvidia simply disabled the feature because the user is rarted apparently.

@polytect

polytect commented Sep 7, 2024

Well, it is like riding a motorbike: if the user ignores the sensors and rides at maximum speed in bad conditions, it will crash and burn.
I don't disagree that it can happen...

Neither option is good now, as the automatic power state management is neither safe, nor stable, nor predictable.

It's like having a Kawasaki H2R which is capable of 400 km/h, but sometimes, even when riding at 200 km/h, it decides to throttle itself to 100 km/h for no reason whatsoever (or there is a reason, but it is not visible), and then suddenly it works normally again, until it doesn't. Imagine the frustration of the rider. I can't give a better example.

I will look more into this; hopefully the code is not too complicated.

Edit:
I realize that it is not safe either. Unless you guys want to install water cooling in a laptop or have a 100 dB server-grade turbo fan.

@K1D77A

K1D77A commented Sep 7, 2024

> Well it is riding motorbike, if the user ignores the sensors rides it on maximum speed in bad conditions, it will crash and burn. I don't disagree that it can happen...
>
> Either option is not good now, as the automatic power state management is nor safe, nor stable, nor predictable.
>
> It's like having Kawasaki H2R which is capable to go 400 km/h, but sometimes even when riding it on 200 km/h it decides to throttle itself to 100 Km/h for no reason what so ever (or there is reason but not visible), and then suddenly it works normally again , until it doesn't. Imagine the frustration of the rider. I can't give better example.
>
> I will look more in to this, hopefully the code is not as complicated.
>
> Edit: I realize that it is not safe either. Unless you guys want to install water cooling in to a laptop or to have 100 db server grade turbo fan.

The GPU is rated for 65w but it's capped at 35w unless I use powerd, which doesn't work anyway, or I use 525. So your analogy is wrong. What Nvidia has done is remove the ability of a user to change their motorbike's sport setting from standard to race mode, meaning they are constantly capped at half the HP!! If a motorcycle manufacturer did this there would be an uproar.

@Qervas

Qervas commented Sep 9, 2024

> > > > Lenovo y9000p, nvidia 4060 mobile. The max of power is 140w, but it gets enforced at 55w. nvidia-smi doesn't work at all for -pl command.
> > >
> > > Try driver version 525.
> >
> > There is no 525 anymore in the repos, any other suggestion ? Installing the official 525 from Nvidia will prob not gonna work either.
> > Image
>
> Guess its time to swap to Gentoo https://packages.gentoo.org/packages/x11-drivers/nvidia-drivers

Succeeded with this,
by enabling nvidia-powerd.service

55w -> 125w

Image

@xepost

xepost commented Sep 13, 2024

@polytect can you explain the steps that you took with netstat?

@Qervas I am already at 150W, but the issue is that even though I am not using it for anything, the fans are running at nearly 100%. Do you have this issue on your side? And does the power limit work on 560?

nvidia-smi -pl xyz

Fri Sep 13 23:39:36 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 ...    Off |   00000000:02:00.0  On |                  N/A |
| N/A   50C    P8             12W /  150W |    1582MiB /  16376MiB |     21%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1639      G   /usr/lib/xorg/Xorg                            602MiB |
|    0   N/A  N/A      2884    C+G   ...libexec/gnome-remote-desktop-daemon        239MiB |
|    0   N/A  N/A      2920      G   /usr/bin/gnome-shell                          253MiB |
|    0   N/A  N/A      4117      G   ...cubations --variations-seed-version        424MiB |
|    0   N/A  N/A      4764      G   gnome-control-center                            3MiB |
+-----------------------------------------------------------------------------------------+

@yuanjv

yuanjv commented Sep 16, 2024

I am a Debian 12 user. The way I max out the power limit is to install nvidia-powerd with apt.

sudo apt install nvidia-powerd -y

And run it as root on boot. (I did this through crontab)

sudo crontab -e
@reboot /usr/sbin/nvidia-powerd

@msmafra

msmafra commented Sep 16, 2024

It is always good to keep in mind that nvidia-powerd / Dynamic Boost can be used only if you have an Ampere or newer card, not a Turing card like the one that started this issue.

https://download.nvidia.com/XFree86/Linux-x86_64/560.35.03/README/dynamicboost.html

@Calculus8303

FWIW, I spent over 6 hours chasing custom power drivers, and I don't really recommend that. This is super annoying, since there shouldn't be a technical reason to prevent you from adjusting TDP within the range already in use. Not only was the functionality removed, but the way the current software ecosystem functions, there are very few practically feasible ways to revert and pin the needed behaviour.

@polytect

polytect commented Oct 1, 2024

@xepost Sorry, my confusion. I used NetData, not Netstat.
I will correct my post to stop the confusion.

This is the NetData local service, with dashboards of system logs:

Image

@K1D77A

K1D77A commented Oct 4, 2024

FWIW I recently discovered that Gentoo ships the powerd daemon as an OpenRC script when you use the 'powerd' USE flag. It's /etc/init.d/nvidia-powerd.

@Interpause

Interpause commented Oct 5, 2024

> FWIW I recently discovered that Gentoo ships the powerd daemon as an OpenRC script when you use the 'powerd' use flag, Its /etc/init.d/nvidia-powerd.

i was the one who reported it lol: https://bugs.gentoo.org/923117

@K1D77A

K1D77A commented Oct 5, 2024

> > FWIW I recently discovered that Gentoo ships the powerd daemon as an OpenRC script when you use the 'powerd' use flag, Its /etc/init.d/nvidia-powerd.
>
> i was the one who reported it lol: https://bugs.gentoo.org/923117

This is exactly how I realized. Thank you, you legend!

@Harmiess-01

Harmiess-01 commented Oct 28, 2024

My Surface Laptop Studio has the same issue on Linux. I did find two workarounds online which work on all distros and could work for other computers. Could others please confirm that these work?

https://github.com/linux-surface/linux-surface/wiki/Surface-Laptop-Studio#nvidia-gpu-locked-at-10w-power-limit

@ThyannSeng

ThyannSeng commented Nov 13, 2024

I was also facing the same issue where nvidia-smi -pl stopped working. My RTX 4080 Laptop GPU was drawing up to 150W by default, which caused a lot of heat and noise. Since power limit adjustments aren’t allowed anymore, here’s what worked for me to cap the power around 80W:

1. Check Available Clock Frequencies: Run this to see your GPU’s supported frequency range:

 nvidia-smi -q -d SUPPORTED_CLOCKS

For me, the range was 210 MHz (min) to 3105 MHz (max).

2. Lock GPU Clocks: I set the clock range to balance performance and efficiency:

 sudo nvidia-smi --lock-gpu-clocks=210,1800

3. Enable Persistence Mode: With the clocks locked and persistence mode enabled, power usage dropped to around 80W, which is sufficient for my LLM workloads (like inference) with no significant performance impact:

 sudo nvidia-smi -pm 1

4. Automate Settings: To apply the GPU clock and persistence mode settings at every boot, I created a systemd service:

Create the Service File: Run the following command to create a new service file:

 sudo nano /etc/systemd/system/nvidia-clock-lock.service

Add the Following Configuration: Paste this code into the file:

 [Unit]
 Description=Set NVIDIA GPU Clock Lock
 After=multi-user.target
 
 [Service]
 Type=oneshot
 ExecStart=/usr/bin/nvidia-smi --lock-gpu-clocks=210,1800
 ExecStartPost=/usr/bin/nvidia-smi -pm 1
 RemainAfterExit=yes
 
 [Install]
 WantedBy=multi-user.target

Save and Enable the Service: Save the file. Then, enable and start the service:

 sudo systemctl enable nvidia-clock-lock.service
 sudo systemctl start nvidia-clock-lock.service
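
Steps 1 and 2 can be combined in a short script that reads the supported graphics clocks and builds the lock command. This is a rough sketch: it assumes each supported graphics clock appears in the `-q -d SUPPORTED_CLOCKS` output as a line of the form `Graphics : <N> MHz`. The sample text and the helper names `graphics_clocks` and `lock_command` are illustrative, and the 1800 MHz ceiling is simply the value used in step 2.

```python
import re


def graphics_clocks(supported_clocks_text: str) -> list[int]:
    """Extract graphics clock values (MHz) from SUPPORTED_CLOCKS output."""
    # Assumed line shape: "Graphics : 3105 MHz"
    found = re.findall(r"Graphics\s*:\s*(\d+)\s*MHz", supported_clocks_text)
    return sorted({int(mhz) for mhz in found})


def lock_command(clocks: list[int], ceiling_mhz: int) -> list[str]:
    """Build the nvidia-smi call that locks clocks between the lowest
    supported value and the highest supported value <= ceiling_mhz."""
    allowed = [c for c in clocks if c <= ceiling_mhz]
    if not allowed:
        raise ValueError("no supported clock at or below the ceiling")
    return ["nvidia-smi", f"--lock-gpu-clocks={clocks[0]},{allowed[-1]}"]


# Hypothetical excerpt, trimmed to a few entries for illustration.
sample = """
    Supported Clocks
        Memory                        : 9001 MHz
            Graphics                  : 3105 MHz
            Graphics                  : 1800 MHz
            Graphics                  : 1785 MHz
            Graphics                  : 210 MHz
"""
cmd = lock_command(graphics_clocks(sample), ceiling_mhz=1800)
print(" ".join(cmd))  # needs root, e.g. subprocess.run(["sudo", *cmd])
```

Snapping the ceiling to a value the GPU actually lists is a design choice of this sketch, not something nvidia-smi is known to require.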

To show the effectiveness of this setup, I’ve included two screenshots:

Idle State: The first screenshot shows the GPU at an idle state, where the power usage is around 3W out of 80W, thanks to the minimum clock setting of 210 MHz. Since I am using a second external screen, the GPU remains active to some extent to support the display.

Active LLM Processing: The second screenshot captures the GPU while actively utilizing power for LLM workloads. Here, the power usage stabilizes at 79/80W, showing that the maximum power limit is being fully utilized without exceeding the cap. This keeps the GPU efficient, cool, and quiet while maintaining strong performance for LLM tasks.

Image
Image

This workaround works great for keeping my GPU at 80W, which reduces heat and fan noise while maintaining solid performance for LLM tasks. Hope this helps! 😊

@Ramen-LadyHKG

Ramen-LadyHKG commented Nov 13, 2024

> sudo nvidia-smi --lock-gpu-clocks=210,1800

sudo nvidia-smi --lock-gpu-clocks=210,1400
[sudo] password for curie: 
Setting locked GPU clocks is not supported for GPU 00000000:02:00.0.
Treating as warning and moving on.
All done.

sadly, Nvidia blocked frequency lock on GTX1060M with proprietary driver

wish I could do that too

@001123

001123 commented Nov 17, 2024

Still happens on Ubuntu 24.10 with a 4060 mobile 😢

@2505552499

> Still happens on Ubuntu 24.10 with a 4060 mobile 😢

Same for me.

@BackgroundPony

Following the instructions at https://download.nvidia.com/XFree86/Linux-x86_64/560.35.03/README/dynamicboost.html

With my 3050 mobile on Ubuntu, running nvidia-settings -q DynamicBoostSupport says that my gpu isn't supported.

Ignoring that, I manually installed the service and started it. Testing again under load, I found it was drawing max power, and that the cap was removed.

Hopefully this helps someone

@David112x

David112x commented Dec 9, 2024

Seems that Windows drivers don't fare any better, power limits are also locked here.

C:\Windows\System32>nvidia-smi -pl 30
Changing power management limit is not supported for GPU: 00000000:01:00.0.
Treating as warning and moving on.
All done.

Context: I'm trying to lower the power limit while my laptop (with a GTX 1650) runs on the battery, default power limit while on battery is 30W, but I wanted to lower it by half and was unable to.

@DistortedDragon1o4

Still an issue with the nvidia driver 565.77 on Arch Linux. The power limit is set to 35W with dynamic boost to 50W, although my device supports a max GPU TGP of 65W.
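
To see how far the enforced limit sits below what the board reports, the limits can be queried in CSV form. A minimal sketch, assuming the `power.limit`, `enforced.power.limit`, `power.min_limit` and `power.max_limit` query fields are available in your nvidia-smi build (check `nvidia-smi --help-query-gpu`) and that unsupported values are printed in square brackets such as `[N/A]`. The helper names are hypothetical and the sample row is made up from the numbers in this comment.

```python
import subprocess

FIELDS = ["power.limit", "enforced.power.limit", "power.min_limit", "power.max_limit"]
QUERY = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"]


def parse_limits(csv_row: str) -> dict:
    """Map each queried field to a float in watts, or None if unsupported."""
    values = [cell.strip() for cell in csv_row.split(",")]
    limits = {}
    for field, cell in zip(FIELDS, values):
        try:
            limits[field] = float(cell)
        except ValueError:  # e.g. "[N/A]" or "[Not Supported]"
            limits[field] = None
    return limits


def headroom_watts(limits: dict):
    """Watts between the enforced limit and the reported maximum, if known."""
    enforced = limits["enforced.power.limit"]
    maximum = limits["power.max_limit"]
    if enforced is None or maximum is None:
        return None
    return maximum - enforced


def read_limits() -> dict:
    """Query the first GPU; requires nvidia-smi on PATH (not called here)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_limits(out.splitlines()[0])


# Hypothetical row: 35 W limit, 65 W maximum, minimum not reported.
row = "35.00, 35.00, [N/A], 65.00"
print(headroom_watts(parse_limits(row)))
```

On a real machine, `headroom_watts(read_limits())` would report the gap for the first GPU, or `None` when the driver does not expose the values.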

@Harmiess-01

Harmiess-01 commented Jan 23, 2025

> My Surface Laptop Studio has the same issue on Linux. I did find two work arounds online which work on all distros and could work for other computers. Could others please confirm that these work?
>
> https://github.com/linux-surface/linux-surface/wiki/Surface-Laptop-Studio#nvidia-gpu-locked-at-10w-power-limit

The 525 drivers don't have this issue and seem to work fine for me without any issue (including games which "require" newer drivers). For Arch and derived distros, they can be installed from the AUR. The only notable bug in these drivers is that the GPU can only output up to your monitor's refresh rate. They can be installed on other distros, but it requires more work and may have more issues.

I tried the instructions to install the 525 drivers on a non-Arch distro and it took more work. You need the kernel headers, dkms (optional, for 3rd party kernels), and have to reboot into mode 3. You will also need to patch the drivers for kernel updates (included in the download but may need to be reinstalled with newer patches over time). This worked on Fedora, but my monitor was capped at 60hz (it normally supports 120hz). For both of these reasons, I just switched back to Arch based.

I previously tried disabling D3 mode and that worked as well, but the GPU has to always run and the power is still capped a bit lower than it should be. Still a lot better than normal, but not as good as the 525 drivers in my opinion.
