Not iterating through all GPUs in system (3) #7

Open
Delitants opened this issue Jul 21, 2020 · 30 comments

@Delitants

The script does not iterate through all GPUs in the system (3 of them): it gets stuck at the first one in the list and hangs forever. The GPUs are a 1070, a 750 Ti and a 950.

@Delitants
Author

Delitants commented Jul 21, 2020

/usr/bin/coolgpus --speed 60 60
No existing X servers, we're good to go
Starting xserver: Xorg :0 -once -config /tmp/cool-gpu-00000000:01:00.0rr5gi2u3/xorg.conf
Starting xserver: Xorg :1 -once -config /tmp/cool-gpu-00000000:05:00.0f1ewmqu3/xorg.conf
Starting xserver: Xorg :2 -once -config /tmp/cool-gpu-00000000:09:00.0v06tfhl_/xorg.conf

X.Org X Server 1.19.5
Release Date: 2017-10-12
X Protocol Version 11, Revision 0
Build Operating System:  3.10.0-693.17.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 29 October 2018  03:33:19PM
Build ID: xorg-x11-server 1.19.5-5.1.el7_5.0.1
Current version of pixman: 0.34.0
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Tue Jul 21 00:55:38 2020
(++) Using config file: "/tmp/cool-gpu-00000000:01:00.0rr5gi2u3/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"

X.Org X Server 1.19.5
Release Date: 2017-10-12
X Protocol Version 11, Revision 0
Build Operating System:  3.10.0-693.17.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 29 October 2018  03:33:19PM
Build ID: xorg-x11-server 1.19.5-5.1.el7_5.0.1
Current version of pixman: 0.34.0
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.1.log", Time: Tue Jul 21 00:55:38 2020
(++) Using config file: "/tmp/cool-gpu-00000000:05:00.0f1ewmqu3/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"

X.Org X Server 1.19.5
Release Date: 2017-10-12
X Protocol Version 11, Revision 0
Build Operating System:  3.10.0-693.17.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 29 October 2018  03:33:19PM
Build ID: xorg-x11-server 1.19.5-5.1.el7_5.0.1
Current version of pixman: 0.34.0
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.2.log", Time: Tue Jul 21 00:55:38 2020
(++) Using config file: "/tmp/cool-gpu-00000000:09:00.0v06tfhl_/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
GPU :0, 58C -> [60%-60%]. Setting speed to 60%

_**hung**_
^C
Released fan speed control for GPU at :0
_**hung**_
nvidia-smi
Tue Jul 21 01:12:22 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0 Off |                  N/A |
| 60%   53C    P2    69W / 151W |   2997MiB /  8119MiB |     23%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 750 Ti  Off  | 00000000:05:00.0 Off |                  N/A |
| 48%   56C    P0    30W /  38W |   1229MiB /  2002MiB |     83%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 950     Off  | 00000000:09:00.0 Off |                  N/A |
| 32%   71C    P0    45W /  75W |    974MiB /  2002MiB |     51%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

@andyljones
Owner

andyljones commented Jul 21, 2020

There's some troubleshooting advice on the main page; mind stepping through it?

If you're uncomfortable with pdb, I think a good alternative for finding where things hang would be to add print(command) above this line and see which command is making things hang. Then try to run that command manually.
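A minimal sketch of the print-before-run idea, for readers who want something concrete. The helper name `run_logged` is hypothetical and is not coolgpus's own function; the stand-in command is a harmless Python one-liner, where the real machine would use an nvidia-settings or nvidia-smi argument list.

```python
import subprocess
import sys


def run_logged(command):
    """Print a command before running it, so a hang shows which one is stuck."""
    # flush=True makes the line appear even if the next call never returns.
    print(command, flush=True)
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode, result.stdout


# Harmless stand-in command used only for illustration.
code, output = run_logged([sys.executable, "-c", "print('ok')"])
```

The last command printed before the terminal goes quiet is the one to try by hand.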

@Delitants
Author

Delitants commented Jul 21, 2020

> There's some troubleshooting advice on the main page; mind stepping through it?
>
> If you're uncomfortable with pdb, I think a good alternative for finding where things hang would be to add print(command) above this line and see which command is making things hang. Then try to run that command manually.

I didn't understand what pdb is supposed to do; I saw no changes.

I added the print:

(==) Log file: "/var/log/Xorg.1.log", Time: Tue Jul 21 01:40:49 2020
(++) Using config file: "/tmp/cool-gpu-00000000:05:00.0prjzqd_m/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':0']
['nvidia-settings', '-a', '[fan:0]/GPUTargetFanSpeed=60', '-c', ':0']
GPU :0, 60C -> [60%-60%]. Setting speed to 60%
['nvidia-smi', '--format=csv,noheader', '--query-gpu=temperature.gpu', '-i', '00000000:05:00.0']
['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':1']
['nvidia-settings', '-a', '[fan:0]/GPUTargetFanSpeed=60', '-c', ':1']

this is it

@andyljones
Owner

Try running

nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1

in a terminal.

Adding .set_trace() somewhere should drop you in an interactive debugger prompt when you run the program.
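As a rough illustration of where such a call might go: the function `build_fan_command` and its `debug` flag are hypothetical and not part of coolgpus; only `pdb.set_trace()` is the standard-library call.

```python
import pdb


def build_fan_command(display, speed, debug=False):
    """Build an nvidia-settings argument list like the ones printed above."""
    command = ['nvidia-settings', '-a',
               '[fan:0]/GPUTargetFanSpeed=%d' % speed, '-c', display]
    if debug:
        # Execution pauses here with a (Pdb) prompt: `p command` prints the
        # variable, `n` steps to the next line, `c` continues, `q` quits.
        pdb.set_trace()
    return command


# With debug left off, the function just returns the argument list.
print(build_fan_command(':1', 60))
```

At the prompt, `c` resumes the program. As far as I know, `run` is meant for programs started with `python -m pdb`; under a plain `set_trace()` it surfaces as a `pdb.Restart` exception instead.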

@Delitants
Author

nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1

ERROR: Unable to find display on any available system

ERROR: Unable to find display on any available system

The first 2 GPUs have no physical displays connected; the 3rd one does. However, even without a physical display, the problem is the same.

@Delitants
Author

I'd like to mention that I was able to set all fans in a few random attempts. I don't even remember what I was doing: I rebooted a couple of times, reinstalled the driver and replugged the physical display, and the script passed a few times. But after closing it and running it again, it hangs.

@Delitants
Author

Delitants commented Jul 21, 2020

I don't get this...

/usr/bin/coolgpus --speed 60 60
> /usr/bin/coolgpus(11)<module>()
-> parser = argparse.ArgumentParser(description=r'''
(Pdb)
(Pdb)
(Pdb)
(Pdb) help
(Pdb) run
Traceback (most recent call last):
  File "/usr/bin/coolgpus", line 11, in <module>
    parser = argparse.ArgumentParser(description=r'''
  File "/usr/bin/coolgpus", line 11, in <module>
    parser = argparse.ArgumentParser(description=r'''
  File "/usr/lib64/python3.6/bdb.py", line 51, in trace_dispatch
    return self.dispatch_line(frame)
  File "/usr/lib64/python3.6/bdb.py", line 69, in dispatch_line
    self.user_line(frame)
  File "/usr/lib64/python3.6/pdb.py", line 261, in user_line
    self.interaction(frame, None)
  File "/usr/lib64/python3.6/pdb.py", line 352, in interaction
    self._cmdloop()
  File "/usr/lib64/python3.6/pdb.py", line 321, in _cmdloop
    self.cmdloop()
  File "/usr/lib64/python3.6/cmd.py", line 138, in cmdloop
    stop = self.onecmd(line)
  File "/usr/lib64/python3.6/pdb.py", line 418, in onecmd
    return cmd.Cmd.onecmd(self, line)
  File "/usr/lib64/python3.6/cmd.py", line 217, in onecmd
    return func(arg)
  File "/usr/lib64/python3.6/pdb.py", line 1028, in do_run
    raise Restart
pdb.Restart

@andyljones
Owner

coolgpus will not work on any system with a display or any system that's expecting a display. You'll need to remove the display, restart, SSH in, and toy around until

nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1

works for -c :0, -c :1, -c :2. Googling the error message has some possible directions. But this is, I'm afraid, much more about debugging your system setup than it is about debugging the script.
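A sketch of how that check could be run across the three displays without getting stuck on a hang. The helper names `probe_display` and `probe_all` are made up for this example, and treating exit code 0 as success is an assumption: nvidia-settings may print an error and still exit cleanly, so read its output as well.

```python
import subprocess


def probe_display(display, timeout=10, run=subprocess.run):
    """Try the fan-speed assignment on one X display and classify the outcome."""
    command = ['nvidia-settings', '-a',
               '[fan:0]/GPUTargetFanSpeed=60', '-c', display]
    try:
        result = run(command, stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child itself when the timeout expires.
        return 'hang'
    except FileNotFoundError:
        return 'nvidia-settings not found'
    # Assumption: a zero exit code means the assignment was accepted.
    return 'ok' if result.returncode == 0 else 'failed'


def probe_all(displays=(':0', ':1', ':2'), **kwargs):
    """Probe each display in turn and return a display -> status mapping."""
    return {display: probe_display(display, **kwargs) for display in displays}
```

Calling `probe_all()` while the X servers are up should show which display, if any, is the one that hangs.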

@andyljones
Owner

Take a look at the pdb docs. No longer useful for this problem, but overall one of the most useful tools in Python programming. Especially the .pm() bit.

@Delitants
Author

I did not have a display connected initially, it makes no difference. I've connected it at the last attempt to see what changes. Well, nothing.

@Delitants
Author

nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1

ERROR: Unable to find display on any available system

ERROR: Unable to find display on any available system

[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :0

ERROR: Unable to find display on any available system

ERROR: Unable to find display on any available system

[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :2

ERROR: Unable to find display on any available system

ERROR: Unable to find display on any available system

[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :3

ERROR: Unable to find display on any available system

ERROR: Unable to find display on any available system

[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :4

ERROR: Unable to find display on any available system

ERROR: Unable to find display on any available system

@Delitants
Author

The system is a plain CentOS 7 with the NVIDIA driver for headless transcoding; there is nothing custom to debug on it.

@Delitants
Author

"Unable to find display on any available system"
Why would it find one if I removed the physical display as you suggested? It doesn't make any sense.

@andyljones
Owner

andyljones commented Jul 21, 2020

Not to discourage you too much but: it feels like you're hoping that I have more knowledge about this than I do. Your system absolutely has something to debug, as you can tell by the way a thing you want to do isn't working as you'd expect it to.

If you want to push forward with this, a general loop should be:

  • Take anything you know about the problem you're seeing (ie, ERROR: Unable to find display on any available system) and Google until you find people with similar problems.
  • Try out their fixes.
  • If their fixes don't work, think about how your case differs from their case, or what you can do to more accurately isolate the problem you're seeing.
  • Go back to doing more Googling.

It's hard! This might take hours or days! You might have to learn huge amounts about subjects that are totally irrelevant, just to check one possible fix! It probably won't be worth it! But, frankly: the only other choice is to give up and decide you don't care that much about coolgpus's functionality.

@Delitants
Author

"Unable to find display on any available system" means literally what it says: no displays are attached, either physical or virtual. Doesn't your script create virtual displays to set fan speeds? Apparently it does, because while your script is running in another SSH window I'm able to run "nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :2", although it produces no output. So I don't understand what exactly has to be fixed about the "Unable to find display on any available system" message, or why. It is expected and not related to the issue described.

@andyljones
Owner

andyljones commented Jul 21, 2020

OK, in that case use pdb, use print statements, and figure out where the script is actually hanging. Then make sure you can replicate it yourself, and Google around to figure out what's causing the hang.

You may need to replicate the xserver setup the script is doing to replicate the hanging. You might want to tear the xserver setup bit of coolgpus out into your own script, then run that and leave it running in the background while you experiment. There're lots of ways forward, just requires a bit of ingenuity!
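One way the "own script" idea might look, as a sketch only. The argument list mirrors the "Starting xserver" lines in the log earlier in this thread; the helper names are invented here, and the config path is a placeholder, since the contents of the xorg.conf that coolgpus generates are not shown in this thread.

```python
import subprocess


def xorg_command(display, config_path):
    """Argument list matching the 'Starting xserver' lines in the log."""
    return ['Xorg', display, '-once', '-config', config_path]


def start_xserver(display, config_path):
    """Launch one X server in the background and return its Popen handle."""
    return subprocess.Popen(xorg_command(display, config_path))


def stop_xserver(proc, grace=10):
    """Ask the server to exit, and force-kill it if it does not."""
    proc.terminate()
    try:
        proc.wait(grace)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()


# Placeholder path: supply a real xorg.conf for the GPU under test.
print(xorg_command(':1', '/tmp/my-test/xorg.conf'))
```

With a server started this way and left running, nvidia-settings commands against that display can be tried by hand from a second terminal.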

@Delitants
Author

Delitants commented Jul 21, 2020

I already added the print and posted the output earlier; it hangs on:

['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':0']
['nvidia-settings', '-a', '[fan:0]/GPUTargetFanSpeed=60', '-c', ':0']
GPU :0, 60C -> [60%-60%]. Setting speed to 60%
['nvidia-smi', '--format=csv,noheader', '--query-gpu=temperature.gpu', '-i', '00000000:05:00.0']
['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':1']
^C
Google is useless here; I spent 5 hours on it today before writing here.

I always have to kill Xorg like this, because it never exits:
killall Xorg -9

Reinstalling the Xorg server does not help.

@Delitants
Author

OK, let's make it easier. I only need to adjust the last GPU in the list. How can I select just one GPU with this script and skip the others?

@andyljones
Owner

Right! That's the spirit.

The answer is: there's no built in method. Try cloning this repo and editing the script yourself; add a conditional somewhere to only look at specific GPUs.
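A sketch of what such a conditional might look like. The function `select_buses` and its `wanted` parameter are hypothetical; the bus IDs are the ones from the log at the top of this thread.

```python
def select_buses(all_buses, wanted=None):
    """Keep only the PCI bus IDs listed in `wanted`; keep all if it is empty."""
    if not wanted:
        return list(all_buses)
    unknown = sorted(set(wanted) - set(all_buses))
    if unknown:
        raise ValueError('Unknown GPU bus IDs: ' + ', '.join(unknown))
    # Preserve the original ordering of the detected buses.
    return [bus for bus in all_buses if bus in wanted]


buses = ['00000000:01:00.0', '00000000:05:00.0', '00000000:09:00.0']
print(select_buses(buses, ['00000000:09:00.0']))
```

Applied to whatever list of buses the script detects, this would leave the other GPUs untouched.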

More generally, you know where the script hangs but you haven't isolated the aberrant behaviour. You want to be able to enter a series of commands into the terminal and get the same hang. Then you can experiment freely with that series of commands, try different versions, add --verbose flags, etc etc etc.

@v-iashin
Contributor

I think I get the same error when running nvidia-settings from terminal

$ nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :0
Unable to init server: Could not connect: Connection refused
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system

At the same time, I am able to run coolgpus from a conda environment. @Neolo, can you try to run the script from this environment?

name: coolgpus
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - ca-certificates=2020.1.1=0
  - certifi=2019.11.28=py38_0
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.1.0=hdf63c60_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - ncurses=6.2=he6710b0_0
  - openssl=1.1.1d=h7b6447c_4
  - pip=20.0.2=py38_1
  - python=3.8.1=h0371630_1
  - readline=7.0=h7b6447c_5
  - setuptools=45.2.0=py38_0
  - sqlite=3.31.1=h7b6447c_0
  - tk=8.6.8=hbc83047_0
  - wheel=0.34.2=py38_0
  - xz=5.2.4=h14c3975_4
  - zlib=1.2.11=h7b6447c_3
  - pip:
    - coolgpus==0.17

@andyljones
Owner

andyljones commented Jul 21, 2020

Yep, I think @Neolo is right about the ERROR being a symptom of the missing xserver env. Still expect that command to be the source of the hang since it was the last command printed, just gonna need more work to get a manual reproduction.

I'll be surprised if the env is causing the hang, but it's a good idea since it's an easy thing to check.

@Delitants
Author

So weird. I just removed contents of /etc/X11/ and ran

nvidia-xconfig --allow-empty-initial-configuration --enable-all-gpus --cool-bits=28 --separate-x-screens --enable-all-gpus --use-display-device=none

Using X configuration file: "/etc/X11/xorg.conf".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0 (1)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0 (2)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0 (3)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1 (1)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1 (2)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1 (3)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen2".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen2 (1)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen2 (2)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen2 (3)".
Backed up file '/etc/X11/xorg.conf' as '/etc/X11/xorg.conf.backup'
New X configuration file written to '/etc/X11/xorg.conf'

Right after that, one time only, I was able to set fan speeds on the 1st AND 2nd GPUs, but the 3rd hung.
I closed the script and ran it again: same problem, only the 1st GPU works now, again...

@Delitants
Author

> At the same time, I am able to run coolgpus from a conda environment. @Neolo can you try to run the script from this environment

I'm not into Python; tell me what to run, I didn't get it.

@v-iashin
Contributor

> I'm not into python, tell me what to run, I didn't get it.

  1. https://conda.io/projects/conda/en/latest/user-guide/install/linux.html (please follow the instructions for Miniconda)
  2. Verify installation (conda --help)
  3. Save the environment I shared with you into env.yml
  4. conda env create -f env.yml -- it should install the environment to your machine
  5. conda activate coolgpus -- activating the virtual env (the name comes from the "name:" field of env.yml)
  6. Run some examples from the README.md to see if it hangs in the same way.

@Delitants
Author

Delitants commented Jul 21, 2020

Will try that environment tomorrow.

> The answer is: there's no built in method. Try cloning this repo and editing the script yourself; add a conditional somewhere to only look at specific GPUs.

For now, I just used a dirty trick to select the last GPU in the list, which is "burning" right now at 72 C.

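def gpu_buses():
    # Original: query every GPU's PCI bus ID via nvidia-smi.
    # return log_output(['nvidia-smi', '--format=csv,noheader', '--query-gpu=pci.bus_id']).splitlines()
    # Dirty trick: hard-code the bus ID of the last GPU only.
    return ['00000000:09:00.0']

and it sets the speed fine, with no hangs.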

@Delitants
Author

Delitants commented Jul 22, 2020

> I'm not into python, tell me what to run, I didn't get it.
>
>   1. https://conda.io/projects/conda/en/latest/user-guide/install/linux.html (please follow the instructions for Miniconda)
>   2. Verify installation (conda --help)
>   3. Save the environment I shared with you into env.yml
>   4. conda env create -f env.yml -- it should install the environment to your machine
>   5. conda activate coolgpus -- activating the virtual env
>   6. Run some examples from the README.md to see if it hangs in the same way.

I installed Miniconda and activated the env, but running "$(which coolgpus) --temp 60 60" doesn't do anything at all; it doesn't even set the first GPU.


[root@nvidia-2 ~]# conda env create -f env.yml
[root@nvidia-2 ~]# conda activate coolgpus
(coolgpus) [root@nvidia-2 ~]# conda -V
conda 4.8.3
(coolgpus) [root@nvidia-2 ~]# $(which coolgpus) --temp 60 60
No existing X servers, we're good to go
Starting xserver: Xorg :0 -once -config /tmp/cool-gpu-00000000:01:00.0qa_grbj8/xorg.conf
Starting xserver: Xorg :1 -once -config /tmp/cool-gpu-00000000:05:00.089yczaej/xorg.conf
Starting xserver: Xorg :2 -once -config /tmp/cool-gpu-00000000:09:00.0khgo7uqs/xorg.conf

X.Org X Server 1.19.3
Release Date: 2017-03-15
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-514.16.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 05 August 2017 06:19:43AM
Build ID: xorg-x11-server 1.19.3-11.el7
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Wed Jul 22 19:06:50 2020
(++) Using config file: "/tmp/cool-gpu-00000000:01:00.0qa_grbj8/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"

X.Org X Server 1.19.3
Release Date: 2017-03-15
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-514.16.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 05 August 2017 06:19:43AM
Build ID: xorg-x11-server 1.19.3-11.el7
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.2.log", Time: Wed Jul 22 19:06:50 2020
(++) Using config file: "/tmp/cool-gpu-00000000:09:00.0khgo7uqs/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"

X.Org X Server 1.19.3
Release Date: 2017-03-15
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-514.16.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 05 August 2017 06:19:43AM
Build ID: xorg-x11-server 1.19.3-11.el7
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.1.log", Time: Wed Jul 22 19:06:50 2020
(++) Using config file: "/tmp/cool-gpu-00000000:05:00.089yczaej/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
Released fan speed control for GPU at :0
Released fan speed control for GPU at :1

^C

@Gimel12

Gimel12 commented Oct 1, 2020

I am having the same issue and nobody can solve it. What I do is create a custom xorg file that works for me; then, in the NVIDIA app on Ubuntu, I can change the fan settings of each GPU under PowerMizer. This is not useful when working over SSH on a headless machine, but unfortunately there is no solution anywhere for an easy-to-use nvidia-settings.

So basically this is what worked for me:

How to change GPU fan speeds in Ubuntu

1- In the applications, open NVIDIA X Server Settings

2- Select the GPU currently used for display output (should be the GPU in first PCIe slot)

3- Take note of the Bus ID

4- Run the following commands

sudo nvidia-xconfig --enable-all-gpus
sudo nvidia-xconfig --cool-bits=28
sudo reboot

5- After the computer reboots, plug the monitor into the last GPU

6- Open NVIDIA X-Server Settings again

7- Select the GPU currently used for display output

8- Take note of the Bus ID

9- Run sudo nano /etc/X11/xorg.conf. The GPUs will be listed in “Device” sections with formatting similar to this:

Section "Device"
    Identifier "name"
    Driver "driver"
    entries…
EndSection

10- Identify the GPUs with the Bus IDs that were previously noted

11- Swap the Bus IDs of the two GPUs

12- Press Ctrl+X to close “xorg.conf”

13- Press Y to save the file

14- Press “Enter” without changing the file name

15- Reboot

Fan speeds can now be changed from NVIDIA X Server Settings by selecting the Thermal Settings for each GPU and checking the option to “Enable GPU Fan Settings”.
Set the fan speed with the slider and click “Apply” to save it.

@Delitants
Author

Newer version: it works even worse.

`(==) Log file: "/var/log/Xorg.2.log", Time: Mon Jan 25 22:02:05 2021
(++) Using config file: "/tmp/cool-gpu-00000000:09:00.0ngck_2l3/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
GPU :0, 66C -> [60%-65%]. Setting speed to 60%
GPU :1, 36C -> [30%-30%]. Setting speed to 30%

Command timed out: nvidia-settings -a [gpu:0]/GPUFanControlState=1 -c :1

Released fan speed control for GPU at :0
Command timed out: nvidia-settings -a [gpu:0]/GPUFanControlState=0 -c :1

Terminating xserver for display :0
Terminating xserver for display :1
Terminating xserver for display :2
Traceback (most recent call last):
File "/usr/bin/coolgpus", line 89, in log_output
p.wait(60)
File "/usr/lib64/python3.6/subprocess.py", line 1469, in wait
raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':1']' timed out after 60 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/bin/coolgpus", line 239, in manage_fans
set_speed(display, s)
File "/usr/bin/coolgpus", line 224, in set_speed
assign(display, '[gpu:0]/GPUFanControlState=1')
File "/usr/bin/coolgpus", line 221, in assign
log_output(['nvidia-settings', '-a', command, '-c', display])
File "/usr/bin/coolgpus", line 102, in log_output
raise ValueError('Command crashed with return code ' + str(p.returncode) + ': ' + ' '.join(command))
ValueError: Command crashed with return code None: nvidia-settings -a [gpu:0]/GPUFanControlState=1 -c :1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/bin/coolgpus", line 89, in log_output
p.wait(60)
File "/usr/lib64/python3.6/subprocess.py", line 1469, in wait
raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=0', '-c', ':1']' timed out after 60 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/bin/coolgpus", line 266, in <module>
run()
File "/usr/bin/coolgpus", line 263, in run
manage_fans(displays)
File "/usr/bin/coolgpus", line 246, in manage_fans
assign(display, '[gpu:0]/GPUFanControlState=0')
File "/usr/bin/coolgpus", line 221, in assign
log_output(['nvidia-settings', '-a', command, '-c', display])
File "/usr/bin/coolgpus", line 102, in log_output
raise ValueError('Command crashed with return code ' + str(p.returncode) + ': ' + ' '.join(command))
ValueError: Command crashed with return code None: nvidia-settings -a [gpu:0]/GPUFanControlState=0 -c :1`
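The "return code None" in the traceback above is consistent with the child process still running when the error was raised: `Popen.returncode` stays `None` until the process has exited and been waited on. Below is a sketch of the kill-then-reap pattern described in the Python subprocess documentation. The helper name `run_with_timeout` is hypothetical, and this is not a patch against coolgpus's `log_output`, whose source is not shown in this thread.

```python
import subprocess
import sys


def run_with_timeout(command, timeout):
    """Run a command; on timeout, kill it and reap it before reporting."""
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    try:
        output, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                     # the child is still running here
        output, _ = proc.communicate()  # reap it so returncode is populated
        return 'timeout', proc.returncode, output
    status = 'ok' if proc.returncode == 0 else 'failed'
    return status, proc.returncode, output


# Stand-in for a hanging nvidia-settings call: a child that sleeps too long.
status, code, _ = run_with_timeout(
    [sys.executable, '-c', 'import time; time.sleep(30)'], timeout=1)
print(status, code)
```

Reporting the timeout as its own status, rather than as a crash, also avoids the chain of nested exceptions seen above.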

@Delitants
Author

Delitants commented Jan 26, 2021

Jan 25 22:17:24 nvidia-2 coolgpus: File "/usr/bin/coolgpus", line 266, in <module>
Jan 25 22:17:24 nvidia-2 coolgpus: run()
Jan 25 22:17:24 nvidia-2 coolgpus: File "/usr/bin/coolgpus", line 259, in run
Jan 25 22:17:24 nvidia-2 coolgpus: with xservers(buses) as displays:
Jan 25 22:17:24 nvidia-2 coolgpus: File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
Jan 25 22:17:24 nvidia-2 coolgpus: return next(self.gen)
Jan 25 22:17:24 nvidia-2 coolgpus: File "/usr/bin/coolgpus", line 172, in xservers
Jan 25 22:17:24 nvidia-2 coolgpus: kill_xservers()
Jan 25 22:17:24 nvidia-2 coolgpus: File "/usr/bin/coolgpus", line 161, in kill_xservers
Jan 25 22:17:24 nvidia-2 coolgpus: raise IOError('Failed to kill existing X servers. Try killing them yourself before running this script')
Jan 25 22:17:24 nvidia-2 coolgpus: OSError: Failed to kill existing X servers. Try killing them yourself before running this script
Jan 25 22:17:24 nvidia-2 systemd: coolgpus.service: main process exited, code=exited, status=1/FAILURE
Jan 25 22:17:24 nvidia-2 kill: Usage:
Jan 25 22:17:24 nvidia-2 systemd: coolgpus.service: control process exited, code=exited status=1
Jan 25 22:17:24 nvidia-2 kill: kill [options] <pid|name> [...]
Jan 25 22:17:24 nvidia-2 kill: Options:
Jan 25 22:17:24 nvidia-2 kill: -a, --all do not restrict the name-to-pid conversion to processes
Jan 25 22:17:24 nvidia-2 kill: with the same uid as the present process
Jan 25 22:17:24 nvidia-2 kill: -s, --signal send specified signal
Jan 25 22:17:24 nvidia-2 kill: -q, --queue use sigqueue(2) rather than kill(2)
Jan 25 22:17:24 nvidia-2 kill: -p, --pid print pids without signaling them
Jan 25 22:17:24 nvidia-2 kill: -l, --list [=] list signal names, or convert one to a name
Jan 25 22:17:24 nvidia-2 kill: -L, --table list signal names and numbers
Jan 25 22:17:24 nvidia-2 kill: -h, --help display this help and exit
Jan 25 22:17:24 nvidia-2 kill: -V, --version output version information and exit
Jan 25 22:17:24 nvidia-2 kill: For more details see kill(1).
Jan 25 22:17:24 nvidia-2 systemd: Unit coolgpus.service entered failed state.
Jan 25 22:17:24 nvidia-2 systemd: coolgpus.service failed.
Jan 25 22:17:29 nvidia-2 systemd: coolgpus.service holdoff time over, scheduling restart.
Jan 25 22:17:29 nvidia-2 systemd: Starting Headless GPU Fan Control...
Jan 25 22:17:39 nvidia-2 systemd: Started Headless GPU Fan Control.

For God's sake... ridiculous.
Open /usr/bin/coolgpus and add at the top:

import subprocess

Then, at the top of the function kill_xservers(), add:

subprocess.run(["killall", "-9", "Xorg"])
return

and ditch the rest of the function. Solved.
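One detail worth spelling out for anyone adapting a killall call through subprocess: the list form passes each element to the program as a single argument, so the signal option and the process name need separate entries. A small illustration (the variable names are just for this sketch, and nothing here actually runs killall):

```python
import shlex

# Each argv element reaches the program as one argument, spaces included.
wrong = ["killall", "Xorg -9"]     # one argument: a process literally named "Xorg -9"
right = ["killall", "-9", "Xorg"]  # option and process name as separate arguments

# shlex.split turns a shell-style string into a properly separated list.
assert shlex.split("killall -9 Xorg") == right
assert shlex.split("killall Xorg -9") == ["killall", "Xorg", "-9"]
print(right)
```

Passing `right` to `subprocess.run` is then equivalent to typing the command in a shell.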

@ghost

ghost commented Apr 8, 2021

For anyone still experiencing this issue, I have slapped together a bash script which at least allows setting a fixed fan speed for all GPUs in the system, regardless of whether a monitor is attached. It supports amdgpu too: https://github.com/lavanoid/Linux_GPU_Fan_Control
