On Linux, when NVIDIA GPUs are used for scientific computation or cryptocurrency mining, it is common to find that the fans do not spin fast enough to keep the core temperature down. We found two ways to lower the GPU temperature, and we discuss both in this article.
For example, with a room temperature of 29 °C, running EWBF's CUDA Zcash miner for 5 minutes produces high temperatures like these:
INFO: Detected new work: 1514422114
CUDA: Device: 0 GeForce GTX 1060 6GB, 6072 MB i:64
CUDA: Device: 0 Selected solver: 0
Temp: GPU0: 75C GPU0: 297 Sol/s Total speed: 297 Sol/s
INFO: Detected new work: 1514422115
Temp: GPU0: 78C GPU0: 297 Sol/s Total speed: 297 Sol/s
INFO: Detected new work: 1514422116
Temp: GPU0: 81C GPU0: 293 Sol/s Total speed: 293 Sol/s
WARNING: GPU0 Temperature limit are reached, gpu will be stopped.
As you can see, GPU0 reached 81C and the miner stopped computing.
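The miner enforces its own temperature limit. If you want to watch the temperatures yourself while trying the two solutions, a small filter over `nvidia-smi` output can help. This is a minimal sketch. `hot_gpus` is a hypothetical helper name, and the 80 °C default is an assumed threshold, not the miner's actual limit. The filter expects the CSV that `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits` prints.

```shell
#!/bin/sh
# hot_gpus is a hypothetical helper, not part of nvidia-smi or the miner.
# Input: "index, temperature" lines, e.g. from
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits
# Prints every GPU whose temperature is at or above LIMIT (assumed default: 80).
hot_gpus() {
  local limit="${1:-80}" index temp
  # IFS=', ' splits a line such as "0, 81" into index=0 and temp=81
  while IFS=', ' read -r index temp; do
    case "$temp" in
      ''|*[!0-9]*) continue ;;  # skip blank lines and non-numeric values such as "[N/A]"
    esac
    if [ "$temp" -ge "$limit" ]; then
      echo "GPU${index}: ${temp}C"
    fi
  done
}
```

For example, `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | hot_gpus 78` lists every GPU at or above 78 °C and prints nothing when all of them are cooler.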
Solution 1: Limit power usage (very useful for mining)
Lowering the power limit is common practice in mining, because it has little impact on mining performance. The following commands lower the power limit to 85 W. Both commands need root privileges. Make sure you pick a limit suitable for your card; ours was a GTX 1060.
nvidia-smi -pm 1    # enable persistence mode
nvidia-smi -pl 85   # set the power limit to 85 W
The first command enables persistence mode, which keeps the NVIDIA driver loaded even when no application is using the GPU. After lowering the limit, we managed to mine at 72C.
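The valid power limits vary from card to card, so it can help to read the allowed range before setting one. The following sketch queries `power.min_limit` and `power.max_limit` with `nvidia-smi`, clamps the requested value into that range, and then runs the same two commands as above. `ceil_watts`, `clamp_watts` and `set_power_limit` are hypothetical helper names. The sketch assumes you target one GPU index at a time and run it as root.

```shell
#!/bin/sh
# Hypothetical helpers (not part of nvidia-smi). Shell arithmetic is integer-only,
# so wattages such as "62.50" are converted to whole watts first.

# ceil_watts VALUE -> VALUE rounded up to a whole watt ("62.50" -> 63)
ceil_watts() {
  local whole="${1%%.*}"
  case "$1" in
    *.*[1-9]*) echo $((whole + 1)) ;;  # non-zero fractional part: round up
    *) echo "$whole" ;;
  esac
}

# clamp_watts REQUESTED MIN MAX -> REQUESTED limited to [MIN, MAX] (integers)
clamp_watts() {
  if [ "$1" -lt "$2" ]; then
    echo "$2"
  elif [ "$1" -gt "$3" ]; then
    echo "$3"
  else
    echo "$1"
  fi
}

# set_power_limit GPU_INDEX WATTS (requires root)
set_power_limit() {
  local gpu="$1" want="${2%%.*}" limits min max
  # Prints something like "60.00, 140.00" (minimum, maximum in watts)
  limits="$(nvidia-smi -i "$gpu" \
    --query-gpu=power.min_limit,power.max_limit --format=csv,noheader,nounits)" || return 1
  min="$(ceil_watts "${limits%%,*}")"  # round the minimum up so we never go below it
  max="${limits##*, }"
  max="${max%%.*}"                     # round the maximum down
  want="$(clamp_watts "$want" "$min" "$max")"
  nvidia-smi -i "$gpu" -pm 1 && nvidia-smi -i "$gpu" -pl "$want"
}
```

Running `set_power_limit 0 85` as root would apply 85 W to GPU 0 when 85 W is inside the card's range, and the nearest allowed limit otherwise.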
The drawback of this solution is lower compute performance. For cryptocurrency mining the trade-off can even be favorable, since the performance loss is small while power consumption drops. For scientific computing, however, the loss of performance can be a serious problem.
Solution 2: Use nvidia-settings (requires X)
If you are running X (not headless), you can enable manual fan control with GPUFanControlState and then set GPUTargetFanSpeed to 100 (percent):
nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=100"
The advantage of this solution is that you do not lose performance. The cost is a shorter expected life for the fans, because they run at 100%. For mining this is a drawback, since the goal there is the best performance per watt. In scientific computing, however, a higher fan speed lets the GPU sustain more performance, which means less computing time for simulations.
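On a rig with several GPUs, the same two assignments have to be repeated for each card. The following sketch builds them all into a single `nvidia-settings` call, which accepts repeated `-a` options. `pin_fans` is a hypothetical helper, and `NVSETTINGS` is a hypothetical variable that lets you substitute `echo` for a dry run. The sketch loudly assumes that every GPU has exactly one fan and that fan N belongs to GPU N. Cards with several fans expose several fan targets, so check the real mapping with `nvidia-settings -q fans` first.

```shell
#!/bin/sh
# pin_fans is a hypothetical helper. ASSUMPTION: every GPU has exactly one fan
# and fan N belongs to GPU N; verify with `nvidia-settings -q fans`.
# NVSETTINGS (hypothetical) can be set to "echo" for a dry run.

# pin_fans GPU_COUNT SPEED_PERCENT
pin_fans() {
  local count="$1" speed="$2" i=0
  set --                                   # reuse "$@" as a safely quoted argument list
  while [ "$i" -lt "$count" ]; do
    set -- "$@" -a "[gpu:$i]/GPUFanControlState=1" \
                -a "[fan:$i]/GPUTargetFanSpeed=$speed"
    i=$((i + 1))
  done
  ${NVSETTINGS:-nvidia-settings} "$@"      # one call carrying all assignments
}
```

For example, `pin_fans 2 100` would pin the fans of GPUs 0 and 1 to 100%. To give fan control back to the driver, set `GPUFanControlState=0` for each GPU.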
Depending on your X configuration, these commands may fail with errors like the following:
ERROR: Error querying enabled displays on GPU 0 (Missing Extension)
ERROR: Error querying connected displays on GPU 0 (Missing Extension).
ERROR: Error resolving target specification 'gpu:1' (No targets match target specification), specified in assignment '[gpu:3]/GPUFanControlState=1'.
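To fix the querying errors, run the following command as root. It generates a new X configuration that enables every GPU (-a) and lets X start even when no display is detected (--allow-empty-initial-configuration). The option --cool-bits=12 turns on manual fan control (bit value 4) and clock offsets (bit value 8). Restart X afterwards so that the new configuration takes effect.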
nvidia-xconfig -s -a --force-generate --allow-empty-initial-configuration --cool-bits=12 --registry-dwords="PerfLevelSrc=0x2222" --no-sli --connected-monitor="DFP-0"