-
Hi, I'm doing some CUDA testing with a non-compute-focused GPU, so the FP64:FP32 ratio is 1:64 rather than the 1:2 of V100, A100, and other Tesla-class cards. I have about 1/20th the FP64 throughput of an A100 but 1/2 to 1/3 of its memory bandwidth, so I suspect a significant compute bottleneck.

My own testing agrees with the discussion in "NekRS, a GPU-Accelerated Spectral Element Navier-Stokes Solver" that there is not much speedup from using single precision for the coarse-grid solve alone. However, I assume most of the compute work is not in the coarse-grid solve (?), so by using FP32 both for the coarse-grid solve and everywhere else, my compute bottleneck would be greatly reduced. I would also expect to get more performance out of the memory bandwidth, since the word size would be smaller. I obviously wouldn't expect to run 64x faster, but I would find it very interesting if I could shift the bottleneck from compute to memory bandwidth. I also realize that going to FP32 can cause numerical issues, but I think it would be interesting to try.

TL;DR: Is running the whole solver in FP32 supported?

Thanks,
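(For what it's worth, the bandwidth half of the argument is easy to sanity-check with a host-side sketch. This is a hypothetical NumPy micro-illustration, not anything from NekRS: for a streaming triad `a = b + c`, an FP32 array moves exactly half the bytes of an FP64 array with the same element count, which is where the expected bandwidth saving comes from.)

```python
import numpy as np

def stream_bytes(dtype, n=1_000_000):
    # Bytes touched by a simple streaming triad a = b + c
    # (read b, read c, write a) for n elements of the given dtype.
    b = np.ones(n, dtype=dtype)
    c = np.ones(n, dtype=dtype)
    a = b + c
    return a.nbytes + b.nbytes + c.nbytes

fp64_bytes = stream_bytes(np.float64)
fp32_bytes = stream_bytes(np.float32)
print(fp64_bytes // fp32_bytes)  # → 2: FP32 moves half the data per element
```

Of course, this only bounds the memory-traffic side; whether the kernel actually speeds up depends on whether it was bandwidth-bound in the first place.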
-
That's not supported at the moment. |