Better compiler caching strategy on Windows #462

zzjjbb · 2024-09-18T20:11:28Z

Is your feature request related to a problem? Please describe.
This problem has annoyed me for years: pycuda runs extremely slowly on Windows but not on Linux. My program contains ~20 ElementwiseKernels and ReductionKernels. SourceModule is used to compile the code, and it saves the cubin files to the cache_dir. This works well on every Linux machine I tested, with only ~1s of overhead to load the functions on later runs. On Windows, however, the first run of my code takes ~2min, and later runs still take ~1min. This is because the source always has to be preprocessed, since it always contains #include <pycuda-complex.hpp>:

pycuda/pycuda/compiler.py

Lines 89 to 90 in 96aab3f

        if "#include" in source:
            checksum.update(preprocess_source(source, options, nvcc).encode("utf-8"))

In my tests, running nvcc --preprocess "empty_file.cu" --compiler-options -EP takes several seconds on every Windows computer I tried. In other words, just computing the checksum that decides whether the cache can be used takes a very long time.
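For anyone who wants to reproduce the timing, here is a minimal sketch that runs the same command on an empty file. The helper names time_command and time_nvcc_preprocess are made up for this example, and the flags are copied from the command above (-EP is passed to the Windows host compiler, so the call is only meaningful there).

```python
import shutil
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def time_command(cmd):
    """Run cmd once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    # check=False: we only care about how long the call takes, not its exit code.
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start


def time_nvcc_preprocess(nvcc="nvcc"):
    """Time nvcc preprocessing an empty .cu file; return None if nvcc is not on PATH."""
    if shutil.which(nvcc) is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        empty = Path(tmp) / "empty_file.cu"
        empty.write_text("")
        # Same flags as in the report above.
        return time_command([nvcc, "--preprocess", str(empty), "--compiler-options", "-EP"])


if __name__ == "__main__":
    elapsed = time_nvcc_preprocess()
    if elapsed is None:
        print("nvcc not found on PATH")
    else:
        print(f"nvcc --preprocess took {elapsed:.2f} s")
```

Running it a few times on both Windows and Linux should show whether the per-call overhead matches what is described here.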

Describe the solution you'd like
I tried monkey patching this to remove the preprocess call above, and it works well. I'd like to find a better way to do it. The easiest way I can think of is adding an option that forces the #include check to be skipped (it should not be enabled by default, since the user must understand the potential risk).
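Roughly, the monkey patch can look like the sketch below. It assumes the checksum code looks up preprocess_source as a module-level name in pycuda.compiler, as the excerpt above suggests. The helper skip_include_preprocessing is a made-up name, and the demo uses a stand-in object so it runs without CUDA installed. The risk is that a change inside an included header no longer invalidates the cache.

```python
import types


def skip_include_preprocessing(compiler_module):
    """Make compiler_module.preprocess_source return the source unchanged.

    Returns the original function so the patch can be undone later.
    """
    original = compiler_module.preprocess_source

    def passthrough(source, options, nvcc):
        # Hash the raw source text instead of running nvcc --preprocess.
        return source

    compiler_module.preprocess_source = passthrough
    return original


# Stand-in for pycuda.compiler so the sketch runs without CUDA installed.
fake_compiler = types.SimpleNamespace(
    preprocess_source=lambda source, options, nvcc: "<expensive nvcc output>"
)
original = skip_include_preprocessing(fake_compiler)
patched_result = fake_compiler.preprocess_source("#include <x.hpp>\n", [], "nvcc")

# Intended use with the real library (an assumption, not executed here):
#   import pycuda.compiler
#   skip_include_preprocessing(pycuda.compiler)
```

Keeping the returned original makes it easy to restore the default behavior, for example after changing a header.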

Describe alternatives you've considered
Are there any nvcc options that speed up the preprocessing? I don't know.

Additional context
The link below is one of the examples I worked on, but I guess any simple GPUArray functionality that relies on SourceModule is affected by this.
https://github.com/bu-cisl/SSNP-IDT/blob/master/examples/forward_model.py


inducer · 2024-09-18T20:46:21Z

Thanks for the report, I had no idea. Continuing along those lines, I'm not sure I have a good idea for how to approach this. We could introduce a flag so you don't have to monkeypatch, but that sacrifices correctness to an extent.

zzjjbb · 2024-09-18T21:22:52Z

#463
Does this make sense: we check include_dirs, and ignore #include when it is empty? This should cover most of the simple cases, though it may still be incorrect if the user upgrades CUDA or pycuda and keeps the cache (appending these version numbers to the cache folder or file name would invalidate the cache in that case). Also, we could do this only for the poor Windows users to minimize the potential problems.
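To make the idea concrete, here is a hypothetical sketch of such a cache key. It is not the code in #463. The function cache_checksum, its parameters, and the use of SHA-256 are all assumptions for illustration. The expensive preprocess callback is skipped only on Windows with no include_dirs, and the version strings are mixed into the key so that an upgrade invalidates old entries.

```python
import hashlib
import sys


def cache_checksum(source, options, include_dirs, versions, preprocess,
                   platform=sys.platform):
    """Return a hex cache key for source (hypothetical helper, not pycuda API).

    preprocess is a callable standing in for the expensive nvcc preprocessing step.
    versions is a sequence such as (cuda_version, pycuda_version).
    """
    checksum = hashlib.sha256()

    # Skip the expensive step only on Windows when no user include dirs are set.
    skip_preprocess = platform.startswith("win") and not include_dirs

    if "#include" in source and not skip_preprocess:
        checksum.update(preprocess(source).encode("utf-8"))
    else:
        checksum.update(source.encode("utf-8"))

    for option in options:
        checksum.update(option.encode("utf-8"))

    # Mixing in version strings invalidates the cache after an upgrade.
    for version in versions:
        checksum.update(str(version).encode("utf-8"))

    return checksum.hexdigest()
```

With this shape, the headers shipped with CUDA and pycuda are covered by the version strings, while user headers found through include_dirs still go through the preprocessor.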

zzjjbb added the enhancement label Sep 18, 2024

zzjjbb mentioned this issue Sep 18, 2024

remove expensive preprocess for cache on Windows #463

Open
