
Hopper config #80

Open · wants to merge 5 commits into dev
Conversation

christindbose commented Oct 24, 2024

This is an attempt to update the configs with the most relevant features from Hopper (SXM5, to be precise). The key config parameters modified are (see the illustrative sketch after this list):

  • Number of SMs
  • Number of memory channels and data width per channel (HBM2 → HBM3 doubles the number of channels per stack but halves the width per channel)
  • L1D cache size
  • L2 cache size
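For readers who want to map these bullets onto gpgpu-sim's gpgpusim.config knobs, a minimal sketch follows. The option names (-gpgpu_n_clusters, -gpgpu_n_mem, -gpgpu_cache:dl1, -gpgpu_cache:dl2) are standard gpgpu-sim options, but the values here are illustrative H100-SXM5-like numbers, not necessarily the ones committed in this PR:

    # Illustrative only -- the exact values are in the PR diff.
    -gpgpu_n_clusters 132            # H100 SXM5 exposes 132 SMs
    -gpgpu_n_cores_per_cluster 1
    -gpgpu_n_mem 80                  # HBM3: more channels per stack than HBM2, each half as wide
    # L1D and L2 capacities are encoded in the cache geometry strings,
    # whose leading fields are <sets>:<line size>:<assoc>:
    # -gpgpu_cache:dl1 ...           (per-SM L1D)
    # -gpgpu_cache:dl2 ...           (per-partition L2 slice)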

christindbose marked this pull request as ready for review on October 24, 2024.
kunal-mansukhani commented Jan 3, 2025

@christindbose Tried this out and got:

    hashing.cc:88: unsigned int ipoly_hash_function(new_addr_type, unsigned int, unsigned int): Assertion `"\nmemory_partition_indexing error: The number of " "channels should be " "16, 32 or 64 for the hashing IPOLY index function. other banks " "numbers are not supported. Generate it by yourself! \n" && 0' failed.

Because the number of memory channels is 80, it can't be IPOLY-hashed. What's the workaround?
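For context on why 80 trips this: the IPOLY index function in gpgpu-sim's hashing.cc hard-codes a generator polynomial per supported channel count, roughly along these lines (a paraphrased sketch, not the verbatim source):

    #include <cassert>

    // Paraphrased sketch of the shape of ipoly_hash_function in gpgpu-sim's
    // hashing.cc (not the verbatim source): an XOR-folding polynomial is
    // hard-coded per supported channel count, so any count other than
    // 16/32/64 trips the assertion quoted above.
    unsigned ipoly_hash_sketch(unsigned long long higher_bits, unsigned index,
                               unsigned bank_set_num) {
      switch (bank_set_num) {
        case 16:  // fold higher address bits with the 16-channel polynomial
        case 32:  // ... 32-channel polynomial ...
        case 64:  // ... 64-channel polynomial ...
          break;  // (polynomial math elided in this sketch)
        default:
          assert(0 && "channel count must be 16, 32 or 64 for IPOLY indexing");
      }
      (void)higher_bits;  // used by the real hash computation
      return index;       // placeholder for the hashed channel index
    }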

christindbose (Author) commented
@kunal-mansukhani This is caused by the L2 cache configuration setting. I've pushed a simple fix; please try it out and see if it works for your case.
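If the fix is what the error message suggests, it would be a one-line config change that switches the partition/L2 index function away from IPOLY (gpgpu-sim exposes this as -gpgpu_memory_partition_indexing). The exact value used in the PR is an assumption here; check the diff:

    # Assumption: fall back to linear partition indexing, since 80 channels
    # has no hard-coded IPOLY polynomial. See the actual PR diff for the fix.
    -gpgpu_memory_partition_indexing 0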

kunal-mansukhani commented
@christindbose
Thanks for the help! I tried your change and it fixed the error I was getting; the program runs correctly now. But tell me if this makes sense:

I'm running a 32 × 32 shared-memory matmul with the TITAN V config vs. the H100 config.
Titan V Results:

gpgpu_simulation_time = 0 days, 0 hrs, 0 min, 1 sec (1 sec)
gpgpu_simulation_rate = 44032 (inst/sec)
gpgpu_simulation_rate = 903 (cycle/sec)
gpgpu_silicon_slowdown = 1328903x

H100 Results:

gpgpu_simulation_time = 0 days, 0 hrs, 0 min, 3 sec (3 sec)
gpgpu_simulation_rate = 14677 (inst/sec)
gpgpu_simulation_rate = 1971 (cycle/sec)
gpgpu_silicon_slowdown = 574327x

Shouldn't the H100 be much faster when factoring in the gpgpu_silicon_slowdown?

christindbose (Author) commented
What you're seeing is the simulation time, not the program runtime. Hopper is a larger GPU in terms of resources (#SMs, #channels, etc.), so it's plausible that simulating it takes longer. The kernel runtime is given by the cycle count, so that's what you should really be looking at.
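A worked example with hypothetical numbers: the modeled kernel time comes from the gpu_tot_sim_cycle stat and the configured core clock, not from the simulator's wall-clock seconds.

    modeled kernel time = gpu_tot_sim_cycle / core clock
                        = 2,000,000 cycles / 1.2 GHz     (hypothetical numbers)
                        ≈ 1.67 ms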

kunal-mansukhani commented
@christindbose

Got it. So if I compute total cycles / core clock speed, that should give me the actual execution time the program would have on that device, correct? I'm doing that for this comparison and the Titan V is still coming out ahead.

Is it that the overhead is too large relative to the actual compute? Should I be trying larger matmuls?

christindbose (Author) commented
That is correct. How much of a difference are we talking about?

You should be looking at larger matrix sizes. It's possible that Hopper is highly underutilized at small sizes.
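Back-of-the-envelope illustration, assuming the common one-thread-per-output-element launch with a single 32 × 32 thread block (the actual kernel's launch configuration may differ):

    32 × 32 matmul → 1024 threads → 1 thread block (for a 32 × 32 blockDim)
    1 resident block → 1 busy SM out of ~132 on the Hopper config
    → over 99% of the GPU sits idle, so the bigger GPU can't pull ahead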

kunal-mansukhani commented
@christindbose The total number of cycles for the H100 is consistently higher than the total number of cycles for the Titan V, and they both have a similar clock speed, so it looks like, for the same program, the Titan V has a shorter execution time. I tried larger matrices, but the result is the same.

The program I'm using doesn't leverage Tensor Cores; is that the issue?

christindbose (Author) commented
@kunal-mansukhani It's fine not to use tensor cores.

I'd like to know more about your setup. Are you running the simulations in PTX or trace mode? Also, the current Hopper config doesn't reflect the clock frequency of the actual Hopper GPU (we mostly just scaled up the relevant hardware resources), so that will need to be fixed before comparing against real Hopper hardware.
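For reference, the clocks live in gpgpu-sim's -gpgpu_clock_domains option, formatted <core>:<interconnect>:<L2>:<DRAM> in MHz. A hedged sketch of what matching real H100 clocks might look like; the ~1980 MHz core value is NVIDIA's published SXM5 boost clock, the other domains are placeholders, and none of this is in the current PR:

    # Assumption/placeholder values -- NOT what this PR currently ships.
    # <core>:<interconnect>:<L2>:<DRAM> clocks in MHz
    -gpgpu_clock_domains 1980.0:1980.0:1980.0:1313.0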

kunal-mansukhani commented
@christindbose
I'm running the simulation in PTX mode, using CUDA 11.7. So when I calculate the GPU execution time, should I use the clock speed in the config file or the real Hopper core clock speed?
