Add rmsnorm kernel #633
Conversation
Force-pushed from 2794514 to b356942
        triton.Config({}, num_warps=16, num_stages=1),
    ]

def get_hip_autotune_config():
Nothing wrong here. Just wondering if these configs are comprehensive enough to cover the typical use cases you have encountered in actual models.
Good question. I just came up with these based on what I thought made sense; there was no systematic way of deriving them. I was thinking maybe I could add an argument that takes a custom config file, so one doesn't need to touch the code in this file when other configs need to be benchmarked.
Actually, I noticed a small bug in the CUDA configs: there are repeated entries.
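For what it's worth, a rough sketch of that custom-config idea; the file name, JSON schema, and helper function below are hypothetical, not part of this PR:

import json
import triton

# Hypothetical: load extra autotune configs from a user-supplied JSON file,
# where each entry looks like {"num_warps": 8, "num_stages": 2}.
def load_custom_configs(path):
    with open(path) as f:
        entries = json.load(f)
    return [
        triton.Config({}, num_warps=e["num_warps"], num_stages=e["num_stages"])
        for e in entries
    ]

The default config list could then simply be extended with whatever the file provides before it is handed to triton.autotune.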
python/perf-kernels/rmsnorm.py (Outdated)
row = tl.load(row_start_ptr + col_offsets, mask=mask, other=0.0)
g = tl.load(g_ptr + col_offsets, mask=mask, other=0.0)
row_norm = row * row  # square each value
row_norm = tl.sum(row_norm, axis=-1)  # sum across columns (axis=-1)
For RMSNorm, is it always safe to accumulate FP16 values with an FP16 accumulator? Do we need an FP32 accumulator here? I'm assuming that tl.sum of FP16 values is also an FP16.
For instance, if we square 7 three-digit numbers and then add them together, we're already overflowing FP16.
I'm not that familiar with normalization layers; maybe I'm just too conservative and thinking too much about the general case.
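A quick NumPy illustration of that concern (not from the PR, just to make the overflow concrete):

import numpy as np

# FP16 tops out at 65504, so an FP16 accumulator overflows after summing just
# a few squares of three-digit values.
x = np.full(7, 100.0, dtype=np.float16)

acc_fp16 = np.float16(0.0)
for v in x:
    acc_fp16 = np.float16(acc_fp16 + v * v)  # each partial sum rounded back to FP16
print(acc_fp16)  # inf

acc_fp32 = np.float32(0.0)
for v in x:
    acc_fp32 += np.float32(v) * np.float32(v)
print(acc_fp32)  # 70000.0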
I need to think about this case a bit more.
Let me try to give you some food for thought...
Use a wider data type as accumulator
Our GEMM kernel does this:
# INT32 accumulator for INT8 data and FP32 accumulator for everything else.
acc_dtype = tl.float32 if a_ptr.type.element_ty != tl.int8 else tl.int32
accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=acc_dtype)
You can do something similar. I think it's safe to accumulate using FP32.
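Here is a rough sketch of what that could look like for this kernel; the kernel name, signature, strides, and variable names are assumptions based on the quoted snippet, not the PR's actual code:

import triton
import triton.language as tl

# Hedged sketch: RMSNorm forward with an FP32 accumulator, mirroring the GEMM trick.
@triton.jit
def rmsnorm_fwd_fp32_acc(x_ptr, g_ptr, out_ptr, n_cols, eps,
                         row_stride, BLOCK_SIZE: tl.constexpr):
    row_idx = tl.program_id(0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols
    row_start_ptr = x_ptr + row_idx * row_stride

    # Upcast to FP32 before squaring and reducing.
    row = tl.load(row_start_ptr + col_offsets, mask=mask, other=0.0).to(tl.float32)
    g = tl.load(g_ptr + col_offsets, mask=mask, other=0.0).to(tl.float32)

    mean_sq = tl.sum(row * row, axis=-1) / n_cols
    rms = tl.sqrt(mean_sq + eps)
    out = (row / rms) * g

    # Cast back to the output element type on store.
    tl.store(out_ptr + row_idx * row_stride + col_offsets,
             out.to(out_ptr.type.element_ty), mask=mask)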
Use a scale factor
You can scale down the numbers before squaring them and scale up the mean of squares afterwards. Here is a NumPy prototype of the idea:
import numpy as np

def mean_sqr(x):
    return np.sum(x * x) / len(x)

# Scale factor can be a constant or computed on the fly.
def mean_sqr_with_scale_factor(x, scale_factor):
    x = (1 / scale_factor) * x
    mean_sqr = np.sum(x * x) / len(x)
    return (scale_factor * scale_factor) * mean_sqr

def compute_scale_factor(x):
    max_x = np.max(np.abs(x))
    return np.exp(np.floor(np.log(max_x))) if max_x > 0 else 1

np.random.seed(42)
x = np.random.uniform(size=4096, low=-500.0, high=500.0).astype(np.float32)
print(mean_sqr(x))
print(mean_sqr_with_scale_factor(x, compute_scale_factor(x)))
Larger scale factors reduce the risk of overflow but increase the chance of losing precision. Smaller scale factors retain precision but may not prevent overflow effectively if the numbers are too large.
Inspect PyTorch implementation
Try to find out what PyTorch is doing. If PyTorch doesn't care about sum-of-squares overflow, then we can follow it and not care either.
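For reference, one way to probe this empirically (assuming PyTorch >= 2.4, where F.rms_norm exists, as mentioned later in this thread, and a GPU; the shapes are arbitrary):

import torch
import torch.nn.functional as F

# Feed values whose squared sum blows past the FP16 limit and check whether the
# built-in op overflows or upcasts internally.
x = torch.full((1, 4096), 100.0, dtype=torch.float16, device="cuda")
w = torch.ones(4096, dtype=torch.float16, device="cuda")
y = F.rms_norm(x, (4096,), weight=w, eps=1e-6)
print(y.isfinite().all())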
Seeing this PR reminded me that I added a kernel without a benchmark and without a test! Shame on me! I think adding a benchmark and a correctness test must be mandatory from now on. Reviewers should reject PRs lacking these features.
Force-pushed from b356942 to c949904
Can we add tests for this kernel?
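Something along these lines, perhaps; the rmsnorm wrapper name, its import path, and its signature are guesses, not the PR's actual API:

import pytest
import torch

# FP32 reference implementation to compare against.
def ref_rmsnorm(x, g, eps=1e-6):
    x32 = x.float()
    rms = torch.sqrt(x32.pow(2).mean(dim=-1, keepdim=True) + eps)
    return ((x32 / rms) * g.float()).to(x.dtype)

@pytest.mark.parametrize("M, N", [(1, 128), (32, 4096), (8, 8192)])
def test_rmsnorm(M, N):
    torch.manual_seed(0)
    x = torch.randn(M, N, dtype=torch.float16, device="cuda")
    g = torch.randn(N, dtype=torch.float16, device="cuda")
    from rmsnorm import rmsnorm  # hypothetical wrapper exposed by the kernel file
    y = rmsnorm(x, g)
    torch.testing.assert_close(y, ref_rmsnorm(x, g), atol=1e-2, rtol=1e-2)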
Force-pushed from c949904 to f625a47
@rahulbatra85 There are failures in the RMSNorm code.
Yeah, looking.
@micmelesse Ah ok, so the RMS norm layer was added in PyTorch starting with version 2.4. Which Docker image does the CI use?
Force-pushed from f625a47 to fa9cc06
@rahulbatra85 It's
Force-pushed from fa9cc06 to f80aed7
@micmelesse It's passing with the new Docker image.
Adds forward kernel for RMSNorm