Add Layernorm kernel #641

Open · rahulbatra85 wants to merge 1 commit into base: main_perf

Conversation

rahulbatra85

No description provided.

@brunomazzottiamd (Collaborator) left a comment


@rahulbatra85, I can't see anything wrong with your PR. I just have a few questions and minor code cleanup suggestions. Feel free to ignore them if you judge it appropriate.

Please add a short line about this new kernel to the python/perf-kernels/README.md file.

@@ -128,8 +128,10 @@ jobs:
pytest -vvv ./python/perf-kernels/flash-attention.py
pytest -vvvv ./python/perf-kernels/softmax.py
pytest -vvv ./python/perf-kernels/rmsnorm.py
pytest -vvv ./python/perf-kernels/layernorm.py

Collaborator

What do you think about running all tests with just one pytest invocation? According to https://docs.pytest.org/en/stable/how-to/usage.html, it's possible to do something like pytest -vvvv ./python/perf-kernels. That way we'll be editing .github/workflows/amd_perf_kernel_Integration_tests.yml less often and new tests will run by default. Do you see any drawback?
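
Concretely, the change could look roughly like this (hunk header and surrounding workflow context omitted, just for illustration):

-pytest -vvv ./python/perf-kernels/flash-attention.py
-pytest -vvvv ./python/perf-kernels/softmax.py
-pytest -vvv ./python/perf-kernels/rmsnorm.py
-pytest -vvv ./python/perf-kernels/layernorm.py
+pytest -vvvv ./python/perf-kernels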

Maybe it's worth asking @micmelesse's opinion on this.

Author

Yeah, that's a question for Michael.

Collaborator

Let's wait for Michael's opinion!

@micmelesse (Collaborator) · Sep 18, 2024

That is fine. I think some of the tests are broken, but it may be worth it to see the state of things.

y = x_hat * w + b
# Write output
tl.store(Y + cols, y, mask=mask)

Collaborator

Just an idea: we have three for loops that do masked loads. Do you foresee any benefit in peeling the last iteration of each loop so that all iterations except the last do unmasked loads? I think Shucai and Xiaohu got some performance improvements doing this with GEMMs. I'm not sure whether the idea would be beneficial for layer norm; a rough sketch of what I mean is below.
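
To make the suggestion concrete, here is a minimal, hypothetical sketch of the peeling pattern for a single row reduction (the kernel name, the Mean output, and the exact block handling are illustrative, not taken from this PR):

import triton
import triton.language as tl

@triton.jit
def _row_mean_peeled_kernel(X, Mean, stride, N, BLOCK_SIZE: tl.constexpr):
    # Hypothetical reduction: mean of one row, with the ragged tail peeled off
    # so that every full block is loaded without a mask.
    row = tl.program_id(0)
    X += row * stride
    acc = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
    num_full = (N // BLOCK_SIZE) * BLOCK_SIZE  # columns covered by full blocks
    for off in range(0, num_full, BLOCK_SIZE):
        cols = off + tl.arange(0, BLOCK_SIZE)
        acc += tl.load(X + cols).to(tl.float32)  # unmasked load
    # Peeled last iteration: only this load needs a mask.
    cols = num_full + tl.arange(0, BLOCK_SIZE)
    acc += tl.load(X + cols, mask=cols < N, other=0.).to(tl.float32)
    tl.store(Mean + row, tl.sum(acc, axis=0) / N)

The same pattern would apply to the other two loops; whether the extra code is worth it probably depends on how large the loop trip count is compared to the single masked tail iteration.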

Author

OK, yeah, I didn't think of that. I'll try this out.

Collaborator

Please let me know if this helped at all.

@brunomazzottiamd: (comment marked as resolved)
