Add Layernorm kernel #641
base: main_perf
Conversation
Force-pushed b772444 to 674d526
@rahulbatra85, I can't see anything wrong with your PR. I just have some questions and minor code cleanup suggestions. Feel free to ignore them if you judge it appropriate.
Please drop a short line about this new kernel in the python/perf-kernels/README.md file.
@@ -128,8 +128,10 @@ jobs:
      pytest -vvv ./python/perf-kernels/flash-attention.py
      pytest -vvvv ./python/perf-kernels/softmax.py
      pytest -vvv ./python/perf-kernels/rmsnorm.py
+     pytest -vvv ./python/perf-kernels/layernorm.py
What do you think about running all tests with just one pytest invocation? According to https://docs.pytest.org/en/stable/how-to/usage.html, it's possible to do something like pytest -vvvv ./python/perf-kernels. That way, we'd be editing .github/workflows/amd_perf_kernel_Integration_tests.yml less often, and new tests would run by default. Do you see any drawback?
Maybe it's worth asking @micmelesse's opinion on this.
Yeah, that's a Michael question.
Let's wait for Michael's opinion!
That is fine. I think some of the tests are broken, but it may be worth it to see the state of things.
python/perf-kernels/layernorm.py
Outdated
    y = x_hat * w + b
    # Write output
    tl.store(Y + cols, y, mask=mask)
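For context, here is a minimal NumPy reference of the per-row computation the snippet above finishes (normalize, then scale and shift). The function name and the eps default are assumptions for illustration; they are not taken from the PR.

```python
import numpy as np

def layernorm_ref(x, w, b, eps=1e-5):
    # Per-row mean and variance over the last dimension
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized input
    return x_hat * w + b                     # scale and shift

x = np.array([[1.0, 2.0, 3.0, 4.0]])
w = np.ones(4)
b = np.zeros(4)
y = layernorm_ref(x, w, b)
```

With w = 1 and b = 0, each output row has (approximately) zero mean and unit variance, which is a handy sanity check against the kernel's output.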
Just an idea: we have three for loops that do masked loads. Do you foresee any benefit in peeling the last iteration of each loop so that all iterations except the last do unmasked loads? I think Shucai and Xiaohu got some performance improvements doing this with GEMMs. I'm not sure if the idea would be beneficial for layer norm.
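To sketch the loop-peeling idea in plain Python (an illustration of the technique, not the Triton kernel itself): full blocks take the fast path with no bounds handling, and only the peeled tail pays for masking.

```python
import numpy as np

def blocked_sum(x, block=4):
    # Sum x in fixed-size blocks, peeling the ragged tail.
    n = len(x)
    n_full = n // block
    total = 0.0
    # Main loop: every block is complete, so no bounds mask is needed.
    for i in range(n_full):
        total += x[i * block:(i + 1) * block].sum()
    # Peeled tail: only this part handles the out-of-bounds remainder.
    if n % block:
        total += x[n_full * block:].sum()
    return total
```

In a Triton kernel the analogue would be unmasked tl.load calls in the main loop and a single masked load for the remainder, trading a little code duplication for cheaper loads on the hot path.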
OK, yeah, I didn't think of that. Will try this out.
Please let me know if this helped at all.
Force-pushed d88abbc to 13c01c4
This comment was marked as resolved.
Force-pushed 13c01c4 to 042aa91
No description provided.