
increase cuda_cpy block size #996

Merged: 1 commit merged into ggerganov:master on Oct 23, 2024

Conversation

bssrdf (Contributor) commented on Oct 22, 2024

This PR gives a small performance boost to the CUDA backend cpy op. All test cases in test-backend-ops passed.
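
For context, a minimal sketch of the pattern being tuned, assuming the copy kernels are launched with a compile-time block-size constant. The names (`cpy_f32_f16`, `CUDA_CPY_BLOCK_SIZE`) and the value 64 are illustrative, not the actual diff; the PR's change amounts to raising that constant so each block carries more threads:

```cuda
#include <cstdint>
#include <cuda_fp16.h>

#define CUDA_CPY_BLOCK_SIZE 64  // hypothetical value; the PR's change is raising this constant

// One scalar element per thread; the grid covers all n elements.
static __global__ void cpy_f32_f16(const float * __restrict__ src, half * __restrict__ dst, const int64_t n) {
    const int64_t i = (int64_t) blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    dst[i] = __float2half(src[i]);
}

static void cpy_f32_f16_cuda(const float * src, half * dst, const int64_t n, cudaStream_t stream) {
    const int64_t num_blocks = (n + CUDA_CPY_BLOCK_SIZE - 1) / CUDA_CPY_BLOCK_SIZE;
    cpy_f32_f16<<<num_blocks, CUDA_CPY_BLOCK_SIZE, 0, stream>>>(src, dst, n);
}
```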

JohannesGaessler (Collaborator) left a comment:

You should be able to get even better performance by replacing the scalar copies with copies of half2/float2.
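
A hedged sketch of this suggestion, assuming a contiguous F32→F16 copy: each thread performs one float2 load and one half2 store instead of two scalar accesses. The kernel/function names, the even-length assumption, and the tail handling are illustrative, not the reviewer's code:

```cuda
#include <cstdint>
#include <cuda_fp16.h>

#define CUDA_CPY_BLOCK_SIZE 64  // hypothetical, as above

// Each thread moves two elements: one 8-byte float2 load, one 4-byte half2 store.
static __global__ void cpy_f32_f16_x2(const float2 * __restrict__ src, half2 * __restrict__ dst, const int64_t n2) {
    const int64_t i = (int64_t) blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= n2) {
        return;
    }
    const float2 v = src[i];
    dst[i] = __floats2half2_rn(v.x, v.y);
}

static void cpy_f32_f16_x2_cuda(const float * src, half * dst, const int64_t n, cudaStream_t stream) {
    const int64_t n2 = n/2;  // assumes n is even and src/dst are suitably aligned; a scalar tail would cover any remainder
    const int64_t num_blocks = (n2 + CUDA_CPY_BLOCK_SIZE - 1) / CUDA_CPY_BLOCK_SIZE;
    cpy_f32_f16_x2<<<num_blocks, CUDA_CPY_BLOCK_SIZE, 0, stream>>>((const float2 *) src, (half2 *) dst, n2);
}
```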

bssrdf (Contributor, Author) commented on Oct 23, 2024

Thanks for approving. I read somewhere that vectorized global memory reads do not improve throughput much; vectorization helps more when loading from shared memory.

JohannesGaessler merged commit d51c6c0 into ggerganov:master on Oct 23, 2024. 4 checks passed.
bssrdf deleted the bump-cuda-cpy-block-size branch on October 28, 2024.