
increase cuda_cpy block size #996

Merged: 1 commit merged into ggerganov:master on Oct 23, 2024

Conversation

bssrdf (Contributor) commented on Oct 22, 2024

This PR gives a small performance boost to the CUDA backend cpy op. All test cases in test-backend-ops passed.
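
For context, a minimal sketch of the pattern being tuned, assuming the copy kernels are launched with a compile-time block-size constant. The names (`cpy_f32_f16`, `CUDA_CPY_BLOCK_SIZE`) and the value 64 are illustrative, not the actual diff; the PR's change amounts to raising that constant so each block carries more threads:

```cuda
#include <cstdint>
#include <cuda_fp16.h>

#define CUDA_CPY_BLOCK_SIZE 64  // hypothetical value; the PR's change is raising this constant

// One scalar element per thread; the grid covers all n elements.
static __global__ void cpy_f32_f16(const float * __restrict__ src, half * __restrict__ dst, const int64_t n) {
    const int64_t i = (int64_t) blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    dst[i] = __float2half(src[i]);
}

static void cpy_f32_f16_cuda(const float * src, half * dst, const int64_t n, cudaStream_t stream) {
    const int64_t num_blocks = (n + CUDA_CPY_BLOCK_SIZE - 1) / CUDA_CPY_BLOCK_SIZE;
    cpy_f32_f16<<<num_blocks, CUDA_CPY_BLOCK_SIZE, 0, stream>>>(src, dst, n);
}
```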

JohannesGaessler (Collaborator) left a comment:

You should be able to get even better performance by replacing the scalar copies with copies of half2/float2.
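
A hedged sketch of this suggestion, assuming a contiguous F32→F16 copy: each thread performs one float2 load and one half2 store instead of two scalar accesses. The kernel/function names, the even-length assumption, and the tail handling are illustrative, not the reviewer's code:

```cuda
#include <cstdint>
#include <cuda_fp16.h>

#define CUDA_CPY_BLOCK_SIZE 64  // hypothetical, as above

// Each thread moves two elements: one 8-byte float2 load, one 4-byte half2 store.
static __global__ void cpy_f32_f16_x2(const float2 * __restrict__ src, half2 * __restrict__ dst, const int64_t n2) {
    const int64_t i = (int64_t) blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= n2) {
        return;
    }
    const float2 v = src[i];
    dst[i] = __floats2half2_rn(v.x, v.y);
}

static void cpy_f32_f16_x2_cuda(const float * src, half * dst, const int64_t n, cudaStream_t stream) {
    const int64_t n2 = n/2;  // assumes n is even and src/dst are suitably aligned; a scalar tail would cover any remainder
    const int64_t num_blocks = (n2 + CUDA_CPY_BLOCK_SIZE - 1) / CUDA_CPY_BLOCK_SIZE;
    cpy_f32_f16_x2<<<num_blocks, CUDA_CPY_BLOCK_SIZE, 0, stream>>>((const float2 *) src, (half2 *) dst, n2);
}
```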

bssrdf (Contributor, Author) commented on Oct 23, 2024

Thanks for approving. I read somewhere that vectorized global memory reads do not improve throughput much; vectorization helps more when loading from shared memory.

JohannesGaessler merged commit d51c6c0 into ggerganov:master on Oct 23, 2024. 4 checks passed.
bssrdf deleted the bump-cuda-cpy-block-size branch on October 28, 2024.