feat: prefetching #152

Open · wants to merge 1 commit into main
Conversation

daquexian (Contributor) commented Jul 7, 2023

Add a prefetching strategy: `*5+3` means the weights of the first 5 layers always remain in GPU memory, and when the i-th layer is executed, the weights of the (i+3)-th layer are prefetched asynchronously into GPU memory (and dropped once the (i+3)-th layer finishes executing).

The old stream strategy `*5+` now becomes an abbreviation of `*5+0`.
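
For reference, here is a minimal sketch of how the extended `*<pinned>+<prefetch>` suffix could be interpreted; this is not the PR's actual parser, and it only handles the suffix itself, not a full strategy string:

```python
import re

def parse_stream_strategy(suffix: str) -> tuple[int, int]:
    """Parse a suffix like '*5+3' into (pinned_layers, prefetch_distance).

    '*5+' (no trailing number) is treated as '*5+0', i.e. the old
    stream strategy with no prefetching.
    """
    m = re.fullmatch(r"\*(\d+)\+(\d*)", suffix)
    if m is None:
        raise ValueError(f"not a stream strategy: {suffix!r}")
    pinned = int(m.group(1))
    prefetch = int(m.group(2)) if m.group(2) else 0
    return pinned, prefetch

assert parse_stream_strategy("*5+3") == (5, 3)
assert parse_stream_strategy("*5+") == (5, 0)   # old syntax, now an abbreviation
```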

The following shows the overlap between the computation CUDA stream (blue) and the memcpy CUDA stream (green):

[Screenshot: profiler timeline with the computation stream (blue) overlapping the memcpy stream (green)]
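
As an illustration of the mechanism, here is a minimal PyTorch sketch of this kind of compute/memcpy overlap. It is not the PR's actual code; `cpu_weights`, `run_layer`, and the pipeline warm-up loop are assumptions for the example:

```python
import torch

copy_stream = torch.cuda.Stream()   # dedicated memcpy stream
k = 3                               # prefetch distance, as in *5+3
gpu_cache = {}                      # layer index -> weights on GPU
events = {}                         # layer index -> copy-done event

def prefetch(i, cpu_weights):
    # Issue the H2D copies on the side stream; with pinned host memory,
    # non_blocking=True lets them overlap with the compute stream.
    with torch.cuda.stream(copy_stream):
        gpu_cache[i] = {name: w.to("cuda", non_blocking=True)
                        for name, w in cpu_weights[i].items()}
        events[i] = torch.cuda.Event()
        events[i].record(copy_stream)

def run_layers(x, cpu_weights, run_layer):
    n = len(cpu_weights)
    for j in range(min(k, n)):
        prefetch(j, cpu_weights)                # warm up the pipeline
    for i in range(n):
        # Make the compute stream wait until layer i's copy has finished.
        torch.cuda.current_stream().wait_event(events[i])
        x = run_layer(x, gpu_cache.pop(i))      # weights dropped after use
        # (real code would also call Tensor.record_stream so the caching
        #  allocator doesn't reuse this memory while the compute stream
        #  may still be reading it)
        if i + k < n:
            prefetch(i + k, cpu_weights)        # overlap copy of layer i+k
    return x
```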

However, the prefetching feature doesn't necessarily speed up inference compared to the old stream strategy under the same memory budget (e.g. `*10+10` vs `*20+0`), because memcpy is much slower than computation and cannot be fully overlapped. Here are some benchmarks of the 7B world model (bf16, RWKV_JIT_ON=1, RWKV_CUDA_ON=0, A100 80G):

| Strategy  | GPU Mem | Time    |
|-----------|---------|---------|
| *10+0     | 6306MB  | 0.7756s |
| *15+0     | 8382MB  | 0.6054s |
| *10+10    | 10502MB | 0.6912s |
| *15+5     | 10498MB | 0.5655s |
| *20+0     | 10456MB | 0.7067s |
| No stream | 15184MB | 0.0567s |

7B world model (fp16, RWKV_JIT_ON=1, RWKV_CUDA_ON=1, A100 80G):

| Strategy  | GPU Mem | Time    |
|-----------|---------|---------|
| *10+0     | 6346MB  | 0.6043s |
| *15+0     | 8422MB  | 0.5046s |
| *10+10    | 10602MB | 0.5930s |
| *15+5     | 10600MB | 0.4961s |
| *20+0     | 10498MB | 0.5532s |
| No stream | 15184MB | 0.0195s |

By the way, it may be helpful to add a `prepare` API that prefetches the weights manually, to reduce the time spent in `forward` in some scenarios.
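
A hypothetical sketch of such a `prepare` API (not implemented in this PR), reusing the `prefetch` helper from the sketch above; the window bounds `pinned` and `k` are illustrative:

```python
def prepare(cpu_weights, pinned: int, k: int):
    """Kick off the async H2D copies for the first prefetch window so
    that a subsequent forward() only waits on the copy-done events
    instead of paying for the copies themselves."""
    for i in range(pinned, min(pinned + k, len(cpu_weights))):
        prefetch(i, cpu_weights)
    # Deliberately no synchronization here: the copies overlap with
    # whatever the caller does between prepare() and forward().
```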
