Support streaming batched chat completion requests #69

Merged
merged 3 commits into from
Jul 30, 2024

guoqingbao
Collaborator

This PR adds support for batched chat completion requests. Key changes:

  1. The Engine and Pipeline now accept batched requests: simultaneous requests that arrive within the same 200 ms window (to be made configurable later) are processed together as one batch.
  2. Models (especially the rotary embedding) are revised to support batched computation.
  3. The linear layer is revised for better performance on batched matmul.
  4. A Python demo that sends batched chat completion requests is added (examples/benchmark.py).
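
To illustrate the windowing idea in item 1, here is a minimal Python sketch of collecting requests that arrive within a fixed time window. It is not the Engine code from this PR (which is written in Rust); the function name `collect_batch`, the `window_s` parameter, and the use of `queue.Queue` are assumptions made for illustration only.

```python
import queue
import time


def collect_batch(requests, window_s=0.2):
    """Wait for one request, then gather any others that arrive within window_s.

    Hypothetical helper: it only mirrors the "requests within 200 ms are
    batched together" behaviour described above.
    """
    batch = [requests.get()]  # block until the first request arrives
    deadline = time.monotonic() + window_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # the window has closed
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch


# Usage: three requests that are already queued end up in the same batch.
pending = queue.Queue()
for prompt in ["prompt A", "prompt B", "prompt C"]:
    pending.put(prompt)

batch = collect_batch(pending, window_s=0.05)
print(len(batch), batch)
```

A fixed window trades a small amount of added latency for the first request in exchange for larger batches, which is presumably why the PR notes that the 200 ms value should become configurable.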

Tested case:

Server side:

cargo run --release -- --port 2000 --weight-path /home/llama2_7b/ llama --repeat-last-n 64

Client side:

python3 examples/benchmark.py
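
For requests to be batched they have to reach the server at nearly the same time, so a client needs to issue them concurrently. The sketch below shows one way to do that with a thread pool. It is not the content of examples/benchmark.py: `run_concurrently` and `fake_send` are hypothetical names, and `fake_send` stands in for the real HTTP call so that the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor

# Two of the prompts that appear in the server log below.
PROMPTS = [
    "Explain how to best learn Rust.",
    "Please talk about deep learning in 100 words.",
]


def run_concurrently(prompts, send):
    """Submit every prompt at once; replies are returned in prompt order."""
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt):
    # Hypothetical stand-in for the request to the chat completion server.
    return f"echo: {prompt}"


replies = run_concurrently(PROMPTS, fake_send)
for reply in replies:
    print(reply)
```

In a real client, `send` would post the prompt to the server started above and read back the streamed completion.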

Performance results (max_tokens=1024, A100, BF16, LLaMa2 7B):

Server started at http://127.0.0.1:2000.



Prompt "[INST] Explain how to best learn Rust. [/INST]"
Request cmpl-0b7fd7b1-b2c1-495f-97f4-e87c4f5edee9 with length 17 added to sequence group.



Prompt "[INST] Please talk about deep learning in 100 words. [/INST]"
Request cmpl-9ac804dd-c00e-48ad-9ecb-f19364fa3f55 with length 19 added to sequence group.



Prompt "[INST] Do you know the capital city of China? Talk the details of you known. [/INST]"
Request cmpl-d8fb7981-59ab-4e2b-9f69-84bb93170fae with length 25 added to sequence group.



Prompt "[INST] Who is the best female actor in the world? Explain why. [/INST]"
Request cmpl-5b1f71bd-141a-4680-9173-2a9c46f2ac43 with length 21 added to sequence group.



Prompt "[INST] How to dealing with depression? [/INST]"
Request cmpl-cdcb9db3-8068-4e8a-8c2b-63d6412ad9e7 with length 15 added to sequence group.



Prompt "[INST] How to make money in short time? [/INST]"
Request cmpl-8204a6b4-58f4-44a1-b2a1-fef354b74fef with length 15 added to sequence group.



Prompt "[INST] What is the future trend of large language model? [/INST]"
Request cmpl-e1f3a8b5-91ab-45c6-af05-31db577f8d32 with length 19 added to sequence group.



Prompt "[INST] The famous tech companies in the world. [/INST]"
Request cmpl-5546945d-9aea-4c8f-906c-3a8b7d3c4e07 with length 17 added to sequence group.



Prompt "[INST] Explain how to best learn Rust. [/INST]"
Request cmpl-bef3358e-b3ad-49cb-9e4c-25175f6a0194 with length 17 added to sequence group.



Prompt "[INST] Please talk about deep learning in 100 words. [/INST]"
Request cmpl-55062609-d487-47cd-8114-cb173f544ba3 with length 19 added to sequence group.



Prompt "[INST] Do you know the capital city of China? Talk the details of you known. [/INST]"
Request cmpl-7cb5517a-b210-49a1-878e-9c902de89581 with length 25 added to sequence group.



Prompt "[INST] Who is the best female actor in the world? Explain why. [/INST]"
Request cmpl-c8693522-e3fd-4675-b465-176cf4ae9f18 with length 21 added to sequence group.



Prompt "[INST] How to dealing with depression? [/INST]"
Request cmpl-1ec8a66d-37e8-4714-9fe7-0a5822fd5fec with length 15 added to sequence group.



Prompt "[INST] How to make money in short time? [/INST]"
Request cmpl-60d6d28a-85c6-48da-80f7-fae13ce2d4bd with length 15 added to sequence group.



Prompt "[INST] What is the future trend of large language model? [/INST]"
Request cmpl-cda15dbd-ca16-4b29-ac4a-d6edbffac666 with length 19 added to sequence group.



Prompt "[INST] The famous tech companies in the world. [/INST]"
Request cmpl-96160a43-3bdc-4a94-9217-cfd494f040de with length 17 added to sequence group.
Request cmpl-55062609-d487-47cd-8114-cb173f544ba3 decoding finished in 3 seconds
Request cmpl-9ac804dd-c00e-48ad-9ecb-f19364fa3f55 decoding finished in 3 seconds
Request cmpl-c8693522-e3fd-4675-b465-176cf4ae9f18 decoding finished in 12 seconds
Request cmpl-5b1f71bd-141a-4680-9173-2a9c46f2ac43 decoding finished in 12 seconds
Request cmpl-5546945d-9aea-4c8f-906c-3a8b7d3c4e07 decoding finished in 13 seconds
Request cmpl-96160a43-3bdc-4a94-9217-cfd494f040de decoding finished in 13 seconds
Request cmpl-60d6d28a-85c6-48da-80f7-fae13ce2d4bd decoding finished in 14 seconds
Request cmpl-8204a6b4-58f4-44a1-b2a1-fef354b74fef decoding finished in 14 seconds
Request cmpl-1ec8a66d-37e8-4714-9fe7-0a5822fd5fec decoding finished in 17 seconds
Request cmpl-cdcb9db3-8068-4e8a-8c2b-63d6412ad9e7 decoding finished in 17 seconds
Request cmpl-7cb5517a-b210-49a1-878e-9c902de89581 decoding finished in 17 seconds
Request cmpl-d8fb7981-59ab-4e2b-9f69-84bb93170fae decoding finished in 17 seconds
Request cmpl-cda15dbd-ca16-4b29-ac4a-d6edbffac666 decoding finished in 18 seconds
Request cmpl-e1f3a8b5-91ab-45c6-af05-31db577f8d32 decoding finished in 18 seconds
Request cmpl-bef3358e-b3ad-49cb-9e4c-25175f6a0194 decoding finished in 23 seconds
Request cmpl-0b7fd7b1-b2c1-495f-97f4-e87c4f5edee9 decoding finished in 23 seconds

 [16 requets] Prefilling: 296 prompt tokens processed in 0 seconds

 [16 requets] Decoding: 9096 tokens processed in 23 seconds (386 tokens/s)
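
The summary figures can be cross-checked against the log with a little arithmetic. The sketch below assumes (this is not stated in the PR) that the log prints elapsed time as whole seconds, which would explain why 9096 tokens over "23 seconds" is reported as 386 tokens/s rather than about 395. The function name `throughput_bounds` is hypothetical.

```python
# Prompt lengths reported in the log: eight distinct prompts, each sent twice.
prompt_lengths = [17, 19, 25, 21, 15, 15, 19, 17] * 2
total_prompt_tokens = sum(prompt_lengths)


def throughput_bounds(tokens, whole_seconds):
    """Range of tokens/s if the true duration lies in [whole_seconds, whole_seconds + 1)."""
    return tokens / (whole_seconds + 1), tokens / whole_seconds


low, high = throughput_bounds(9096, 23)
implied_seconds = 9096 / 386  # duration implied by the reported rate

print(total_prompt_tokens)
print(f"{low:.1f} to {high:.1f} tokens/s")
print(f"{implied_seconds:.2f} s")
```

The prompt lengths sum to 296, matching the prefill line. Under the whole-second assumption the decode rate falls between 379.0 and 395.5 tokens/s, and the reported 386 tokens/s corresponds to roughly 23.56 s of decoding.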

@guoqingbao guoqingbao merged commit e55e4b4 into master Jul 30, 2024
5 checks passed