Hi everyone!
I have a transformer policy in an RL setting. Can vLLM accelerate sampling actions from this policy?
IIRC, vLLM is fast at consecutive generation, where token after token is sampled until <eos>. However, in RL environments we cannot sample actions seamlessly: an action is sampled and fed to the environment to get the next observation, and then another query with the new observation concatenated is made for the next action. Can vLLM speed up sampling in this setting as well?
I think it boils down to two questions: whether vLLM's forward pass is faster than a plain PyTorch model's, and whether caching of prefixes across distinct queries can give me faster sampling (since most of the observation sequence is shared between consecutive queries).
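To make the loop concrete, here is a minimal sketch of what I mean (the model name, environment, and prompts are placeholders rather than my real setup, and I'm assuming `enable_prefix_caching` is the relevant option):

```python
from vllm import LLM, SamplingParams

class DummyTextEnv:
    """Stand-in for the real environment: returns a short text observation per step."""
    def reset(self) -> str:
        self.t = 0
        return "Observation 0: you are at the start. Action:"

    def step(self, action: str):
        self.t += 1
        obs = f"\nObservation {self.t}: something happened. Action:"
        return obs, 0.0, self.t >= 5  # observation, reward, done

# Assumption: enable_prefix_caching turns on automatic prefix caching across requests.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(temperature=1.0, max_tokens=16)

env = DummyTextEnv()
prompt = env.reset()
done = False
while not done:
    # Each call resubmits the whole growing prompt; everything except the newest
    # observation is a prefix shared with the previous call.
    out = llm.generate([prompt], params)[0]
    action = out.outputs[0].text
    obs, reward, done = env.step(action)
    prompt = prompt + action + obs  # concatenate the new observation for the next query
```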
As a side question, does vLLM support sampling from ad-hoc PyTorch architectures?
Thank you very much for your help.