Tensor parallel in distributed inference #10118

Answered by andoorve (Collaborator) on Nov 8, 2024
MohmedMonsef asked this question in Q&A

Hey sure,

So with tensor parallelism, the downside is that communication cost is higher. However, that's typically not a concern within a single node with very good networking, or when there aren't many long prefills. With TP you get very low latency because you can use all of the memory bandwidth available across your GPUs; we're memory-bound most of the time, so this is usually preferable. PP is ideal when communication is expensive by comparison: a poor interconnect, more communication volume from prefills, or cross-node deployment.
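For intuition, here's a minimal back-of-envelope sketch of that memory-bound argument (the model size and bandwidth figures are my own illustrative assumptions, not numbers from this thread): each decoded token has to stream the full set of weights from HBM, and TP shards the weights across GPUs, so that read is served by their aggregate bandwidth.

```python
# Back-of-envelope decode latency in the memory-bandwidth-bound regime.
# All figures are illustrative assumptions, not measurements:
# a 13B-parameter model in fp16 and ~2 TB/s of HBM bandwidth per GPU.

MODEL_BYTES = 13e9 * 2   # fp16 weights: 2 bytes per parameter
PER_GPU_BW = 2.0e12      # bytes/s of HBM bandwidth per GPU (assumed)

def decode_latency_lower_bound(num_gpus: int) -> float:
    """Minimum time per decoded token: each token streams the full
    weights from HBM once, and TP shards the weights so the read is
    served by the aggregate bandwidth of all GPUs."""
    return MODEL_BYTES / (PER_GPU_BW * num_gpus)

for tp in (1, 2, 4, 8):
    print(f"TP={tp}: >= {decode_latency_lower_bound(tp) * 1e3:.2f} ms/token")
```

Note this lower bound ignores the per-layer all-reduce that TP adds, which is exactly the cost that makes PP the better choice on slow interconnects or across nodes. If you want to try both in vLLM, they're exposed as the `--tensor-parallel-size` and `--pipeline-parallel-size` options of `vllm serve`.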
