How can online inference or server based inference be made faster? #11488

SefaZeng · 2024-12-25T07:55:56Z

SefaZeng
Dec 25, 2024

I've tried both offline batch inference and server inference. I found that with the same dataset and the same model, the speed of server inference is more than twice as slow as that of offline batch inference.

I guess the main reason is that offline batch inference uses batched inference, because the default value of max_num_seqs is 256. (Please correct me if my understanding is wrong.)

If I change the offline inference to input one by one, it will become very slow.

However, I don't know how to transform server inference into the form of batch inference.

Also, I'm wondering if there are other options that have caused the server to slow down. If there are, please let me know. I'd be really grateful!

The offline batch inference scripts is like:

...
        llm = LLM(
            model=args.model_name_or_path,
            tensor_parallel_size=len(available_gpus) // args.pipeline_parallel_size,
            pipeline_parallel_size=args.pipeline_parallel_size,
            trust_remote_code=True,
            max_seq_len_to_capture=args.max_tokens_per_call,
        )
...
        outputs = llm.generate(samples, sampling_params)

The server scripts:

CUDA_VISIBLE_DEVICES=0 vllm serve ${MODEL} --max-model-len 4096 --tensor-parallel-size 1 --port 8000 --enforce-eager

And the client scripts:

...

        client = OpenAI(
            api_key=openai_api_key,
            base_url=openai_api_base.format(**data),
        )
        system = ''' 
SOME SYSTEM PROMPT
        '''
        prompt = ''' 
PROMPT
        '''
        # data is one sampple
        prompt = prompt.format(**data)
        msgs = [ 
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ]
        response = client.chat.completions.create(
            messages=msgs,
            model=model,
            max_tokens=1024,
            temperature=1.0,
            top_p=0.8,
            n=1,
            seed=data['seed'],
            extra_body={'repetition_penalty': 1.05}
        )   
        data['response'] = response.choices[0].message.content
...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How can online inference or server based inference be made faster? #11488

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

How can online inference or server based inference be made faster? #11488

SefaZeng Dec 25, 2024

Replies: 0 comments

SefaZeng
Dec 25, 2024