I have created an API service using Flask and Gunicorn to provide inference with the sentence-transformers model all-MiniLM-L6-v2. I converted and optimised the model with Hugging Face Optimum and run it with onnxruntime. To serve parallel requests, I'm creating multiple Docker containers, each spawning 1 Gunicorn worker. Inside each worker there are 2 threads: one is the main (calling) thread, and the other is the intra-op thread configured via intra_op_num_threads. I limit the calling thread to a single thread with torch.set_num_threads(1), and for the child thread created by onnxruntime I use the session options configuration to pin it to a specific CPU core. Now I also want to set the CPU affinity of the main (calling) thread, so that all threads spawned by a container use only the specified CPU cores. I'm planning to pass the CPU cores to use to each container as arguments (see the sketch after the setup details below).
(It is documented that onnxruntime handles neither the CPU utilisation nor the affinity of the main/calling thread.)
CPU: AMD EPYC 7452 Zen 2
Number of cores: 64
onnxruntime version: 1.14.1
torch version: 2.0.1+cpu
Model: all-MiniLM-L6-v2
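For reference, the session in each worker is configured roughly like this (a minimal sketch: the model path and core IDs are placeholders, and I'm loading the Optimum-exported model.onnx with a plain onnxruntime.InferenceSession here):

```python
import torch
import onnxruntime as ort

torch.set_num_threads(1)  # keep torch work on the calling thread single-threaded

so = ort.SessionOptions()
so.intra_op_num_threads = 2  # the calling thread + 1 extra intra-op thread
# Pin the one extra intra-op thread: the affinity string takes one
# semicolon-separated entry per extra thread, and the processor IDs appear
# to be 1-based, so "2" here is a placeholder meaning "the 2nd logical CPU".
so.add_session_config_entry("session.intra_op_thread_affinities", "2")

session = ort.InferenceSession(
    "model/model.onnx",              # Optimum-exported model (placeholder path)
    sess_options=so,
    providers=["CPUExecutionProvider"],
)
```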
Update 1: I set the affinity of the calling thread using os.sched_setaffinity(pid, affinity_mask). As expected, the entire process effectively ends up restricted to that core: it completely overrides the affinity configured for onnxruntime, and all the threads that get spawned eventually run under the mask I set.
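Concretely, something like this (a sketch with a placeholder core ID; I'm using pid 0, which on Linux targets the calling thread), called in the worker before the session is created:

```python
import os

MAIN_CORE = 4  # placeholder: the core ID passed to this container

# On Linux, pid 0 refers to the calling thread; threads created afterwards
# inherit this affinity mask by default.
os.sched_setaffinity(0, {MAIN_CORE})
```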
Update 2: Since I am using Docker containers, I achieved this by setting cpuset to 2 cores per container in the docker-compose file. I'm running 24 containers of the same model, each spawning 1 Gunicorn worker (sync) with intra_op_num_threads set to 2. The onnxruntime thread is pinned to one of the two cores assigned to the container, and the main thread uses the other. So 24 containers over 48 cores can handle up to 24 requests at a time, load balanced with HAProxy. However, sending one request at a time gives an average inference time of 600 ms, whereas sending 24 parallel requests gives an average of 1.2 seconds. Is there any reason for this? Is it because the CPUs aren't switching between the incoming requests?
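Inside each container this works out to something like the sketch below (assumed layout: the container has been granted exactly two cores via cpuset, and the per-thread pinning is done as in the snippets above):

```python
import os

# The two cores granted via `cpuset` show up as this process's affinity mask.
main_core, onnx_core = sorted(os.sched_getaffinity(0))  # e.g. [2, 3] for cpuset "2,3"

os.sched_setaffinity(0, {main_core})  # calling thread -> first core
# onnx_core is then used for the single intra-op thread via the
# "session.intra_op_thread_affinities" entry shown earlier.
```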