I have created an API service using Flask and Gunicorn to provide inference with the sentence-transformers model all-MiniLM-L6-v2. I converted and optimised the model with Hugging Face Optimum and run it with onnxruntime. To serve parallel requests, I'm creating multiple Docker containers, each spawning 1 Gunicorn worker. Inside each worker there are 2 threads: one is the main (calling) thread, and the other is the intra-op thread configured via intra_op_num_threads. I limit the calling thread to a single thread with torch.set_num_threads(1), and for the child thread created by onnxruntime I use the session options configuration to pin it to a specific CPU core. Now I also want to set the CPU affinity of the main (calling) thread, so that all threads spawned by a container use only the specified CPU cores. I'm planning to pass the CPU cores to use to each container as arguments (see the sketch after the setup details below).
(It is documented that onnxruntime handles neither the CPU utilisation nor the affinity of the main/calling thread.)
CPU: AMD EPYC 7452 Zen 2
Number of cores: 64
onnxruntime version: 1.14.1
torch version: 2.0.1+cpu
Model: all-MiniLM-L6-v2
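For reference, the session in each worker is configured roughly like this (a minimal sketch: the model path and core IDs are placeholders, and I'm loading the Optimum-exported model.onnx with a plain onnxruntime.InferenceSession here):

```python
import torch
import onnxruntime as ort

torch.set_num_threads(1)  # keep torch work on the calling thread single-threaded

so = ort.SessionOptions()
so.intra_op_num_threads = 2  # the calling thread + 1 extra intra-op thread
# Pin the one extra intra-op thread: the affinity string takes one
# semicolon-separated entry per extra thread, and the processor IDs appear
# to be 1-based, so "2" here is a placeholder meaning "the 2nd logical CPU".
so.add_session_config_entry("session.intra_op_thread_affinities", "2")

session = ort.InferenceSession(
    "model/model.onnx",              # Optimum-exported model (placeholder path)
    sess_options=so,
    providers=["CPUExecutionProvider"],
)
```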
Update 1: I set the affinity of the calling thread using os.sched_setaffinity(pid, affinity_mask). As expected, the entire process effectively ends up restricted to that core: it completely overrides the affinity configured for onnxruntime, and all the threads that get spawned eventually run under the mask I set.
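Concretely, something like this (a sketch with a placeholder core ID; I'm using pid 0, which on Linux targets the calling thread), called in the worker before the session is created:

```python
import os

MAIN_CORE = 4  # placeholder: the core ID passed to this container

# On Linux, pid 0 refers to the calling thread; threads created afterwards
# inherit this affinity mask by default.
os.sched_setaffinity(0, {MAIN_CORE})
```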
Update 2: Since I am using Docker containers, I achieved this by setting cpuset to 2 cores per container in the docker-compose file. I'm running 24 containers of the same model, each spawning 1 Gunicorn worker (sync) with intra_op_num_threads set to 2. The onnxruntime thread is pinned to one of the two cores assigned to the container, and the main thread uses the other. So 24 containers over 48 cores can handle up to 24 requests at a time, load balanced with HAProxy. However, sending one request at a time gives an average inference time of 600 ms, whereas sending 24 parallel requests gives an average of 1.2 seconds. Is there any reason for this? Is it because the CPUs aren't switching between the incoming requests?
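Inside each container this works out to something like the sketch below (assumed layout: the container has been granted exactly two cores via cpuset, and the per-thread pinning is done as in the snippets above):

```python
import os

# The two cores granted via `cpuset` show up as this process's affinity mask.
main_core, onnx_core = sorted(os.sched_getaffinity(0))  # e.g. [2, 3] for cpuset "2,3"

os.sched_setaffinity(0, {main_core})  # calling thread -> first core
# onnx_core is then used for the single intra-op thread via the
# "session.intra_op_thread_affinities" entry shown earlier.
```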