Replies: 9 comments
-
One practical approach is to launch multiple MXNet instances, each using a few cores (say 4, 8, or 16) for inference, which can greatly improve overall throughput. BTW, the MKL-DNN backend is much faster now. You need to specify the thread count and bind threads to physical cores explicitly: https://github.com/apache/incubator-mxnet/blob/master/MKLDNN_README.md Some out-of-the-box performance data is in the link below; you can see this method works very well.
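The multi-instance advice above could be sketched as a dry-run shell loop. The core counts, the `OMP_NUM_THREADS`/`KMP_AFFINITY` settings, and the `./my_inference` binary name are illustrative assumptions, not something from the thread; the loop only echoes the launch commands so the per-instance core bindings can be inspected before anything is actually started.

```shell
# Sketch: partition a hypothetical 16-core machine into 4 MXNet
# instances of 4 cores each. Dry run: echo the commands instead of
# executing them, so the bindings can be reviewed first.
CORES_PER_INSTANCE=4
TOTAL_CORES=16
for ((i = 0; i < TOTAL_CORES / CORES_PER_INSTANCE; i++)); do
  START=$((i * CORES_PER_INSTANCE))        # first core of this instance
  END=$((START + CORES_PER_INSTANCE - 1))  # last core of this instance
  echo "OMP_NUM_THREADS=$CORES_PER_INSTANCE KMP_AFFINITY=granularity=fine,compact taskset -c $START-$END ./my_inference"
done
```

Dropping the `echo` (and appending `&` plus a final `wait`) would actually launch the instances in parallel.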
-
@mxnet-label-bot : [C++, Question]
-
Hi pengzhao.
-
@qw42, you can use the master branch or any release version >= 1.2.0. A simple experiment can be run with the following commands.
-
@qw42 did you have a chance to try this, and did the approach help in your case?
-
Hi pengzhao. P.S. I am using MKL-DNN, but it is not related to the scaling issue.
-
Thanks for the feedback. It needs framework-level support, or you can write your own code for your targets.
-
What is the best way to achieve maximum throughput (forward passes only) on a multi-CPU/many-core server?
Description
I am using the CPP package. My problem is "embarrassingly parallel": each forward pass can be run independently. It looks to me that MXNet doesn't scale well with the number of CPUs.
My goal is to maximize overall throughput, not to minimize each forward pass's computation time.
I can achieve this by running multiple processes (one per core). Maybe I missed something, but I couldn't find a way to do it with multiple threads.