
Performance improvements #81

Closed
vad0 opened this issue Apr 22, 2024 · 7 comments · Fixed by #87

Comments

@vad0

vad0 commented Apr 22, 2024

Thanks a lot for the clear examples of using the SWIG APIs. I tried to use this library in a low-latency environment and had to do a number of things to make it faster. Some of them could be described in the documentation, which would save users time; others can be implemented in code. My goal is to make predictions as fast as possible. Here are the benchmarks.

Benchmark                                              Mode  Cnt    Score      Error  Units
LightGbmBenchmark.predictDefault                       avgt    5  909.075 ± 2637.741  us/op
LightGbmBenchmark.predictSingleThread                  avgt    5   12.242 ±    3.748  us/op

LightGbmBenchmark.predictSingleRow                     avgt    5    7.951 ±    0.341  us/op
LightGbmBenchmark.predictSingleRowNoAllocation         avgt    5    6.969 ±    0.155  us/op
LightGbmBenchmark.predictSingleRowFast                 avgt    5    2.821 ±    1.061  us/op
LightGbmBenchmark.predictSingleRowUnsafe               avgt    5    2.535 ±    0.110  us/op

The base case is the predictDefault benchmark, which calls the predictForMat function. It takes almost 1 ms. Why? Because by default LightGBM tries to parallelize the computation: it creates threads under the hood, executes jobs asynchronously, waits for results, and so on. None of this helps when the job is a single prediction. So I added the following argument: "num_threads=1 device=cpu". The result can be seen in the predictSingleThread benchmark: only 12 us. I suggest passing this argument by default when making predictions.
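For reference, a minimal sketch of what passing that parameter string through lightgbm4j could look like (method names, the model path and the loadFeatures helper are illustrative; exact signatures vary between versions, so check the javadoc for your release):

```java
// Illustrative only: assumes an LGBMBooster.predictForMat overload that
// accepts a LightGBM C API parameter string; verify against your version.
LGBMBooster booster = LGBMBooster.createFromModelfile("model.txt");
float[] row = loadFeatures(); // hypothetical helper producing one feature row
double[] pred = booster.predictForMat(
        row, 1, numFeatures, /* isRowMajor = */ true,
        PredictionType.C_API_PREDICT_NORMAL,
        "num_threads=1 device=cpu"); // disable LightGBM's internal thread pool
```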

Then I found out that there is a method which predicts only a single row; it takes 7.9 us on my machine. Please mention it in the docs! 7.9 us is good, but it can be improved further.

The predictForMatSingleRow method allocates a bunch of off-heap structures under the hood, makes the prediction, then frees those structures. If we reuse them, we can save some time: the predictSingleRowNoAllocation benchmark takes 6.9 us. I would suggest overloading predictForMatSingleRow, extracting dataBuffer, outBuffer and outLength into parameters.
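The proposed overload could look something like this (a hypothetical signature, not part of lightgbm4j today; the SWIG pointer type names are illustrative):

```java
// Hypothetical API sketch: the caller owns and reuses the native buffers
// across calls instead of allocating and freeing them per prediction.
double[] predictForMatSingleRow(
        float[] input,
        PredictionType type,
        SWIGTYPE_p_void dataBuffer,      // reused native copy of the input row
        SWIGTYPE_p_double outBuffer,     // reused native output buffer
        SWIGTYPE_p_long_long outLength); // reused output-length holder
```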

Still, we can do better. LightGBM allows you to do the preparation once and then make many predictions, by calling LGBM_BoosterPredictForMatSingleRowFastInit and LGBM_BoosterPredictForMatSingleRowFast respectively. lightgbm4j doesn't expose this functionality. If we make predictions this way, the single-call time drops further, to 2.8 us. async-profiler shows that LightGBM does not call malloc/free at all in this case. These methods should probably be added to the friendly API.
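A sketch of how the two-phase fast path could look from Java (names modeled on the C API functions above; any lightgbm4j wrapper for this, if added, may use different signatures):

```java
// Illustrative sketch: run the init once, outside the hot path.
LGBMBooster.FastConfig cfg = booster.predictForMatSingleRowFastInit(
        PredictionType.C_API_PREDICT_NORMAL,
        numFeatures,
        "num_threads=1 device=cpu");

// Hot path: per-call work only, no native malloc/free inside LightGBM.
double pred = booster.predictForMatSingleRowFast(cfg, row);

cfg.close(); // hypothetical cleanup; release the native state when done
```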

And finally, we can do even better. JNI is not the fastest way to pass data into native memory. Since SWIG gives us raw addresses, we can read and write at those addresses using Unsafe, or use the friendlier UnsafeBuffer from Agrona. This saves another 0.29 us, a 10% speed-up at this point. Here is my repo with the experiments: https://github.com/vad0/lightgbm. If anyone is interested in implementing some of the things I described, I'm ready to help.
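The Unsafe idea can be demonstrated standalone. The sketch below (plain sun.misc.Unsafe, no Agrona or LightGBM dependency; class and method names are mine) writes a feature row straight into off-heap memory at a raw address and reads it back, which is the same access pattern you would use against an address handed out by SWIG:

```java
import java.lang.reflect.Field;
import java.util.Arrays;
import sun.misc.Unsafe;

public class UnsafeRowBuffer {
    private static final Unsafe UNSAFE;
    static {
        try {
            // sun.misc.Unsafe has no public constructor; grab the singleton.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Copies a row into native memory and reads it back, byte-for-byte. */
    public static double[] roundTrip(double[] values) {
        long bytes = (long) values.length * Double.BYTES;
        long addr = UNSAFE.allocateMemory(bytes); // raw off-heap address
        try {
            for (int i = 0; i < values.length; i++) {
                UNSAFE.putDouble(addr + (long) i * Double.BYTES, values[i]);
            }
            double[] out = new double[values.length];
            for (int i = 0; i < values.length; i++) {
                out[i] = UNSAFE.getDouble(addr + (long) i * Double.BYTES);
            }
            return out;
        } finally {
            UNSAFE.freeMemory(addr); // no GC for native memory; free it yourself
        }
    }

    public static void main(String[] args) {
        double[] row = {0.1, 0.2, 0.3};
        System.out.println(Arrays.toString(roundTrip(row)));
    }
}
```

With Agrona the same access is a little friendlier: wrapping the raw address in an UnsafeBuffer gives typed getDouble/putDouble accessors over the same memory.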

@jornd13

jornd13 commented Jun 26, 2024

Fascinating performance/speed improvements for inference.
If my use case were to generate predictions in as parallel a fashion as possible (a batch scenario), are you saying that this will be supported by default when I simply use the predictForMat function?

Secondly, if I were to run training or inference workloads on my GPU, is there currently an easy way to expose a switch from "device=cpu" to "device=gpu" within the lightgbm4j API? Thanks a lot!

@vad0
Author

vad0 commented Jun 26, 2024

My case is different: I need to make predictions in a single thread as fast as possible, so I can't give informed advice about the parallel scenario.

@shuttie
Contributor

shuttie commented Jun 27, 2024

Hey @vad0 thanks for your ideas on latency!

I've exposed the fast-predict methods in the API, and also added a section to the docs on the latency implications of the predict-mat, predict-row and predict-fast methods.

As for exposing the SWIG buffers externally to avoid allocations, I'm on the fence about this idea, as it puts you one step closer to an accidental JVM segfault if you do something wrong with those data structures. Feel free to reopen the ticket (or propose a PR with a safe way of avoiding the allocations altogether).

@shuttie closed this as completed Jun 27, 2024
@shuttie
Contributor

shuttie commented Jun 27, 2024

As for single-row latency, the recently released 4.4 has a fix related to your issue: microsoft/LightGBM#6024

@jornd13

jornd13 commented Jun 30, 2024

Is it possible that the published Maven package lightgbm4j-4.3.0-1, which is referenced in the docs, does not contain these latest changes yet? E.g. I don't seem to find LGBMBooster.FastConfig exposed there.

@shuttie
Contributor

shuttie commented Jul 2, 2024

@jornd13 Yes, these changes were made only on the master branch (so previous artifacts were unchanged), but I'm preparing the new release now.

@shuttie
Contributor

shuttie commented Jul 2, 2024

Here it is: https://github.com/metarank/lightgbm4j/releases/tag/4.4.0-1. Maven artifacts may take up to 8 hours to become available.


3 participants