
Performance improvements #81

Closed
vad0 opened this issue Apr 22, 2024 · 7 comments · Fixed by #87

Comments

@vad0

vad0 commented Apr 22, 2024

Thanks a lot for the clear examples of using the SWIG APIs. I tried to use this library in a low-latency environment and had to do a number of things to make it faster. Some of them could be described in the documentation, which would save users time; others can be implemented in code. My goal is to make predictions as fast as possible. Here are the benchmarks.

Benchmark                                              Mode  Cnt    Score      Error  Units
LightGbmBenchmark.predictDefault                       avgt    5  909.075 ± 2637.741  us/op
LightGbmBenchmark.predictSingleThread                  avgt    5   12.242 ±    3.748  us/op

LightGbmBenchmark.predictSingleRow                     avgt    5    7.951 ±    0.341  us/op
LightGbmBenchmark.predictSingleRowNoAllocation         avgt    5    6.969 ±    0.155  us/op
LightGbmBenchmark.predictSingleRowFast                 avgt    5    2.821 ±    1.061  us/op
LightGbmBenchmark.predictSingleRowUnsafe               avgt    5    2.535 ±    0.110  us/op

The base case is the predictDefault benchmark, which calls the predictForMat function. It takes almost 1 ms. Why? Because by default LightGBM tries to parallelize the computation: it creates threads under the hood, executes jobs asynchronously, waits for results, and so on. None of this helps when the job is a single prediction. So I added the following argument: "num_threads=1 device=cpu". The result can be seen in the predictSingleThread benchmark: only 12 us. I suggest passing this argument by default when making predictions.
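For reference, a minimal sketch of what passing that parameter string through lightgbm4j could look like (method names, the model path and the loadFeatures helper are illustrative; exact signatures vary between versions, so check the javadoc for your release):

```java
// Illustrative only: assumes an LGBMBooster.predictForMat overload that
// accepts a LightGBM C API parameter string; verify against your version.
LGBMBooster booster = LGBMBooster.createFromModelfile("model.txt");
float[] row = loadFeatures(); // hypothetical helper producing one feature row
double[] pred = booster.predictForMat(
        row, 1, numFeatures, /* isRowMajor = */ true,
        PredictionType.C_API_PREDICT_NORMAL,
        "num_threads=1 device=cpu"); // disable LightGBM's internal thread pool
```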

Then I found out that there is a method which predicts only a single row; it takes 7.9 us on my machine. Please mention it in the docs! 7.9 us is good, but it can be improved further.

The predictForMatSingleRow method allocates a bunch of off-heap structures under the hood, makes the prediction, then frees those structures. If we reuse them, we can save some time: the predictSingleRowNoAllocation benchmark takes 6.9 us. I would suggest overloading predictForMatSingleRow, extracting dataBuffer, outBuffer and outLength into parameters.
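The proposed overload could look something like this (a hypothetical signature, not part of lightgbm4j today; the SWIG pointer type names are illustrative):

```java
// Hypothetical API sketch: the caller owns and reuses the native buffers
// across calls instead of allocating and freeing them per prediction.
double[] predictForMatSingleRow(
        float[] input,
        PredictionType type,
        SWIGTYPE_p_void dataBuffer,      // reused native copy of the input row
        SWIGTYPE_p_double outBuffer,     // reused native output buffer
        SWIGTYPE_p_long_long outLength); // reused output-length holder
```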

Still, we can do better. LightGBM allows you to do the preparation once and then make many predictions, by calling LGBM_BoosterPredictForMatSingleRowFastInit and LGBM_BoosterPredictForMatSingleRowFast respectively. lightgbm4j doesn't expose this functionality. If we make predictions this way, the single-call time drops further, to 2.8 us. async-profiler shows that LightGBM does not call malloc/free at all in this case. These methods should probably be added to the friendly API.
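A sketch of how the two-phase fast path could look from Java (names modeled on the C API functions above; any lightgbm4j wrapper for this, if added, may use different signatures):

```java
// Illustrative sketch: run the init once, outside the hot path.
LGBMBooster.FastConfig cfg = booster.predictForMatSingleRowFastInit(
        PredictionType.C_API_PREDICT_NORMAL,
        numFeatures,
        "num_threads=1 device=cpu");

// Hot path: per-call work only, no native malloc/free inside LightGBM.
double pred = booster.predictForMatSingleRowFast(cfg, row);

cfg.close(); // hypothetical cleanup; release the native state when done
```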

And finally, we can do even better. JNI is not the fastest way to pass data into native memory. Since SWIG gives us raw addresses, we can read and write at those addresses using Unsafe, or use the friendlier UnsafeBuffer from Agrona. This saves another 0.29 us, a 10% speed-up at this point. Here is my repo with the experiments: https://github.com/vad0/lightgbm. If anyone is interested in implementing some of the things I described, I'm ready to help.
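The Unsafe idea can be demonstrated standalone. The sketch below (plain sun.misc.Unsafe, no Agrona or LightGBM dependency; class and method names are mine) writes a feature row straight into off-heap memory at a raw address and reads it back, which is the same access pattern you would use against an address handed out by SWIG:

```java
import java.lang.reflect.Field;
import java.util.Arrays;
import sun.misc.Unsafe;

public class UnsafeRowBuffer {
    private static final Unsafe UNSAFE;
    static {
        try {
            // sun.misc.Unsafe has no public constructor; grab the singleton.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Copies a row into native memory and reads it back, byte-for-byte. */
    public static double[] roundTrip(double[] values) {
        long bytes = (long) values.length * Double.BYTES;
        long addr = UNSAFE.allocateMemory(bytes); // raw off-heap address
        try {
            for (int i = 0; i < values.length; i++) {
                UNSAFE.putDouble(addr + (long) i * Double.BYTES, values[i]);
            }
            double[] out = new double[values.length];
            for (int i = 0; i < values.length; i++) {
                out[i] = UNSAFE.getDouble(addr + (long) i * Double.BYTES);
            }
            return out;
        } finally {
            UNSAFE.freeMemory(addr); // no GC for native memory; free it yourself
        }
    }

    public static void main(String[] args) {
        double[] row = {0.1, 0.2, 0.3};
        System.out.println(Arrays.toString(roundTrip(row)));
    }
}
```

With Agrona the same access is a little friendlier: wrapping the raw address in an UnsafeBuffer gives typed getDouble/putDouble accessors over the same memory.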

@jornd13

jornd13 commented Jun 26, 2024

Fascinating performance/speed improvements for inference.
If my use case were to generate predictions in as parallel a fashion as possible (a batch scenario), are you saying that this will be supported by default when I simply use the predictForMat function?

Secondly, if I were to run training or inference workloads on my GPU, is there currently an easy way to expose a switch from "device=cpu" to "device=gpu" within the lightgbm4j API? Thanks a lot!

@vad0
Author

vad0 commented Jun 26, 2024

My case is different: I need to make predictions in a single thread as fast as possible, so I can't give informed advice about the parallel scenario.

@shuttie
Contributor

shuttie commented Jun 27, 2024

Hey @vad0 thanks for your ideas on latency!

I've exposed the fast-predict methods in the API, and also added a section to the docs on the latency implications of the predict-mat, predict-row and predict-fast methods.

As for exposing the SWIG buffers externally to avoid allocations, I'm on the fence about this idea, as it puts you one step closer to an accidental JVM segfault if you do something wrong with those data structures. Feel free to reopen the ticket (or propose a PR with a safe way of avoiding the allocations altogether).

@shuttie closed this as completed Jun 27, 2024
@shuttie
Contributor

shuttie commented Jun 27, 2024

As for single-row latency, the recently released 4.4 has a fix related to your issue: microsoft/LightGBM#6024

@jornd13

jornd13 commented Jun 30, 2024

Is it possible that the published Maven package lightgbm4j-4.3.0-1, which is referenced in the docs, does not contain these latest changes yet? E.g. I don't seem to find LGBMBooster.FastConfig exposed there.

@shuttie
Contributor

shuttie commented Jul 2, 2024

@jornd13 Yes, these changes were made only on the master branch (so previous artifacts were unchanged), but I'm preparing the new release now.

@shuttie
Contributor

shuttie commented Jul 2, 2024

Here it is: https://github.com/metarank/lightgbm4j/releases/tag/4.4.0-1. Maven artifacts may take up to 8 hours to become available.


3 participants