
Memory problem when using predictForMat #88

Open
Kwoks2017 opened this issue Jun 27, 2024 · 10 comments

Comments

@Kwoks2017

I want to output the SHAP values; the code is as follows:

double[] shap_values = model.predictForMat(featArray, 1, featArray.length, true, PredictionType.C_API_PREDICT_CONTRIB);

But when I ran stress testing, this code hit problems and eventually the process was killed. Are there any solutions for handling high QPS?

I tried to allocate more memory, but that did not help.

@Kwoks2017
Author

It seems to be a problem with memory allocation; I got the following error:

C [libc.so.6+0x84666] __libc_malloc+0x156

@Kwoks2017
Author

I tried again: if I reload the model for every new request, the problem is fixed, but that is very inconvenient, as loading the model costs so much time...
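A middle ground between reloading per request and sharing one booster could be to keep one model instance per worker thread. This is only a sketch of that pattern, under the assumption that separate LGBMBooster instances do not share the crashing native state; PerThreadModel is a hypothetical helper, not part of lightgbm4j.

```java
import java.util.function.Supplier;

// Sketch of a per-thread model holder. In the real service the Supplier
// would be something like () -> LGBMBooster.loadModelFromString(MODEL_STR)
// (hypothetical usage; assumes separate boosters avoid the shared state).
class PerThreadModel<T> {
    private final ThreadLocal<T> instance;

    PerThreadModel(Supplier<T> loader) {
        // Each worker thread lazily loads its own copy exactly once,
        // instead of reloading the model on every request.
        this.instance = ThreadLocal.withInitial(loader);
    }

    T get() {
        return instance.get(); // same object on repeated calls from one thread
    }
}
```

Each thread pays the model-loading cost once rather than per request, while threads still never touch the same native handle concurrently.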

@shuttie
Contributor

shuttie commented Jun 27, 2024

@Kwoks2017 can you make a reproducer example with some dummy model? I tried to replicate the issue with the following snippet:

LGBMDataset dataset = LGBMDataset.createFromFile("src/test/resources/cancer.csv", "header=true label=name:Classification", null);
LGBMBooster booster = LGBMBooster.create(dataset, "objective=binary label=name:Classification");
booster.updateOneIter();
booster.updateOneIter();
booster.updateOneIter();
Random rnd = new Random();
for (int i = 0; i < 10000000; i++) {
    float[] input = new float[9];
    for (int j=0; j<9; j++) {
        input[j] = rnd.nextFloat();
    }
    double[] pred = booster.predictForMat(input, 1, 9, true, PredictionType.C_API_PREDICT_CONTRIB);
}
dataset.close();
booster.close();

But in this case RSS memory stays flat at around 200 MB.

@Kwoks2017
Author

Kwoks2017 commented Jun 27, 2024

can you make a reproducer example with some dummy model? As I tried to replicate the issue with the following snippet:

Some differences:

  1. I load the model using LGBMBooster.loadModelFromString(XXX);
  2. In a plain loop the model is OK, as I tried to test the capability of the function. But when I did stress testing with QPS over 10, the model collapsed.

My model prediction function is as follows:

double[] featArray = new double[allFeat.size()];
for (int i = 0; i < allFeat.size(); i++) {
    featArray[i] = allFeat.get(i).getDoubleValue("feat");
}

double[] shap_values = model.predictForMat(featArray, 1, featArray.length, true, PredictionType.C_API_PREDICT_CONTRIB);

for (int i = 0; i < allFeat.size(); i++) {
    allFeat.get(i).put("shap", shap_values[i]);
}
And for the service, I create one shared instance per model (five models in total), like:
MODEL = LGBMBooster.loadModelFromString(MODEL_STR);

@shuttie
Contributor

shuttie commented Jun 28, 2024

Thank you for your hints, but can you please make a reproducer which is indeed reproducible? :) It should not be a snippet but a complete app I can run.

Some follow-up questions:

  • can you please also share how the model collapses: is it a segfault? If yes, what is being shown?
  • "qps is over 10": do you perform concurrent testing?

@shuttie
Contributor

shuttie commented Jun 28, 2024

I guess it's related to this upstream issue: microsoft/LightGBM#2392

The code reproducing this issue:

LGBMDataset dataset = LGBMDataset.createFromFile("src/test/resources/cancer.csv", "header=true label=name:Classification", null);
LGBMBooster booster = LGBMBooster.create(dataset, "objective=binary label=name:Classification");
booster.updateOneIter();
booster.updateOneIter();
booster.updateOneIter();
booster.updateOneIter();
booster.updateOneIter();

IntStream.range(0, 1000000)
        .parallel()
        .forEach(i -> {
            try {
                double[] pred = booster.predictForMat(new float[]{1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f}, 1, 9, true, PredictionType.C_API_PREDICT_CONTRIB);
            } catch (LGBMException e) {
                e.printStackTrace();
            }
        });
dataset.close();
booster.close();

It crashes with:

[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
corrupted double-linked list

Process finished with exit code 134 (interrupted by signal 6:SIGABRT)

@shuttie
Contributor

shuttie commented Jun 28, 2024

And it crashes only with C_API_PREDICT_CONTRIB; with C_API_PREDICT_NORMAL it's perfectly fine. So I guess it's an upstream bug: some parts of the LightGBM C library are not thread-safe.
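Until the upstream issue is fixed, one hedged workaround sketch is to serialize all CONTRIB predictions through a single lock, so the non-thread-safe native path is never entered concurrently (at the cost of SHAP throughput). SerializedPredictor is a hypothetical helper written for this thread, not part of lightgbm4j.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical wrapper: only one thread at a time may run the SHAP
// (CONTRIB) prediction; normal predictions can stay unsynchronized.
class SerializedPredictor {
    private final ReentrantLock lock = new ReentrantLock();

    // In the real service the Supplier would be something like:
    // () -> booster.predictForMat(feats, 1, feats.length, true,
    //                             PredictionType.C_API_PREDICT_CONTRIB)
    <T> T predictContrib(Supplier<T> call) {
        lock.lock();          // serialize entry into the native CONTRIB path
        try {
            return call.get();
        } finally {
            lock.unlock();    // always release, even if prediction throws
        }
    }
}
```

This trades parallelism for safety on the SHAP path only, which matches the observation that C_API_PREDICT_NORMAL is fine under concurrency.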

@Kwoks2017
Author

Sorry for the inconvenience caused; I have been trying to solve this issue on the server for the last two days. Later I will construct a simple case to simulate it.

For the other questions:

  • can you please also share how the model collapses: is it a segfault? If yes, what is being shown?
    The condition was strange: the process was killed automatically, and before that happened, the only error we could catch in the server log was like C [libc.so.6+0x84666] __libc_malloc+0x156, something like a fatal error in the C library, or perhaps a memory leak?

  • do you perform concurrent testing?
    Yes, I am doing concurrent testing. If for every request I need to run the whole process (load the model, make a prediction, and close the model), it may become a bottleneck.

And I guess computing the SHAP values allocates excessive memory, which causes this issue.

@Kwoks2017
Author

Kwoks2017 commented Jun 28, 2024

And it crashes only with C_API_PREDICT_CONTRIB; with C_API_PREDICT_NORMAL it's perfectly fine. So I guess it's an upstream bug: some parts of the LightGBM C library are not thread-safe.

Yes, I am afraid that when making SHAP predictions the process is not thread-safe: concurrent requests continuously allocate more and more memory, which leads to the collapse of the process. I tried closing the model after finishing the SHAP prediction, and it collapsed either way.

And the next thing I want to do is to separate the model cache from model prediction, for example by putting the model cache in its own Docker container.

@shuttie
Contributor

shuttie commented Jun 28, 2024

Yes, it's an upstream issue: microsoft/LightGBM#5482
