
Memory problem when using predictForMat #88

Open
Kwoks2017 opened this issue Jun 27, 2024 · 10 comments

Comments

@Kwoks2017

I want to output the SHAP values; the code is as follows:

double[] shap_values = model.predictForMat(featArray, 1, featArray.length, true, PredictionType.C_API_PREDICT_CONTRIB);

But when I ran stress testing, this code hit problems and eventually the process was killed. Are there any solutions for handling high QPS?

I tried to allocate more memory, but that did not help.

@Kwoks2017
Author

It seems to be a problem with memory allocation; I got the following error:

C [libc.so.6+0x84666] __libc_malloc+0x156

@Kwoks2017
Author

I tried again: if I reload the model for every new request, the problem is fixed, but that is very inconvenient, as loading the model costs so much time...
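A middle ground between reloading per request and sharing one booster could be to keep one model instance per worker thread. This is only a sketch of that pattern, under the assumption that separate LGBMBooster instances do not share the crashing native state; PerThreadModel is a hypothetical helper, not part of lightgbm4j.

```java
import java.util.function.Supplier;

// Sketch of a per-thread model holder. In the real service the Supplier
// would be something like () -> LGBMBooster.loadModelFromString(MODEL_STR)
// (hypothetical usage; assumes separate boosters avoid the shared state).
class PerThreadModel<T> {
    private final ThreadLocal<T> instance;

    PerThreadModel(Supplier<T> loader) {
        // Each worker thread lazily loads its own copy exactly once,
        // instead of reloading the model on every request.
        this.instance = ThreadLocal.withInitial(loader);
    }

    T get() {
        return instance.get(); // same object on repeated calls from one thread
    }
}
```

Each thread pays the model-loading cost once rather than per request, while threads still never touch the same native handle concurrently.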

@shuttie
Contributor

shuttie commented Jun 27, 2024

@Kwoks2017 can you make a reproducer example with some dummy model? I tried to replicate the issue with the following snippet:

LGBMDataset dataset = LGBMDataset.createFromFile("src/test/resources/cancer.csv", "header=true label=name:Classification", null);
LGBMBooster booster = LGBMBooster.create(dataset, "objective=binary label=name:Classification");
booster.updateOneIter();
booster.updateOneIter();
booster.updateOneIter();
Random rnd = new Random();
for (int i = 0; i < 10000000; i++) {
    float[] input = new float[9];
    for (int j=0; j<9; j++) {
        input[j] = rnd.nextFloat();
    }
    double[] pred = booster.predictForMat(input, 1, 9, true, PredictionType.C_API_PREDICT_CONTRIB);
}
dataset.close();
booster.close();

But in this case RSS memory stays flat at around 200 MB.

@Kwoks2017
Author

Kwoks2017 commented Jun 27, 2024

can you make a reproducer example with some dummy model? As I tried to replicate the issue with the following snippet:

Some differences:

  1. I load the model using LGBMBooster.loadModelFromString(XXX);
  2. In a plain loop the model is OK, as I tried to test the capability of the function. But when I did stress testing with QPS over 10, the model collapsed.

My model prediction function is as follows:

double[] featArray = new double[allFeat.size()];
for (int i = 0; i < allFeat.size(); i++) {
    featArray[i] = allFeat.get(i).getDoubleValue("feat");
}

double[] shap_values = model.predictForMat(featArray, 1, featArray.length, true, PredictionType.C_API_PREDICT_CONTRIB);

for (int i = 0; i < allFeat.size(); i++) {
    allFeat.get(i).put("shap", shap_values[i]);
}
And for the service, I create one shared instance per model (five models in total), like:
MODEL = LGBMBooster.loadModelFromString(MODEL_STR);

@shuttie
Contributor

shuttie commented Jun 28, 2024

Thank you for your hints, but can you please make a reproducer which is indeed reproducible? :) It should not be a snippet but a complete app I can run.

Some follow-up questions:

  • can you please also share how the model collapses: is it a segfault? If yes, what is being shown?
  • "qps is over 10": do you perform concurrent testing?

@shuttie
Contributor

shuttie commented Jun 28, 2024

I guess it's related to this upstream issue: microsoft/LightGBM#2392

The code reproducing this issue:

LGBMDataset dataset = LGBMDataset.createFromFile("src/test/resources/cancer.csv", "header=true label=name:Classification", null);
LGBMBooster booster = LGBMBooster.create(dataset, "objective=binary label=name:Classification");
booster.updateOneIter();
booster.updateOneIter();
booster.updateOneIter();
booster.updateOneIter();
booster.updateOneIter();

IntStream.range(0, 1000000)
        .parallel()
        .forEach(i -> {
            try {
                double[] pred = booster.predictForMat(new float[]{1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f}, 1, 9, true, PredictionType.C_API_PREDICT_CONTRIB);
            } catch (LGBMException e) {
                e.printStackTrace();
            }
        });
dataset.close();
booster.close();

It crashes with:

[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
corrupted double-linked list

Process finished with exit code 134 (interrupted by signal 6:SIGABRT)

@shuttie
Contributor

shuttie commented Jun 28, 2024

And it crashes only with C_API_PREDICT_CONTRIB; with C_API_PREDICT_NORMAL it's perfectly fine. So I guess it's an upstream bug: some parts of the LightGBM C library are not thread-safe.
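Until the upstream issue is fixed, one hedged workaround sketch is to serialize all CONTRIB predictions through a single lock, so the non-thread-safe native path is never entered concurrently (at the cost of SHAP throughput). SerializedPredictor is a hypothetical helper written for this thread, not part of lightgbm4j.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical wrapper: only one thread at a time may run the SHAP
// (CONTRIB) prediction; normal predictions can stay unsynchronized.
class SerializedPredictor {
    private final ReentrantLock lock = new ReentrantLock();

    // In the real service the Supplier would be something like:
    // () -> booster.predictForMat(feats, 1, feats.length, true,
    //                             PredictionType.C_API_PREDICT_CONTRIB)
    <T> T predictContrib(Supplier<T> call) {
        lock.lock();          // serialize entry into the native CONTRIB path
        try {
            return call.get();
        } finally {
            lock.unlock();    // always release, even if prediction throws
        }
    }
}
```

This trades parallelism for safety on the SHAP path only, which matches the observation that C_API_PREDICT_NORMAL is fine under concurrency.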

@Kwoks2017
Author

Sorry for the inconvenience caused; I have been trying to solve this issue on the server for the last two days. Later I will construct a simple case to simulate it.

For the other questions:

  • can you please also share how the model collapses: is it a segfault? If yes, what is being shown?
    The condition was strange: the process was killed automatically, and before that happened, the only error we could catch in the server log was like C [libc.so.6+0x84666] __libc_malloc+0x156, something like a fatal error in the C library, or perhaps a memory leak?

  • do you perform concurrent testing?
    Yes, I am doing concurrent testing. If for every request I need to run the whole process (load the model, make a prediction, and close the model), it may become a bottleneck.

And I guess computing the SHAP values allocates excessive memory, which causes this issue.

@Kwoks2017
Author

Kwoks2017 commented Jun 28, 2024

And it crashes only with C_API_PREDICT_CONTRIB; with C_API_PREDICT_NORMAL it's perfectly fine. So I guess it's an upstream bug: some parts of the LightGBM C library are not thread-safe.

Yes, I am afraid that when making SHAP predictions the process is not thread-safe: concurrent requests continuously allocate more and more memory, which leads to the collapse of the process. I tried closing the model after finishing the SHAP prediction, and it collapsed either way.

And the next thing I want to do is to separate the model cache from model prediction, for example by putting the model cache in its own Docker container.

@shuttie
Contributor

shuttie commented Jun 28, 2024

Yes, it's an upstream issue: microsoft/LightGBM#5482
