[ci] fix flaky dask tests on GPU #5713

Closed
wants to merge 8 commits into from


jameslamb
Collaborator

Opening this to try to narrow down the flakiness in the GPU tests.

Reference #5677 (comment).

@jameslamb
Collaborator Author

4 of the GPU builds failed on the first commit I pushed here, all on the same test.

../tests/python_package_test/test_engine.py::test_multiclass_custom_eval[False] PASSED [ 73%]
../tests/python_package_test/test_engine.py::test_model_size /__w/1/s/.ci/test.sh: line 251:  7934 Killed                  pytest -vvv $BUILD_DIRECTORY/tests

Interestingly, all of those jobs made it through all of the Dask tests.
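
The `Killed` line in the log above is what the shell prints when the `pytest` process receives SIGKILL. The log does not say what sent the signal; memory pressure on the CI agent is a common cause, but that is an assumption here. A minimal sketch of how a wrapper script could tell "killed by a signal" apart from an ordinary test failure (the helper name `describe_exit` is hypothetical, and the demo in `__main__` assumes a POSIX system):

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a process return code into a short human-readable description."""
    # subprocess reports "killed by signal N" as a negative return code.
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    # Shells report the same situation as 128 + N (for example 137 for SIGKILL).
    if returncode > 128:
        try:
            return f"killed by {signal.Signals(returncode - 128).name}"
        except ValueError:
            pass  # not a valid signal number, treat as a plain exit code
    return f"exited with code {returncode}"


if __name__ == "__main__":
    # Demo: start a child interpreter that sends SIGKILL to itself.
    child = subprocess.run(
        [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    )
    print(describe_exit(child.returncode))
```

A return code that maps to SIGKILL means pytest never got the chance to report a failure for `test_model_size`, which matches the log ending mid-test.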

@jameslamb
Collaborator Author

jameslamb commented Feb 15, 2023

The success rate went up after skipping that one test, test_engine.py::test_model_size, but now I see another Dask test failure (build link).

FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-goss-group1-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-goss-group1-dataframe]
=================================== FAILURES ===================================
_____________________ test_ranker[data-goss-group1-array] ______________________

output = 'array', group = [5, 5, 5, 10, 10, 10, ...], boosting_type = 'goss'
tree_learner = 'data'
cluster = LocalCluster(f1e3d9c7, 'tcp://127.0.0.1:36971', workers=1, threads=2, memory=3.89 GiB)

    @pytest.mark.parametrize('output', ['array', 'dataframe', 'dataframe-with-categorical'])
    @pytest.mark.parametrize('group', [None, group_sizes])
    @pytest.mark.parametrize('boosting_type', boosting_types)
    @pytest.mark.parametrize('tree_learner', distributed_training_algorithms)
    def test_ranker(output, group, boosting_type, tree_learner, cluster):
        with Client(cluster) as client:
            if output == 'dataframe-with-categorical':
                X, y, w, g, dX, dy, dw, dg = _create_data(
                    objective='ranking',
                    output=output,
                    group=group,
                    n_features=1,
                    n_informative=1
                )
            else:
                X, y, w, g, dX, dy, dw, dg = _create_data(
                    objective='ranking',
                    output=output,
                    group=group
                )


_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
E   lightgbm.basic.LightGBMError: Socket recv error, Connection reset by peer (code: 104)

/home/AzDevOps_azpcontainer/.local/lib/python3.7/site-packages/lightgbm/basic.py:177: LightGBMError
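
For reference, the `code: 104` in the error above matches the Linux value of `errno.ECONNRESET`: the other end of the socket closed the connection while this process was still reading from it. The traceback does not show why the peer went away. A small sketch for looking up such codes with the standard library (the helper name `describe_errno` is hypothetical, and the numeric value is platform-dependent):

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


if __name__ == "__main__":
    # Use the symbolic constant rather than a hard-coded 104,
    # since the number differs between operating systems.
    print(describe_errno(errno.ECONNRESET))
```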

@jameslamb
Collaborator Author

Ok, just skipping the Dask learning-to-rank + GOSS test didn't help. All of the other Dask learning-to-rank GPU tests failed too. (build link)

=========================== short test summary info ============================
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-gbdt-None-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-gbdt-None-dataframe]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-gbdt-None-dataframe-with-categorical]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-gbdt-group1-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-gbdt-group1-dataframe]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-gbdt-group1-dataframe-with-categorical]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-dart-None-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-dart-None-dataframe]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-dart-None-dataframe-with-categorical]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-dart-group1-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-dart-group1-dataframe]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-dart-group1-dataframe-with-categorical]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-goss-None-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-goss-None-dataframe]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-goss-None-dataframe-with-categorical]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-goss-group1-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-goss-group1-dataframe]
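
To check that the failures are not specific to one boosting type, the summary above can be tallied by parametrization. This is a rough sketch: the regular expression assumes the test IDs follow the `tree_learner-boosting_type-group-output` order seen in the log, and `tally` is a hypothetical helper, not part of the LightGBM test suite.

```python
import re
from collections import Counter

# The FAILED lines copied from the short test summary above.
SUMMARY = """\
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-gbdt-None-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-gbdt-None-dataframe]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-gbdt-None-dataframe-with-categorical]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-gbdt-group1-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-gbdt-group1-dataframe]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-gbdt-group1-dataframe-with-categorical]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-dart-None-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-dart-None-dataframe]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-dart-None-dataframe-with-categorical]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-dart-group1-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-dart-group1-dataframe]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-dart-group1-dataframe-with-categorical]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-goss-None-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-goss-None-dataframe]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-goss-None-dataframe-with-categorical]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-goss-group1-array]
FAILED ../tests/python_package_test/test_dask.py::test_ranker[data-goss-group1-dataframe]
"""

# The last field may itself contain hyphens ("dataframe-with-categorical"),
# so only the first three fields are split on "-".
TEST_ID = re.compile(
    r"::test_ranker\[(?P<tree_learner>[^-]+)-(?P<boosting_type>[^-]+)"
    r"-(?P<group>[^-]+)-(?P<output>.+)\]$"
)


def tally(summary: str, field: str) -> Counter:
    """Count failed test_ranker cases by one parametrized field."""
    counts = Counter()
    for line in summary.splitlines():
        match = TEST_ID.search(line.strip())
        if match:
            counts[match.group(field)] += 1
    return counts


if __name__ == "__main__":
    print(tally(SUMMARY, "boosting_type"))
    print(tally(SUMMARY, "output"))
```

For the lines shown, the counts come out to 6 for `gbdt`, 6 for `dart`, and 5 for `goss`, so the failures are spread across every boosting type rather than concentrated in GOSS.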

@jameslamb
Collaborator Author

This has been open for a long time without new activity, and there are several other higher-priority issues I'm spending my limited time on right now. I'm closing this to focus on other things.

@jameslamb jameslamb closed this Sep 4, 2023
@jameslamb jameslamb deleted the ci/gpu-tests branch September 4, 2023 03:43

github-actions bot commented Dec 6, 2023

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 6, 2023