
Cross-validation not seeded implies wrong Prediction Intervals #336

Closed
vincentblot28 opened this issue Aug 4, 2023 · 2 comments · Fixed by #337
Labels
Bug Type: bug Source: developers Proposed by developers.

@vincentblot28
Collaborator
Hello, I encountered a problem in MAPIE. When we use MapieRegressor with J+ or CV+ without any random state, the KFold and LeaveOneOut methods of sklearn are not seeded. This means that when computing the $R_i^{LOO}$, it can happen that we compute them with a model that used the $i$th observation as a training point. Here is a reproducible example showing that, without a seed, calling cv.split(X) at two different moments leads to different splits of the data (which is what happens in EnsembleRegressor):

from mapie.regression import MapieRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

mapie = MapieRegressor(method="plus", cv=5)
mapie.fit(X_train, y_train)

# Without a seed, these two enumerations of the folds can differ,
# so the "train" and "calibration" indices below are inconsistent.
for train_index, _ in mapie.estimator_.cv.split(X_train):
    print("Train index: ", train_index)

for _, cal_index in mapie.estimator_.cv.split(X_train):
    print("Cal index: ", cal_index)
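The underlying sklearn behaviour can be reproduced without MAPIE at all. The sketch below (pure scikit-learn, illustrative only) shows that a shuffled KFold without a random_state is free to enumerate different folds on each call to split(), while a seeded one is reproducible:

```python
# Illustrative sketch: an unseeded, shuffled KFold may yield different
# splits on each call to split(); a seeded one is reproducible.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

unseeded = KFold(n_splits=5, shuffle=True)  # no random_state
splits_a = [test for _, test in unseeded.split(X)]
splits_b = [test for _, test in unseeded.split(X)]
# splits_a and splits_b will generally differ: each split() call draws
# a fresh shuffle when random_state is None.

seeded = KFold(n_splits=5, shuffle=True, random_state=0)
splits_c = [test for _, test in seeded.split(X)]
splits_d = [test for _, test in seeded.split(X)]
# With a fixed random_state, both calls enumerate identical folds.
assert all((c == d).all() for c, d in zip(splits_c, splits_d))
```

This is exactly the mismatch the reproducible example above exposes inside EnsembleRegressor.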
@vincentblot28 vincentblot28 added the Bug Type: bug label Aug 4, 2023
@thibaultcordier thibaultcordier added the Source: developers Proposed by developers. label Aug 4, 2023
@thibaultcordier thibaultcordier added this to the Release 0.7.0 milestone Aug 4, 2023
@vincentblot28
Collaborator Author

Here is a test that should pass once the bug is fixed:

from mapie.regression import MapieRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split


def test_same_split_no_random_state():
    X, y = make_regression(10)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

    mapie = MapieRegressor(method="plus", cv=5)
    mapie.fit(X_train, y_train)

    # Enumerate the folds twice; once the bug is fixed, both passes
    # must produce identical train indices.
    train_indices_1 = []
    train_indices_2 = []
    for train_index, _ in mapie.estimator_.cv.split(X_train):
        train_indices_1.append(train_index)

    for train_index, _ in mapie.estimator_.cv.split(X_train):
        train_indices_2.append(train_index)

    for i in range(mapie.estimator_.cv.get_n_splits()):
        assert (train_indices_1[i] == train_indices_2[i]).all()

@thibaultcordier
Collaborator

Thank you for reporting this bug. The problem should be solved directly in check_cv in utils.py, as it is a wrapper around the cv attribute of the MAPIE estimator (called at fit time for both regression and classification). Accordingly, I would suggest a test for this function rather than the one above.
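A minimal sketch of the fix direction described here: when resolving the user-supplied cv, make sure the resulting cross-validator carries a fixed random_state so that two calls to split() enumerate the same folds. The helper name and signature below are hypothetical, not MAPIE's actual check_cv:

```python
# Hypothetical sketch (not MAPIE's actual check_cv): resolve ``cv``
# into a cross-validator whose split() is deterministic across calls.
from sklearn.model_selection import BaseCrossValidator, KFold


def check_cv_seeded(cv=None, random_state=42):
    """Resolve ``cv`` into a deterministic cross-validator."""
    if cv is None:
        cv = 5
    if isinstance(cv, int):
        # Shuffled but seeded: reproducible across split() calls.
        return KFold(n_splits=cv, shuffle=True, random_state=random_state)
    if isinstance(cv, BaseCrossValidator):
        # Seed an unseeded, shuffled splitter so repeated splits agree.
        if getattr(cv, "shuffle", False) and getattr(cv, "random_state", None) is None:
            cv.random_state = random_state
        return cv
    raise ValueError("Invalid cv argument.")
```

Non-shuffled splitters (e.g. plain KFold or LeaveOneOut) are already deterministic, so only the shuffled, unseeded case needs intervention.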
