
Cross-validation not seeded implies wrong Prediction Intervals #336

Closed
vincentblot28 opened this issue Aug 4, 2023 · 2 comments · Fixed by #337
Labels
Bug Type: bug Source: developers Proposed by developers.

@vincentblot28
Collaborator
Hello, I encountered a problem in MAPIE. When we use MapieRegressor with J+ or CV+ without any random state, the KFold and LeaveOneOut methods of sklearn are not seeded. This means that when computing the $R_i^{LOO}$, it can happen that we compute them with a model that used the $i$th observation as a training point. Here is a reproducible example showing that, without a seed, calling cv.split(X) at two different moments leads to different splits of the data (which is what happens in EnsembleRegressor):

from mapie.regression import MapieRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

mapie = MapieRegressor(method="plus", cv=5)
mapie.fit(X_train, y_train)

# Without a seed, these two enumerations of the folds can differ,
# so the "train" and "calibration" indices below are inconsistent.
for train_index, _ in mapie.estimator_.cv.split(X_train):
    print("Train index: ", train_index)

for _, cal_index in mapie.estimator_.cv.split(X_train):
    print("Cal index: ", cal_index)
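The underlying sklearn behaviour can be reproduced without MAPIE at all. The sketch below (pure scikit-learn, illustrative only) shows that a shuffled KFold without a random_state is free to enumerate different folds on each call to split(), while a seeded one is reproducible:

```python
# Illustrative sketch: an unseeded, shuffled KFold may yield different
# splits on each call to split(); a seeded one is reproducible.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

unseeded = KFold(n_splits=5, shuffle=True)  # no random_state
splits_a = [test for _, test in unseeded.split(X)]
splits_b = [test for _, test in unseeded.split(X)]
# splits_a and splits_b will generally differ: each split() call draws
# a fresh shuffle when random_state is None.

seeded = KFold(n_splits=5, shuffle=True, random_state=0)
splits_c = [test for _, test in seeded.split(X)]
splits_d = [test for _, test in seeded.split(X)]
# With a fixed random_state, both calls enumerate identical folds.
assert all((c == d).all() for c, d in zip(splits_c, splits_d))
```

This is exactly the mismatch the reproducible example above exposes inside EnsembleRegressor.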
@vincentblot28 vincentblot28 added the Bug Type: bug label Aug 4, 2023
@thibaultcordier thibaultcordier added the Source: developers Proposed by developers. label Aug 4, 2023
@thibaultcordier thibaultcordier added this to the Release 0.7.0 milestone Aug 4, 2023
@vincentblot28
Collaborator Author

Here is a test that should pass once the bug is fixed:

from mapie.regression import MapieRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split


def test_same_split_no_random_state():
    X, y = make_regression(10)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

    mapie = MapieRegressor(method="plus", cv=5)
    mapie.fit(X_train, y_train)

    # Enumerate the folds twice; once the bug is fixed, both passes
    # must produce identical train indices.
    train_indices_1 = []
    train_indices_2 = []
    for train_index, _ in mapie.estimator_.cv.split(X_train):
        train_indices_1.append(train_index)

    for train_index, _ in mapie.estimator_.cv.split(X_train):
        train_indices_2.append(train_index)

    for i in range(mapie.estimator_.cv.get_n_splits()):
        assert (train_indices_1[i] == train_indices_2[i]).all()

@thibaultcordier
Collaborator

Thank you for reporting this bug. The problem should be solved directly in check_cv in utils.py, as it is a wrapper around the cv attribute of the MAPIE estimator (called at fit time for both regression and classification). Accordingly, I would suggest a test for this function rather than the one above.
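A minimal sketch of the fix direction described here: when resolving the user-supplied cv, make sure the resulting cross-validator carries a fixed random_state so that two calls to split() enumerate the same folds. The helper name and signature below are hypothetical, not MAPIE's actual check_cv:

```python
# Hypothetical sketch (not MAPIE's actual check_cv): resolve ``cv``
# into a cross-validator whose split() is deterministic across calls.
from sklearn.model_selection import BaseCrossValidator, KFold


def check_cv_seeded(cv=None, random_state=42):
    """Resolve ``cv`` into a deterministic cross-validator."""
    if cv is None:
        cv = 5
    if isinstance(cv, int):
        # Shuffled but seeded: reproducible across split() calls.
        return KFold(n_splits=cv, shuffle=True, random_state=random_state)
    if isinstance(cv, BaseCrossValidator):
        # Seed an unseeded, shuffled splitter so repeated splits agree.
        if getattr(cv, "shuffle", False) and getattr(cv, "random_state", None) is None:
            cv.random_state = random_state
        return cv
    raise ValueError("Invalid cv argument.")
```

Non-shuffled splitters (e.g. plain KFold or LeaveOneOut) are already deterministic, so only the shuffled, unseeded case needs intervention.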
