Handling too few classes in landmarker cross validation #170

Open
bjschoenfeld opened this issue Apr 5, 2019 · 6 comments

@bjschoenfeld
Member

Our landmarkers perform cross validation with 2 folds. Some datasets have only 1 instance of a particular target class. In that case, the validation in sklearn's cross validation throws an error, because it requires at least n_folds (2 in our case) instances of each class. Surfacing that raw error to the user is not ideal. How should we handle this?
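
One possible way to handle this, sketched under assumptions: detect the classes that have fewer than n_folds instances before cross validation and decide explicitly what to do with them. The helper names `rare_classes` and `drop_rare_classes` are hypothetical and not part of metalearn or sklearn, and dropping rows is only one option (reporting NaN for the landmarking metafeatures would be another).

```python
from collections import Counter


def rare_classes(y, n_folds=2):
    """Return {label: count} for classes with fewer than n_folds instances."""
    counts = Counter(y)
    return {label: n for label, n in counts.items() if n < n_folds}


def drop_rare_classes(X, y, n_folds=2):
    """Remove rows whose class is too small for n_folds-fold cross validation.

    Hypothetical helper: returns the filtered rows, the filtered labels,
    and the rare classes that were removed so the caller can report them.
    """
    rare = rare_classes(y, n_folds)
    kept = [(row, label) for row, label in zip(X, y) if label not in rare]
    X_kept = [row for row, _ in kept]
    y_kept = [label for _, label in kept]
    return X_kept, y_kept, rare


# Toy data: class "c" has a single instance, which is too few for 2 folds.
X = [[0.1], [0.2], [0.3], [0.4], [0.5]]
y = ["a", "a", "b", "b", "c"]
X_kept, y_kept, rare = drop_rare_classes(X, y, n_folds=2)
```

Returning the removed classes alongside the filtered data lets the caller log or warn about them instead of failing silently.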

@emrysshevek
Contributor

Running on the LL0_488_colleges_aaup dataset:

Traceback (most recent call last):
  File "venv/lib/python3.6/site-packages/metalearn/metafeatures/metafeatures.py", line 113, in compute
    n_folds, verbose
  File "venv/lib/python3.6/site-packages/metalearn/metafeatures/metafeatures.py", line 234, in _validate_compute_arguments
    n_folds, verbose
  File "venv/lib/python3.6/site-packages/metalearn/metafeatures/metafeatures.py", line 348, in _validate_n_folds
    f"{group.shape[0]}."
ValueError: The minimum number of instances in each class of Y is n_folds=2. Class VIIB has 1.

@epeters3 epeters3 added the bug label May 29, 2019
@epeters3 epeters3 modified the milestone: June 2019 Submission May 29, 2019
@bjschoenfeld
Member Author

Can we compare with OpenML on this?

@emrysshevek
Contributor

Similarly, datasets with fewer than 4 instances per class fail. Should we handle this case as well?

import numpy as np
import pandas as pd
from metalearn import Metafeatures

# Two classes with 3 instances each; X needs one row per label in y.
x = pd.DataFrame(np.random.rand(6, 2))
y = pd.Series(['a', 'a', 'a', 'b', 'b', 'b'])
Metafeatures().compute(x, y)

Traceback (most recent call last):
  File "", line 1, in
  File "metalearn/metafeatures/metafeatures.py", line 158, in compute
    value, compute_time = self._get_resource(metafeature_id)
  File "metalearn/metafeatures/metafeatures.py", line 390, in _get_resource
    computed_resources = f(**args)
  File "metalearn/metafeatures/landmarking_metafeatures.py", line 72, in get_lda
    return run_pipeline(X, Y, pipeline, n_folds, cv_seed)
  File "metalearn/metafeatures/landmarking_metafeatures.py", line 34, in run_pipeline
    'accuracy': accuracy_scorer, 'kappa': kappa_scorer
  File "sklearn/model_selection/_validation.py", line 240, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "sklearn/externals/joblib/parallel.py", line 917, in __call__
    if self.dispatch_one_batch(iterator):
  File "sklearn/externals/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "sklearn/externals/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "sklearn/externals/joblib/_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "sklearn/externals/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "sklearn/externals/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "sklearn/model_selection/_validation.py", line 528, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "sklearn/pipeline.py", line 267, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "sklearn/discriminant_analysis.py", line 435, in fit
    raise ValueError("The number of samples must be more "
ValueError: The number of samples must be more than the number of classes.
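
For reference, the same LDA message can be triggered with sklearn alone, without metalearn. This is a minimal sketch under assumptions, not the reporter's exact setup: it uses two classes with two instances each and plain `StratifiedKFold` with 2 splits, so every training fold holds exactly as many samples as there are classes. The toy data and variable names are illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold

# Toy data (assumption): 2 classes x 2 instances, 2 random features.
rng = np.random.RandomState(0)
X = rng.rand(4, 2)
y = np.array(["a", "a", "b", "b"])

train_sizes = []
errors = []
for train_idx, _ in StratifiedKFold(n_splits=2).split(X, y):
    train_sizes.append(len(train_idx))
    try:
        # Each training fold has one "a" and one "b": 2 samples, 2 classes.
        LinearDiscriminantAnalysis().fit(X[train_idx], y[train_idx])
    except ValueError as exc:
        errors.append(str(exc))

for size, message in zip(train_sizes, errors):
    print(f"train size {size}: {message}")
```

Collecting the messages instead of letting the first one propagate makes it easy to see that both folds fail for the same reason.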

@bjschoenfeld
Member Author

> datasets with fewer than 4 instances per class fail

I believe you, but why is it 4, not 2? We only do 2-fold cv.

@emrysshevek
Contributor

I think it's because with 2-fold cv the training set has only half as many instances, so each class needs at least 4.

@bjschoenfeld
Member Author

I would think that if there were only two instances and two folds, one instance would go to each fold. The folds would take turns being the train and test sets...
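
The fold rotation described above can be sketched in plain Python to see how many training samples each round gets. This is a simplified illustration under assumptions: instances are dealt to folds round-robin within each class, which approximates stratified splitting but is not sklearn's exact algorithm, and the function names are hypothetical.

```python
from collections import defaultdict


def round_robin_folds(y, n_folds=2):
    """Deal the instances of each class to the folds in turn (simplified stratification)."""
    folds = [[] for _ in range(n_folds)]
    seen = defaultdict(int)  # number of instances of each class dealt so far
    for index, label in enumerate(y):
        folds[seen[label] % n_folds].append(index)
        seen[label] += 1
    return folds


def train_test_rounds(folds):
    """Let each fold take a turn as the test set; the other folds form the training set."""
    rounds = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        rounds.append((train, test))
    return rounds


# Two classes with two instances each, 2-fold cv.
y = ["a", "a", "b", "b"]
rounds = train_test_rounds(round_robin_folds(y, n_folds=2))
n_classes = len(set(y))
train_sizes = [len(train) for train, _ in rounds]
```

With two instances per class, each fold does take its turn, but every training set ends up with exactly as many samples as classes, which is the situation the LDA message in the traceback complains about ("The number of samples must be more than the number of classes").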
