Handling too few classes in landmarker cross validation #170

bjschoenfeld · 2019-04-05T04:16:08Z

Our landmarkers perform cross validation with 2 folds. Some datasets have only 1 instance of a particular target class. In that case, the validation in sklearn's cross validation raises an error, because it requires at least n_folds (2 in our case) instances of each class. Surfacing that raw error to users is not ideal. How should we handle this?
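
One option, sketched here under assumptions, is to detect the under-populated classes before any cross validation runs, so the caller can decide whether to skip the landmarkers or drop those instances. `find_rare_classes` and `drop_rare_classes` are hypothetical helper names, not part of metalearn or sklearn, and the sketch uses plain Python lists rather than DataFrames.

```python
from collections import Counter


def find_rare_classes(y, n_folds=2):
    """Return {class_label: count} for classes with fewer than n_folds instances."""
    counts = Counter(y)
    return {label: count for label, count in counts.items() if count < n_folds}


def drop_rare_classes(X, y, n_folds=2):
    """Remove rows whose class is too small to appear in every fold.

    Hypothetical handling strategy: returns the filtered rows, the filtered
    labels, and the rare classes that were dropped so they can be reported.
    """
    rare = find_rare_classes(y, n_folds)
    kept = [(row, label) for row, label in zip(X, y) if label not in rare]
    X_kept = [row for row, _ in kept]
    y_kept = [label for _, label in kept]
    return X_kept, y_kept, rare


if __name__ == "__main__":
    X = [[0.1], [0.2], [0.3], [0.4], [0.5]]
    y = ["a", "a", "b", "b", "c"]  # class "c" has a single instance
    X_kept, y_kept, rare = drop_rare_classes(X, y, n_folds=2)
    print(rare)    # classes that cannot be split across 2 folds
    print(y_kept)  # labels that remain after filtering
```

Dropping rows changes the dataset that the landmarkers see, so returning the rare classes alongside the filtered data keeps that decision visible to the caller.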

emrysshevek · 2019-05-23T18:46:13Z

Running on the LL0_488_colleges_aaup dataset:

Traceback (most recent call last):
  File "venv/lib/python3.6/site-packages/metalearn/metafeatures/metafeatures.py", line 113, in compute
    n_folds, verbose
  File "venv/lib/python3.6/site-packages/metalearn/metafeatures/metafeatures.py", line 234, in _validate_compute_arguments
    n_folds, verbose
  File "venv/lib/python3.6/site-packages/metalearn/metafeatures/metafeatures.py", line 348, in _validate_n_folds
    f"{group.shape[0]}."
ValueError: The minimum number of instances in each class of Y is n_folds=2. Class VIIB has 1.

bjschoenfeld · 2019-05-30T00:44:57Z

Can we compare with OpenML on this?

emrysshevek · 2019-05-30T17:31:56Z

Similarly, datasets with fewer than 4 instances per class fail. Should we handle this case as well?

import pandas as pd
import numpy as np
from metalearn import Metafeatures
x = pd.DataFrame(np.random.rand(8,2))
y = pd.Series(['a','a','a','b','b','b'])
Metafeatures().compute(x,y)

Traceback (most recent call last):
  File "", line 1, in <module>
  File "metalearn/metafeatures/metafeatures.py", line 158, in compute
    value, compute_time = self._get_resource(metafeature_id)
  File "metalearn/metafeatures/metafeatures.py", line 390, in _get_resource
    computed_resources = f(**args)
  File "metalearn/metafeatures/landmarking_metafeatures.py", line 72, in get_lda
    return run_pipeline(X, Y, pipeline, n_folds, cv_seed)
  File "metalearn/metafeatures/landmarking_metafeatures.py", line 34, in run_pipeline
    'accuracy': accuracy_scorer, 'kappa': kappa_scorer
  File "sklearn/model_selection/_validation.py", line 240, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "sklearn/externals/joblib/parallel.py", line 917, in __call__
    if self.dispatch_one_batch(iterator):
  File "sklearn/externals/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "sklearn/externals/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "sklearn/externals/joblib/_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "sklearn/externals/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "sklearn/externals/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "sklearn/model_selection/_validation.py", line 528, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "sklearn/pipeline.py", line 267, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "sklearn/discriminant_analysis.py", line 435, in fit
    raise ValueError("The number of samples must be more "
ValueError: The number of samples must be more than the number of classes.

bjschoenfeld · 2019-05-30T20:40:31Z

> datasets with fewer than 4 instances per class fail

I believe you, but why is it 4, not 2? We only do 2-fold CV.

emrysshevek · 2019-05-30T20:43:28Z

I think it's because with 2-fold CV the training set has half as many instances, so it needs at least 4.

bjschoenfeld · 2019-05-30T20:56:59Z

I would think that if there were only two instances and two folds, one instance would go to each fold. The folds would take turns being the train and test sets...
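
One way to test both intuitions is to simulate the split. The sketch below assumes a simplified round-robin stratified assignment (`stratified_fold_assignment` and `training_split_ok` are hypothetical names, and this is not sklearn's exact algorithm) and applies the only condition stated in the LDA traceback above: the training split must contain more samples than classes.

```python
from collections import defaultdict


def stratified_fold_assignment(labels, n_folds):
    """Assign each instance to a fold, round-robin within its class.

    Simplified stand-in for a stratified k-fold splitter (an assumption,
    not sklearn's implementation).
    """
    seen = defaultdict(int)
    folds = []
    for label in labels:
        folds.append(seen[label] % n_folds)
        seen[label] += 1
    return folds


def training_split_ok(labels, n_folds):
    """For each held-out fold, report whether the training split has
    strictly more samples than distinct classes."""
    folds = stratified_fold_assignment(labels, n_folds)
    results = []
    for held_out in range(n_folds):
        train = [lab for lab, f in zip(labels, folds) if f != held_out]
        results.append(len(train) > len(set(train)))
    return results


if __name__ == "__main__":
    for per_class in (2, 3, 4):
        labels = ["a"] * per_class + ["b"] * per_class
        print(per_class, training_split_ok(labels, n_folds=2))
```

Under these assumptions, 2 instances per class leave each training split with one instance per class, so the check fails in both folds; 3 per class fails in one fold; 4 per class passes in both. That would be consistent with the reported threshold of 4, but it depends on how the real splitter distributes the leftover instances.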

epeters3 added the bug label May 29, 2019

epeters3 modified the milestone: June 2019 Submission May 29, 2019
