Use Feature Selection with Successive Halving and progressive_val_score #1202

IndeedPete · 2023-03-16T16:15:33Z

Hi, I'm a bit confused as to how the selectors from feature_selection are supposed to be combined with the progressive_val_score function and successive halving. I cannot manually iterate over the data before passing it to progressive_val_score, and transforming it and then putting it back into a numpy array or pandas dataset seems hacky as well. I tried adding the selector to the model pipelines under evaluation, but that gives me an error, either right at the start or after a few halvings (see the last code block at the bottom). I'm pasting my code below so you can get an idea of what I'm trying to do.

I have an encoded pandas dataset with 61,000 rows and 62 columns. I train my selector to keep around 25% of the features:

selector = feature_selection.SelectKBest(
    similarity = stats.PearsonCorr(),
    k = int(np.round(label_encoded_dataset_features.axes[1].size * 0.25))
)

for x, y in stream.iter_pandas(label_encoded_dataset_features, label_encoded_dataset_target):
    selector = selector.learn_one(x, y)

Then I'm creating some model configurations, selector included in the pipelines:

arf_params = {
    ...
}

scaled_arf_models = utils.expand_param_grid(
    (
        selector |
        preprocessing.StandardScaler() |
        forest.ARFRegressor()
    ), {
        "ARFRegressor" : arf_params
    }    
)

...

models = arf_models + ...

And here comes the model selector and evaluation:

model_selector = model_selection.SuccessiveHalvingRegressor(
    models,
    metric = metrics.MAE(),
    budget = 50000,
    eta = 2,
    verbose = True
)

evaluate.progressive_val_score(
    model = model_selector,
    dataset = stream.iter_pandas(label_encoded_dataset_features, label_encoded_dataset_target),
    metric = metrics.MAE(),
    print_every = 200
)

This gives me an error after a while:

[1]	350 removed	350 left	7 iterations	budget used: 4900	budget left: 45100	best MAE: 19.378215
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[41], line 9
      1 model_selector = model_selection.SuccessiveHalvingRegressor(
      2     models,
      3     metric = metrics.MAE(),
   (...)
      6     verbose = True
      7 )
----> 9 evaluate.progressive_val_score(
     10     model = model_selector,
     11     dataset = stream.iter_pandas(label_encoded_dataset_features, label_encoded_dataset_target),
     12     metric = metrics.MAE(),
     13     print_every = 200
     14 )

File .venv\lib\site-packages\river\evaluate\progressive_validation.py:368, in progressive_val_score(dataset, model, metric, moment, delay, print_every, show_time, show_memory, **print_kwargs)
    355 checkpoints = iter_progressive_val_score(
    356     dataset=dataset,
    357     model=model,
   (...)
    363     measure_memory=show_memory,
    364 )
    366 active_learning = utils.inspect.isactivelearner(model)
--> 368 for checkpoint in checkpoints:
...
--> 363     raise ValueError("Sample larger than population or is negative")
    364 result = [None] * k
    365 setsize = 21        # size of a small set minus size of an empty list

**ValueError: Sample larger than population or is negative**

However, when I remove the selector from the pipelines, it works as intended. Am I doing something wrong? Could it be a bug? What would be a better approach?

Thank you for your input!

smastelini · 2023-03-16T16:25:29Z

Maintainer

Hi @IndeedPete. Without delving more into your problem I cannot say for sure what's happening.

But the error might be related to the random forest. The forest samples subsets of the features to build the trees. As the number of features decreases, this might affect the execution.

On the other hand, the root of the error could also come from progressive validation.

(PS: opening an issue is the preferred way to report this kind of problem).

Could you share some data to make your problem easier to reproduce?

4 replies

IndeedPete Mar 17, 2023
Author

Hey, thanks for your quick reply. As I said, I wasn't sure if I was using it correctly in the first place. Unfortunately, I'm not at liberty to share my data. However, I created an MWE based on some generated data, using Python 3.8.9, River 0.15.0, scikit-learn 1.2.1, and numpy 1.24.2. It will reproduce the error unless you comment out the "feature_selector |" line:

from sklearn import datasets
from river import feature_selection, stats, stream, utils, preprocessing, forest, metrics, drift, model_selection, evaluate

X, y = datasets.make_regression(
    n_samples = 100,
    n_features = 10,
    n_informative = 2,
    random_state = 42
)

feature_selector = feature_selection.SelectKBest(
    similarity = stats.PearsonCorr(),
    k = 2
)

for xi, yi in stream.iter_array(X, y):
    feature_selector = feature_selector.learn_one(xi, yi)

models = utils.expand_param_grid(
    (
        feature_selector |
        preprocessing.StandardScaler() |
        forest.ARFRegressor()
    ), {
        "ARFRegressor" : {
            "n_models" : [10, 25, 50, 75, 100],
            "max_features" : [0.1, 0.25, 0.5, 0.75, None, "sqrt", "log2"],
            "metric" : [metrics.MAE()],
            "disable_weighted_vote" : [True, False],
            "drift_detector" : [drift.ADWIN()],
            "warning_detector" : [drift.ADWIN()],
            "grace_period" : [50, 100, 250, 500, 1000],
            "seed" : [42]
        }
    }  
)

model_selector = model_selection.SuccessiveHalvingRegressor(
    models,
    metric = metrics.MAE(),
    budget = 50000,
    eta = 2,
    verbose = True
)

evaluate.progressive_val_score(
    model = model_selector,
    dataset = stream.iter_array(X, y),
    metric = metrics.MAE(),
    print_every = 200
)

Thanks again!

smastelini Mar 17, 2023
Maintainer

Thanks for the MWE, @IndeedPete! I will try to give it a look during the weekend :D

smastelini Mar 19, 2023
Maintainer

Hi! Your problem seems to be caused by the combination of the choice of k for the feature selector and tuning the max_features parameter of ARF. I tried increasing the value of k, which was set to 2. It works when k=10 (the total number of features). With k=5 the problem appears again. I was able to run your example with k=5 when not tuning max_features at all.

Since you have only two features, or only a handful, getting subsets of features without replacement (as random forests do) can fail. The code runs fine when both the number of features and k are increased. It also works when k=2 and n_features=10, as long as you don't use a regressor like ARF, which relies on subsetting the feature set.
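For readers following along: the message in the traceback above matches the one raised by Python's standard `random.sample` when asked for more items than the population holds. The snippet below is only a minimal sketch of that mechanism using the standard library. It is not River's actual code, and the feature names and the `draw_subset` helper are made up for illustration.

```python
import random

# Hypothetical feature names standing in for what a k=2 selector lets through.
selected = ["x3", "x7"]
rng = random.Random(42)


def draw_subset(features, n):
    """Draw n distinct features, i.e. sample without replacement."""
    return rng.sample(features, n)


# Asking for 2 out of 2 features is fine.
ok = draw_subset(selected, 2)

# Asking for 3 out of 2 features cannot be satisfied without replacement.
try:
    draw_subset(selected, 3)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(sorted(ok))
print(error_message)
```

If this is indeed what happens inside the forest, it would explain why the error goes away once the requested subset size can no longer exceed the number of features reaching the trees.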

Just an observation: there is no need to "pre-train" the feature selector to pass it to the grid expansion. All the pipeline components are trained at the same time, when calling learn_one. In fact, a fresh copy of the selector is created for each parameter combination.

IndeedPete Mar 23, 2023
Author

Thanks for the heads-up! I managed to get it to work in my original code. Also thanks for the tip regarding pretraining!
