Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT add the Frequency Encoder #707

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

coconutattitude
Copy link

Addresses #706 issue.
Here's a first iteration of the class and it's a proof of concept made during EuroSciPy 2023. Documentation and tests are needed. I'll add quantile strategies in the future. What do you think about this proof of concept?

@Vincent-Maladiere
Copy link
Member

Thanks for this feature, @coconutattitude
Here is a quick roadmap to improve this PR before engaging in more formal reviews and discussions.

  1. Use pre-commit to linter and standardize your code (it is usually run during git commit).
    To do so:

    pip install -e '.[dev]'

    to install the dependencies used for development, then:

    pre-commit init

    to initialize pre-commit.

  2. Add a get_feature_name_out() method.
    This is useful for inspecting the data used to fit transformers.
    You can take inspiration from the GapEncoder method or DatetimeEncoder method

  3. Check for input types, i.e. what happens when X is a numpy array? What happens when it's a pandas dataframe?
    Most encoders/estimators across skrub use the check_input() utils function.

    if isinstance(X, pd.DataFrame):
    self.col_names_ = X.columns.to_list()
    else:
    self.col_names_ = None
    X = check_input(X)

  4. Add basic tests in a dedicated _test_frequency_encoder.py file test. This will help discover edge cases and situations in which the estimator would fail.

  5. Convert your example notebook into a Python file using jupytext

    pip install jupytext
    

    then:

    jupytext --set-formats ipynb,py:percent example/08_frequence_encoder.ipynb
    jupytext --sync example/08_frequence_encoder.ipynb
    

    We prefer Python files over notebooks because we can review them in PR more easily, and they are more compact.

  6. Enrich your example by adding a regression/classification or even clustering task. Reproducing a similar use case to the one you mentioned IRL with IP addresses could bring value.

@Vincent-Maladiere Vincent-Maladiere changed the title add frequency encoder FEAT add the Frequency Encoder Aug 23, 2023
@Vincent-Maladiere
Copy link
Member

Hi @coconutattitude, are you planning to continue this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants