Encoding ordinal variables #613

Open
david-cortes opened this issue Feb 3, 2023 · 7 comments

Comments

@david-cortes
Contributor

david-cortes commented Feb 3, 2023

Oftentimes, one wants to build linear models with ordinal variables as features (e.g. "rate on a scale from 1 to 5 ..."). One might treat these as numerical or categorical, but this loses some information.

It would be nice to have ordinal versions of some typical categorical encoders, such as mean/frequency encoders, that group by the condition x <= c instead of x == c.

@solegalli
Collaborator

Hi @david-cortes

Thanks for the suggestion!

I am not sure I understand what the output of the encoder should be. Could you give us an example?

@david-cortes
Contributor Author

For example, if there is a column taking possible values [1, 2, 3] and we want an ordinal mean encoding, the mapping would be:

1 -> mean(y[x <= 1])
2 -> mean(y[x <= 2])
3 -> mean(y[x <= 3])

i.e. a mean calculated over all rows whose value in the column being encoded is <= a threshold (so the calculation for a value of 2 also involves rows with a value of 1), instead of a mean calculated over each value separately.
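The mapping described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not an existing feature_engine API: the function name `ordinal_mean_mapping` and the toy data are assumptions made up for this example.

```python
from collections import defaultdict


def ordinal_mean_mapping(x, y):
    """Return {c: mean(y[x <= c])} for every distinct value c in x.

    Hypothetical helper, written only to illustrate the proposal.
    """
    # Per-level sums and counts of the target.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for xi, yi in zip(x, y):
        sums[xi] += yi
        counts[xi] += 1

    # Accumulate over the levels in ascending order, so that level c
    # includes every row with a value <= c.
    mapping = {}
    running_sum, running_count = 0.0, 0
    for level in sorted(sums):
        running_sum += sums[level]
        running_count += counts[level]
        mapping[level] = running_sum / running_count
    return mapping


# Toy data (made up): a column with values [1, 2, 3] and a numeric target.
x = [1, 1, 2, 2, 3, 3]
y = [0.0, 1.0, 1.0, 2.0, 4.0, 4.0]

mapping = ordinal_mean_mapping(x, y)
encoded = [mapping[xi] for xi in x]
print(mapping)  # {1: 0.5, 2: 1.0, 3: 2.0}
```

Note that the highest level always maps to the overall mean of y, since every row satisfies x <= max(x).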

@solegalli
Collaborator

Thank you.

@AnotherSamWilson

I second this. This type of encoding is very useful for linear modeling especially. It has an averaging effect on ordinal variables that is much more stable than simple one-hot encoding.

@solegalli if I get a pull request together along with examples of how it is beneficial, is this something the team would consider merging?

@solegalli
Collaborator

Hey @AnotherSamWilson

Thanks for joining this discussion.

Yes, we tend to be quite open towards new functionality.

I've never heard of or read about this type of encoding. Is there an article that you could link for more info? Or is this something that you do in practice, perhaps a common practice in some industry?

To make it meaningful for potential users, we would have to add, besides the functionality, a good user guide with examples of how to use this class, and explanations about what constitutes a good use case for this type of encoding. You seem to have it covered though, because you mention examples of how this would be beneficial. So go for it!

I look forward to the PR :)

@kylegilde
Contributor

Couldn't this be accomplished by using ArbitraryDiscretiser followed by MeanEncoder?

@david-cortes
Contributor Author

> Couldn't this be accomplished by using ArbitraryDiscretiser followed by MeanEncoder?

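No, because it'd require overlaps between rows: a row with a value of 1 has to contribute to the means for 1, 2 and 3, whereas a discretiser assigns each row to exactly one bin.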
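To make the overlap point concrete, here is a small sketch in plain Python that counts how many groups each row belongs to under the two schemes. It does not call feature_engine; the "disjoint" grouping simply stands in for what any binning step followed by a per-bin mean would use, and all names and data are made up for illustration.

```python
# Toy column (made up) with ordinal values 1, 2, 3.
x = [1, 1, 2, 2, 3, 3]
levels = sorted(set(x))

# Disjoint groups: each row index falls in exactly one group (x == c).
disjoint_groups = {c: [i for i, xi in enumerate(x) if xi == c] for c in levels}

# Overlapping groups, as proposed in this issue (x <= c).
cumulative_groups = {c: [i for i, xi in enumerate(x) if xi <= c] for c in levels}


def times_used(groups, n_rows):
    """Count in how many groups each row index appears."""
    return [sum(i in rows for rows in groups.values()) for i in range(n_rows)]


print(times_used(disjoint_groups, len(x)))    # [1, 1, 1, 1, 1, 1]
print(times_used(cumulative_groups, len(x)))  # [3, 3, 2, 2, 1, 1]
```

Under disjoint grouping no choice of bin edges lets a row count toward more than one mean, which is why the cumulative version needs its own logic.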
