Group by calculation inside the pipe #897

Open
MislavSag opened this issue Feb 8, 2023 · 0 comments

Hi,

I have recently been trying to build an mlr3 pipeline (graph) for predicting financial outcomes (financial time series).

In the preprocessing step, I often need to apply a function on a group-by basis. More concretely, I need to apply a function by month.

I have already opened an issue with an example, winsorization by groups: mlr-org/mlr3pipelines#583
In that example, I want to winsorize the data for every month (or every quarter). It doesn't make much sense to winsorize the data across the time dimension, so I need a month column (or a quarter column). But the month column is not a feature, and it is not a target. I can set the role of that column to group at the beginning, but how should I use it then? I can get the group column if I use .train_task in the preprocessing pipe, but I actually need the .train_dt method.

The problem is more general because instead of winsorization, I could use scaling by group or any other function.
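
To make the group-wise operation concrete, here is a minimal sketch of what I mean, written in plain data.table outside of any pipeline. It is not mlr3pipelines code: the helper name winsorize, the toy values, and the 20%/80% cut-offs are assumptions made up for illustration only.

```r
library(data.table)

# Hypothetical helper: clip x to its lower and upper empirical quantiles.
winsorize <- function(x, probs = c(0.01, 0.99)) {
  q <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Toy data (made-up values): two months, one obvious outlier in each.
dt <- data.table(
  month = rep(c("2023-01", "2023-02"), each = 5),
  x     = c(1, 2, 3, 4, 100, 10, 20, 30, 40, 1000)
)

# Compute the quantiles within each month, not across the whole time dimension.
dt[, x_wins := winsorize(x, probs = c(0.2, 0.8)), by = month]

# The same pattern works for any other function, e.g. scaling by group.
dt[, x_scaled := (x - mean(x)) / sd(x), by = month]
```

The point is that the month column is needed only as the `by` variable; it should not itself be transformed or passed to the learner as a feature.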

I would appreciate your recommendation: what is the best way to implement the pipe described above?

The solutions I have thought about:

  1. Set the month (or, more generally, date) column to the group role. Then, if a group is set, apply the function (say, scaling) on a group-by basis.
  2. Use the month (or date) column as a feature, but exclude it from the other preprocessing operations (for example, we don't want to scale dates).
  3. Set the row ids to the date and use them for grouping.

EDIT:

Maybe I can put the question more generally: what approach do you recommend if we want to use some columns in preprocessing, but don't want to use them as features or give them other column roles?

I am aware of the mlr3temporal package, which inherits from the Task class to create a new TaskForecast class. Maybe I should use this task in my case? And if I had both id and date columns, should I create my own task (TaskPanel, for example) by inheriting from Task?
