Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple Group IDs for rolling_time_series #885

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mmcux
Copy link

@mmcux mmcux commented Aug 19, 2021

In some cases you have multiple id columns which define a time series, i.e. factory and machine type (Factory A & Machine A; Factory B & Machine A; ...). You could now create a new single id column beforehand; this would be ideally just a tuple of the combinations. Unfortunately the roll_time_series() function can't handle them right now (at least it didn't work for me).
Therefore I adapted the code a little bit in this function to handle multiple id columns.

Moreover if the column id name is different than "id", i.e. "machine_name", this column is also present in the rolled dataframe after calling roll_time_series().

from tsfresh.utilities.dataframe_functions import roll_time_series
df_rolled = roll_time_series(df, column_id="machine_name", column_sort="time")

from tsfresh import extract_features
df_features = extract_features(df_rolled, column_id="id", column_sort="time")

A naive user (like me) would just pass the rolled dataframe directly into the extract_features function. With the above workflow the extract_features() function would

  • fail if the column "machine_name" contains strings
  • would run perfectly and calculate features for "machine_name" if the column contains numerical values -> this doesn't make sense for me as we group indirectly by this column anyway and calculating features for id's...well..

Therefore I dropped the initial id colum(s) in this PR after rolling. The id's are included in the new created "id" and will be easy accessible again after extracting the features from the rolled dataframe.

I am sure there is a lot of room for improvements in my code changes, but for my use cases it works and I would be happy to get your feedback and thoughts about this.

@codecov-commenter
Copy link

codecov-commenter commented Aug 22, 2021

Codecov Report

Merging #885 (0b28ac7) into main (ff69073) will decrease coverage by 0.33%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #885      +/-   ##
==========================================
- Coverage   95.34%   95.00%   -0.34%     
==========================================
  Files          18       18              
  Lines        1867     1882      +15     
  Branches      368      377       +9     
==========================================
+ Hits         1780     1788       +8     
- Misses         47       50       +3     
- Partials       40       44       +4     
Impacted Files Coverage Δ
tsfresh/utilities/dataframe_functions.py 92.75% <66.66%> (-3.08%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff69073...0b28ac7. Read the comment docs.

@nils-braun
Copy link
Collaborator

Great idea @mmcux! I like both your ideas (allow for multiple columns in the id as well as remove the created id, as it can be easily recreated).
However, I would propose not to allow for multiple id columns as parameter in the roll_time_series function. The reason is as following: all our "main" functions in tsfresh allow for the same input parameters (column_id, column_sort, ...) and it would be a bit inconsistent if one of them allows for a list and the others don't. We could of course now change all functions to allow for lists, but that would be a lot of work. Therefore I would propose to go the other way you suggested: make sure the roll_time_series function can actually work with a id column, which consists of tuples. There is not much to change in this case, basically one line. What do you think? If you like it, feel free to just cherry-pick the commit this link points to (I have also adjusted the tests) in your PR. We can think of having a dedicated example in the documentation for this use-case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants