Multiple Group IDs for rolling_time_series #885

mmcux · 2021-08-19T21:42:04Z

In some cases you have multiple id columns which define a time series, i.e. factory and machine type (Factory A & Machine A; Factory B & Machine A; ...). You could now create a new single id column beforehand; this would be ideally just a tuple of the combinations. Unfortunately the roll_time_series() function can't handle them right now (at least it didn't work for me).
Therefore I adapted the code a little bit in this function to handle multiple id columns.

Moreover if the column id name is different than "id", i.e. "machine_name", this column is also present in the rolled dataframe after calling roll_time_series().

from tsfresh.utilities.dataframe_functions import roll_time_series
df_rolled = roll_time_series(df, column_id="machine_name", column_sort="time")

from tsfresh import extract_features
df_features = extract_features(df_rolled, column_id="id", column_sort="time")

A naive user (like me) would just pass the rolled dataframe directly into the extract_features function. With the above workflow the extract_features() function would

fail if the column "machine_name" contains strings
would run perfectly and calculate features for "machine_name" if the column contains numerical values -> this doesn't make sense for me as we group indirectly by this column anyway and calculating features for id's...well..

Therefore I dropped the initial id colum(s) in this PR after rolling. The id's are included in the new created "id" and will be easy accessible again after extracting the features from the rolled dataframe.

I am sure there is a lot of room for improvements in my code changes, but for my use cases it works and I would be happy to get your feedback and thoughts about this.

codecov-commenter · 2021-08-22T15:11:40Z

Codecov Report

Merging #885 (0b28ac7) into main (ff69073) will decrease coverage by 0.33%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##             main     #885      +/-   ##
==========================================
- Coverage   95.34%   95.00%   -0.34%     
==========================================
  Files          18       18              
  Lines        1867     1882      +15     
  Branches      368      377       +9     
==========================================
+ Hits         1780     1788       +8     
- Misses         47       50       +3     
- Partials       40       44       +4

Impacted Files	Coverage Δ
tsfresh/utilities/dataframe_functions.py	`92.75% <66.66%> (-3.08%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff69073...0b28ac7. Read the comment docs.

nils-braun · 2021-08-22T15:49:13Z

Great idea @mmcux! I like both your ideas (allow for multiple columns in the id as well as remove the created id, as it can be easily recreated).
However, I would propose not to allow for multiple id columns as parameter in the roll_time_series function. The reason is as following: all our "main" functions in tsfresh allow for the same input parameters (column_id, column_sort, ...) and it would be a bit inconsistent if one of them allows for a list and the others don't. We could of course now change all functions to allow for lists, but that would be a lot of work. Therefore I would propose to go the other way you suggested: make sure the roll_time_series function can actually work with a id column, which consists of tuples. There is not much to change in this case, basically one line. What do you think? If you like it, feel free to just cherry-pick the commit this link points to (I have also adjusted the tests) in your PR. We can think of having a dedicated example in the documentation for this use-case.

mmcux added 3 commits August 19, 2021 21:08

multiple group ids for roll_time_series

9a76dbd

added tests for mutliple group_ids in rolling

efc7d3e

style changes with pre-commit

0b28ac7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple Group IDs for rolling_time_series #885

Multiple Group IDs for rolling_time_series #885

mmcux commented Aug 19, 2021 •

edited

Loading

codecov-commenter commented Aug 22, 2021 •

edited

Loading

nils-braun commented Aug 22, 2021

Multiple Group IDs for rolling_time_series #885

Are you sure you want to change the base?

Multiple Group IDs for rolling_time_series #885

Conversation

mmcux commented Aug 19, 2021 • edited Loading

codecov-commenter commented Aug 22, 2021 • edited Loading

Codecov Report

nils-braun commented Aug 22, 2021

mmcux commented Aug 19, 2021 •

edited

Loading

codecov-commenter commented Aug 22, 2021 •

edited

Loading