roll time series only give "step by step window" for max_timeshift window, but have left alone windows from 1 to max_timeshift window? #1079

Open
heib6xinyu opened this issue Jun 14, 2024 · 1 comment

heib6xinyu commented Jun 14, 2024

The problem:
For the roll_time_series method, there seems to be an issue with the windows it creates (or I misunderstand the function). In short, it looks like only the windows of size max_timeshift are properly formed.
The issue is hard to detect, and the process of discovering it is hard to explain, so I will try my best.
I may be misunderstanding what roll_time_series does. But from its description, I expect the function to create continuous rolling windows of every size between min_timeshift and max_timeshift. For example, say I have products a to g over timesteps 0 to 5, with some features related to them, and I set min_timeshift to 1 and max_timeshift to 3. After running this data through roll_time_series, I should get a frame that looks like this:
For product a:
window of shift 1
id timestep features
(a, 1) 0 f1 f2...
(a, 1) 1 f1 f2...
(a, 2) 1 f1 f2...
(a, 2) 2 f1 f2...
(a, 3) 2 f1 f2...
(a, 3) 3 f1 f2...
(a, 4) 3 f1 f2...
(a, 4) 4 f1 f2...
(a, 5) 4 f1 f2...
(a, 5) 5 f1 f2...
window of shift 2
id timestep features
(a, 2) 0 f1 f2...
(a, 2) 1 f1 f2...
(a, 2) 2 f1 f2...
(a, 3) 1 f1 f2...
(a, 3) 2 f1 f2...
(a, 3) 3 f1 f2...
(a, 4) 2 f1 f2...
(a, 4) 3 f1 f2...
(a, 4) 4 f1 f2...
(a, 5) 3 f1 f2...
(a, 5) 4 f1 f2...
(a, 5) 5 f1 f2...
etc.
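A minimal sketch of this expectation in plain Python, for timesteps 0 to 5 as in the listing above. The helper `expected_windows` is hypothetical and written only for illustration; it is not part of tsfresh. Keys include the shift so that windows of different sizes ending at the same timestep stay distinct.

```python
def expected_windows(timesteps, min_timeshift, max_timeshift):
    """Enumerate every window of every shift between min and max (hypothetical helper)."""
    timesteps = list(timesteps)
    result = {}
    for shift in range(min_timeshift, max_timeshift + 1):
        # A window of shift s covers s + 1 consecutive timesteps.
        for end_pos in range(shift, len(timesteps)):
            key = (shift, timesteps[end_pos])
            result[key] = timesteps[end_pos - shift:end_pos + 1]
    return result


expected = expected_windows(range(6), min_timeshift=1, max_timeshift=3)
for (shift, end_time), rows in expected.items():
    print(f"shift {shift}, id (a, {end_time}): timesteps {rows}")
```

For six timesteps this yields 5 windows of shift 1, 4 of shift 2 and 3 of shift 3.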
However, what I found the function actually does is different from this expectation. It produces the following:
window of shift 1:
(a, 1) 0 f1 f2...
(a, 1) 1 f1 f2...
window of shift 2:
(a, 2) 0 f1 f2...
(a, 2) 1 f1 f2...
(a, 2) 2 f1 f2...
window of shift 3:
(a, 3) 0 f1 f2...
(a, 3) 1 f1 f2...
(a, 3) 2 f1 f2...
(a, 3) 3 f1 f2...
(a, 4) 1 f1 f2...
(a, 4) 2 f1 f2...
(a, 4) 3 f1 f2...
(a, 4) 4 f1 f2...
(a, 5) 2 f1 f2...
(a, 5) 3 f1 f2...
(a, 5) 4 f1 f2...
(a, 5) 5 f1 f2...
My guess is that the data from the window of shift 2 overwrites most of the window of shift 1's data, since the ids are the same (product id, end timestep). The only untouched data from the window of shift 1 is the following:
(a, 1) 0 f1 f2...
(a, 1) 1 f1 f2...
But I can't tell for sure, unless this is exactly what the function is intended to do. In that case I am confused about why it creates standalone windows smaller than max_timeshift. What is their purpose?
I cannot provide my dataset, but you can create dummy data as I described and run the following scripts to see what I mean.

from collections import defaultdict

# example_frame is the output of roll_time_series; its "id" column holds
# (product id, end timestep) tuples.
# data maps window size (row count) -> product id -> list of rolling frames.
data = defaultdict(lambda: defaultdict(list))
for (id_str, end_time), group in example_frame.groupby("id"):
    data[len(group)][id_str].append(group)

The above code puts the rolling frames of different sizes (in my example, the window of shift 1 has size 2, the window of shift 2 has size 3, and so on) into a dictionary. The dictionary has the following form:
data = {window_size: {id_str: [rolling frames of id_str that have size window_size]}}
Then, for each window size in data.keys(), you can run the following, replacing the 3 in data[3] with each window size (e.g. 2, 3, 4).

# length maps a number of rolling frames -> product ids that have that many frames.
length = {}
for key, frames in data[3].items():
    length.setdefault(len(frames), []).append(key)

This shows, for the products with window sizes 2, 3 and 4, how many rolling windows each one has. The length dictionary maps a number of rolling frames to the list of product ids that have that many rolling frames.
You will find that, except for data[4] (the window size produced by max_timeshift = 3), length.keys() is always dict_keys([1]), meaning every product id has only one rolling window of size 2 and one of size 3.
For example, if I run data[3]['a'], I get only this result:
[(a, 2) 0 f1 f2...
(a, 2) 1 f1 f2...
(a, 2) 2 f1 f2...]
But for data[4]['a'], I get the following result:
[(a, 3) 0 f1 f2...
(a, 3) 1 f1 f2...
(a, 3) 2 f1 f2...
(a, 3) 3 f1 f2... ,
(a, 4) 1 f1 f2...
(a, 4) 2 f1 f2...
(a, 4) 3 f1 f2...
(a, 4) 4 f1 f2... ,
(a, 5) 2 f1 f2...
(a, 5) 3 f1 f2...
(a, 5) 4 f1 f2...
(a, 5) 5 f1 f2...]
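The same counting check can be reproduced without a dataset. Below is a minimal sketch in plain Python; the `observed` dictionary is typed in by hand from the result reported above for product a (it is not output generated by tsfresh), and the variable names are only illustrative.

```python
from collections import Counter

# Windows for product "a" as reported above, keyed by (product id, end timestep).
# Hand-copied from the listing, not generated by tsfresh.
observed = {
    ("a", 1): [0, 1],
    ("a", 2): [0, 1, 2],
    ("a", 3): [0, 1, 2, 3],
    ("a", 4): [1, 2, 3, 4],
    ("a", 5): [2, 3, 4, 5],
}

# Count how many windows exist for each window size (number of rows).
windows_per_size = Counter(len(rows) for rows in observed.values())
for size, count in sorted(windows_per_size.items()):
    print(f"window size {size}: {count} window(s)")
```

Sizes 2 and 3 each occur once, while size 4 (from max_timeshift = 3) occurs three times, which matches what the length dictionary shows.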

Anything else we need to know?:

Environment:

  • Python version: 3.10
  • Operating System: Windows
  • tsfresh version: 0.20.1
  • Install method (conda, pip, source): pip install
@heib6xinyu heib6xinyu added the bug label Jun 14, 2024
@heib6xinyu heib6xinyu changed the title from "roll time series not performing as description" to "roll time series only give "step by step window" for max_timeshift window, but have left alone windows from 1 to max_timeshift window?" Jun 14, 2024
@nils-braun
Collaborator

Hi @heib6xinyu !
Thank you very much for providing such a nice example alongside all the code you used to generate it.
However, I think that there is indeed a misunderstanding about what the function does.
Everything I describe below can also be found in our docs here and especially here, which contains a fully written-out example of the rolling mechanism (actually using data similar to yours!).

But in short: as the name suggests, the function "rolls" a window over your data. It will try to make the window always as large as possible (*) until it either reaches the end of the data or until it reaches the maximum timeshift parameter. Every window smaller than the minimum timeshift window will be removed. It will not create all possible windows between min and max timeshift.
So the behaviour you are seeing is actually expected.

(*) if you ask why this is the case: feature extraction and any further ML after that works best/makes the most sense if all windows have the same size. So in principle it would be best to have the minimum timeshift parameter set to the maximum timeshift parameter. As this might be a bit "wasteful" (we would throw away a lot of data) we give users the option to choose both parameters independently.
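To make the rule above concrete, here is a minimal sketch in plain Python that mimics the described behaviour for a single id with timesteps 0 to 5. The helper `rolled_windows` is hypothetical and written only for illustration; it is not tsfresh's implementation, and it assumes that a window is kept only when it has more rows than min_timeshift, which is consistent with the windows reported in the issue.

```python
def rolled_windows(timesteps, min_timeshift, max_timeshift):
    """Mimic the rolling rule described above for one id (hypothetical helper)."""
    timesteps = list(timesteps)
    windows = {}
    for end_pos, end_time in enumerate(timesteps):
        # Grow the window as far back as possible, capped at max_timeshift.
        start_pos = max(0, end_pos - max_timeshift)
        window = timesteps[start_pos:end_pos + 1]
        # Assumption: windows that are too small are dropped.
        if len(window) > min_timeshift:
            windows[end_time] = window
    return windows


windows = rolled_windows(range(6), min_timeshift=1, max_timeshift=3)
for end_time, window in windows.items():
    print(("a", end_time), window)
```

Each end timestep yields exactly one window, so the short windows only appear at the start of the series, where less than max_timeshift of history is available.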
