add option to remove windows with poor data quality #1059
@jasminerienecker Thanks for the PR, I think I understand what you're trying to do. Isn't it easier to remove the missing datapoints / large gaps from your dataset before training?
@elephaint Going through the code, it seems the base_windows class assumes all timesteps are present. For example, if your data is at one-minute resolution but there is a 10-minute gap, the windows are created as if no timesteps were missing. I think this means you'd have to keep the rows with missing values in the dataset, but if there are longer chunks of missing data (i.e. most of the values in a window are NaN), this could interfere with model training. This solution is a way of keeping the temporal information while not training the model on windows where the majority of the data is unavailable. Please let me know if there's something I've missed, though!
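To illustrate the gap scenario described above, here is a minimal NumPy sketch, not code from this PR or from the library. It places observations on a regular one-minute grid so that a 10-minute gap stays visible as NaN rows with a zero mask. The helper name `fill_gaps` and the integer-minute timestamps are assumptions made for illustration; only the `available_mask` column name comes from the PR description.

```python
import numpy as np

def fill_gaps(minutes, values, step=1):
    """Hypothetical helper: put observations on a regular grid.

    Missing timesteps get NaN as value and 0.0 in available_mask,
    so the temporal spacing between observations is preserved.
    """
    minutes = np.asarray(minutes)
    values = np.asarray(values, dtype=float)
    # Regular grid from the first to the last observed timestamp.
    grid = np.arange(minutes.min(), minutes.max() + step, step)
    y = np.full(grid.shape, np.nan)
    # Position of each observation on the grid.
    idx = (minutes - minutes.min()) // step
    y[idx] = values
    available_mask = (~np.isnan(y)).astype(float)
    return grid, y, available_mask

# One-minute data with a gap between minute 2 and minute 13.
minutes = np.array([0, 1, 2, 13, 14])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
grid, y, available_mask = fill_gaps(minutes, values)
```

If the rows for minutes 3 to 12 were simply dropped instead, a row-based window of length 5 would treat minute 2 and minute 13 as adjacent steps.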
@jasminerienecker Thanks; I think this PR could be a generalization of #1036 (@jose-moralez). I have to think about the behaviour, and we would also have to include the changes in the other Base classes.
@marcopeix Now that I've had more time to think about it, I think this is a nice addition, wdyt?
This PR adds the parameter data_availability_threshold (defaults to 0.0 to maintain current functionality) to all models that inherit from the BaseWindows class. This parameter allows us to discard windows where the proportion of good-quality data points is below the threshold. The quality of a data point is determined by the corresponding value in the available_mask column.
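The filtering idea described above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the code in this PR: the function name `keep_windows` is hypothetical, the threshold is assumed to be a fraction in [0, 1], and availability is assumed to be the mean of `available_mask` over each window.

```python
import numpy as np

def keep_windows(available_mask, window_size, data_availability_threshold=0.0):
    """Hypothetical sketch: flag which sliding windows to keep.

    A window is kept when the fraction of available points in it is
    at least data_availability_threshold (assumed to lie in [0, 1]).
    """
    mask = np.asarray(available_mask, dtype=float)
    # All contiguous windows of length window_size, shape (n_windows, window_size).
    windows = np.lib.stride_tricks.sliding_window_view(mask, window_size)
    availability = windows.mean(axis=1)
    return availability >= data_availability_threshold

# A series with a four-step gap in the middle.
mask = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 1])
keep_default = keep_windows(mask, window_size=4)
keep_half = keep_windows(mask, window_size=4, data_availability_threshold=0.5)
```

With the default of 0.0 every window passes, which matches the stated goal of leaving current behaviour unchanged. With a threshold of 0.5, the three windows that fall mostly inside the gap are dropped.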
I currently need this functionality because my dataset has many large gaps and I don't want to train the model on them.
I have added a test to the end of the base_windows notebook.