What is the problem?
Currently a split is only allowed at times when no interval of data overlaps the split time. This gets around the problem of a plugin missing data from one of its dependencies that overlaps with an interval of another dependency. However, this solution adds a lot of complexity when aligning chunks for plugins that take multiple inputs, as well as for windowing computations.
Proposed solution
A possible solution would be to split inclusively and concatenate exclusively. The rule for splitting at any given time is to include overlapping intervals on both sides of the split. When concatenating two datasets, intervals are only taken from each chunk if they start within the chunk's half-open interval of validity. This means that intervals overlapping the split time will be processed twice, but if the chunk size is reasonable, the effect of one additional row on compute time should be negligible.
This approach would eliminate the need for a special plugin type for windowing operations, since all plugins can potentially compute overlapping chunks. Each plugin can define how much overlap it wants on each side, and the overlapping chunks would be processed in parallel. The potential extra overlap in the output would only be stripped when concatenating two adjacent chunks. Chunks include "start" and "end" fields, defining the half-open interval on which to select data when concatenating with an adjacent (and therefore potentially overlapping) chunk. The stripping can be done when adjacent chunks are collected into local memory for the next processing step, which ensures that all data is included in at least one chunk.
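To make the rule concrete, here is a minimal sketch of "split inclusively, concat exclusively". It is not taken from the existing code base: the `Chunk` class and the `split_inclusive` and `concat_exclusive` helpers are hypothetical names, and intervals are assumed to be plain half-open `(time, endtime)` tuples rather than the real data arrays.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: an interval is a half-open (time, endtime) pair.
Interval = Tuple[int, int]


@dataclass
class Chunk:
    """Hypothetical chunk: data plus its half-open validity interval [start, end)."""
    start: int
    end: int
    data: List[Interval]


def split_inclusive(chunk: Chunk, t: int) -> Tuple[Chunk, Chunk]:
    """Split at time t; intervals overlapping t are copied to both sides."""
    left = [iv for iv in chunk.data if iv[0] < t]   # starts before the split
    right = [iv for iv in chunk.data if iv[1] > t]  # ends after the split
    return Chunk(chunk.start, t, left), Chunk(t, chunk.end, right)


def concat_exclusive(chunks: List[Chunk]) -> Chunk:
    """Concatenate adjacent chunks, keeping from each chunk only the
    intervals that start inside that chunk's [start, end)."""
    chunks = sorted(chunks, key=lambda c: c.start)
    data: List[Interval] = []
    for c in chunks:
        data.extend(iv for iv in c.data if c.start <= iv[0] < c.end)
    return Chunk(chunks[0].start, chunks[-1].end, data)


original = Chunk(0, 100, [(5, 20), (40, 60), (70, 90)])
left, right = split_inclusive(original, 50)   # (40, 60) lands in both halves
merged = concat_exclusive([left, right])      # (40, 60) is kept only once
```

In this sketch the interval `(40, 60)` is processed in both halves, but only the copy from the left chunk survives concatenation, because it starts inside the left chunk's validity interval.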
Hej Yossi, thanks, but could you maybe add an image to explain? I think I got your idea, I just want to make sure. One question though:
the potential extra overlap in the output would only be stripped when concatenating two adjacent chunks. Chunks include "start" and "end" fields, defining the half-open interval on which to select data when concatenating with an adjacent (therefore potentially overlapping) chunk
Do you mean we write the overlapping data twice to disk and only remove the doubled data during loading?
@WenzDaniel indeed, an image would explain it better. Here are the two scenarios:
As for the scheme of delaying the cut of the outputs: if we go down that route, we should probably re-chunk before saving, and use that chance to add a validation step confirming that we are not losing data on the cut when concatenating for the rechunk step.
Alternatively, we can just cut the outputs on their validity interval right after running the compute method. We would probably want a flag to control this: start by validating everything, and if we see that no data would have been lost, remove the flag.
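A rough sketch of what such a validation step could look like is below. The function names and the `validate` flag are hypothetical, chunks are assumed to be `(start, end, data)` tuples of adjacent validity intervals, and intervals are assumed to be uniquely identified by their `(time, endtime)` pair, which may not hold for real data.

```python
from typing import List, Tuple

# Assumption: an interval is a half-open (time, endtime) pair.
Interval = Tuple[int, int]
# Assumption: a chunk is (start, end, data) with validity interval [start, end).
ChunkTuple = Tuple[int, int, List[Interval]]


def cut_to_validity(data: List[Interval], start: int, end: int) -> List[Interval]:
    """Keep only the intervals that start inside [start, end)."""
    return [iv for iv in data if start <= iv[0] < end]


def concat_with_validation(chunks: List[ChunkTuple], validate: bool = True) -> List[Interval]:
    """Cut each chunk to its validity interval and concatenate.

    With validate=True, raise if any interval seen in the inputs is missing
    from the output, i.e. if the cut would have lost data.
    """
    merged: List[Interval] = []
    for start, end, data in chunks:
        merged.extend(cut_to_validity(data, start, end))
    if validate:
        seen = {iv for _, _, data in chunks for iv in data}
        lost = seen - set(merged)
        if lost:
            raise ValueError(f"cut would lose intervals: {sorted(lost)}")
    return merged


# The overlapping interval (40, 60) is present in both chunks, kept once.
ok = concat_with_validation([
    (0, 50, [(5, 20), (40, 60)]),
    (50, 100, [(40, 60), (70, 90)]),
])
```

Once the check has run on enough data without raising, `validate=False` would skip it, which corresponds to removing the flag.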