Gap filling with NaN values added to Level 2 #283
base: develop
common_diff = pd.Timedelta(pd.Series(time_diffs).mode()[0])

# Determine gap filled index
full_time_range = pd.date_range(start=min_date,
How will it behave if there is a relative time shift in the datetime indexes?
Let's say we have a time series whose indexes come from two periods, 01 and 02:
period_01_indexes = [2023-01-03T10:00, 2023-01-03T11:00, ...., 2023-01-05T02:00]
period_02_indexes = [2023-06-01T14:10, 2023-06-01T15:10, ...., 2024-07-12T18:10]
In this case, there will be a gap between 2023-01-05T02:00 and 2023-06-01T14:10. Both periods have hourly sample rates, but the second one has an offset of 10 minutes.
The indexes generated by pd.date_range will therefore ignore the offset in the second period.
- How will xarray.Dataset.reindex behave when the indexes are slightly off?
- Is it an irrelevant case? Why?
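To make the question concrete, here is a minimal sketch of the offset scenario. It uses a pandas Series with made-up values and shortened periods rather than the PR's xarray dataset; as far as I know, xarray's reindex follows the same label-matching rules and accepts the same method and tolerance arguments, but that is an assumption worth checking against the real data.

```python
import pandas as pd

# Two hourly periods; the second one is offset by 10 minutes (values are made up).
period_01 = pd.date_range("2023-01-03T10:00", "2023-01-03T12:00", freq="60min")
period_02 = pd.date_range("2023-01-03T15:10", "2023-01-03T17:10", freq="60min")
index = period_01.append(period_02)
series = pd.Series(range(len(index)), index=index, dtype="float64")

# Regular hourly grid anchored at the first timestamp.
full_time_range = pd.date_range(index.min(), index.max(), freq="60min")

# Exact-label reindex: samples at hh:10 are not on the grid, so they are lost.
exact = series.reindex(full_time_range)

# Nearest-neighbour reindex with a tolerance keeps the shifted samples,
# but relabels them onto the grid (15:10 is stored at 15:00).
nearest = series.reindex(
    full_time_range, method="nearest", tolerance=pd.Timedelta("15min")
)

print(exact.notna().sum(), nearest.notna().sum())
```

With an exact match, every sample from the shifted period is replaced by NaN, so the offset case is not harmless. A tolerance avoids the data loss at the cost of moving timestamps by up to the tolerance.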
I do think that it is a relevant issue.
We have cases where daily winter transmissions are taken as isolated hourly values and surrounded by NaN in the resampling process:
#244
We should have better handling of mixed sample rates, potentially with a time_bnds variable, which I have seen in many other CF-compliant datasets.
@@ -217,6 +217,7 @@ def toL2(

ds = clip_values(ds, vars_df)
ds = fill_gaps(ds)
Are the NaN values still available in the output files, or are they removed later in the pipeline?
I also think that it should be handled in the write rather than in the processing.
Actually, I now remember that resample_dataset is called when writing the L2 data:
pypromice/src/pypromice/process/get_l2.py
Lines 42 to 48 in 3357e62
# Write out level 2
if outpath is not None:
    if not os.path.isdir(outpath):
        os.mkdir(outpath)
    if aws.L2.attrs['format'] == 'raw':
        prepare_and_write(aws.L2, outpath, aws.vars, aws.meta, '10min')
    prepare_and_write(aws.L2, outpath, aws.vars, aws.meta, '60min')
I thought this would fill the gaps within L2_raw and L2_tx data with NaN but apparently it doesn't!
So maybe it could be fixed there?
I can see that I did not make another resample after joining the raw and tx L2 data:
pypromice/src/pypromice/process/join_l2.py
Lines 101 to 102 in 3357e62
# Resample to hourly, daily and monthly datasets and write to file
prepare_and_write(all_ds, outpath, variables, metadata, resample = False)
So there may be some gaps between the raw and tx data. For instance, for a station that failed in Jan 2024 and was visited in Jun 2024, we'll have raw data until Jan 2024 and tx data from Jun 2024.
Also note that we use the attribute aws.L2.attrs['format'] to determine whether a 10 min resample is needed on top of an hourly resample. aws.L2.attrs['format'] is inherited from aws.L1A.attrs['format'], which is inherited from aws.L1[-1].attrs['format']: the format of the last logger file. If that last file is "STM", then no 10 min data is produced, even though there has been 10 min data in older logger files.
The use of .mode is also difficult to control: when both raw and STM data are present, the result depends on the number of occurrences of each sample rate.
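A small sketch of that sensitivity, using made-up timestamps. most_common_step is a hypothetical helper that mirrors the common_diff line in the diff; it is not a function from pypromice.

```python
import pandas as pd


def most_common_step(index: pd.DatetimeIndex) -> pd.Timedelta:
    """Return the most frequent spacing between consecutive timestamps."""
    time_diffs = pd.Series(index).diff().dropna()
    return pd.Timedelta(time_diffs.mode()[0])


# Six hourly samples, standing in for the lower-rate data.
hourly = pd.date_range("2024-01-01T06:00", periods=6, freq="60min")

# Case 1: only four 10-minute samples precede them, so hourly steps dominate.
few_raw = pd.date_range("2024-01-01T00:00", periods=4, freq="10min")
step_few = most_common_step(few_raw.append(hourly))

# Case 2: ten 10-minute samples precede them, so 10-minute steps dominate.
many_raw = pd.date_range("2024-01-01T00:00", periods=10, freq="10min")
step_many = most_common_step(many_raw.append(hourly))

print(step_few, step_many)
```

The same station can therefore end up on a 60-minute or a 10-minute grid depending only on how many samples of each kind happen to be in the file.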
Note that these gaps do not exist in the level 3 files, potentially because there is a resample in join_l3 after we merge data covering different periods:
pypromice/src/pypromice/process/join_l3.py
Lines 531 to 536 in 3357e62
v = pypromice.resources.load_variables(variables)
m = pypromice.resources.load_metadata(metadata)
if outpath is not None:
    prepare_and_write(l3_merged, outpath, v, m, "60min")
    prepare_and_write(l3_merged, outpath, v, m, "1D")
    prepare_and_write(l3_merged, outpath, v, m, "M")
I have added a gap filling step so that Level 2 data should not have any gaps. Instead, gaps should now be filled with NaN values. The following steps occur:
I found that these gaps occur when stations are being visited: the station is offline while maintenance is carried out, in most cases only for a couple of hours. However, no NaN values were present for these periods in the Level 2 dataset - the dataset just jumped from the hour the station went offline to the hour it came back online.
I am open to suggestions for where this functionality should go. For now, it is in the L1toL2 processing, as a step at the end before the Level 2 dataset is returned. However, another option could be for this to go in the write function.
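For readers who want to try the idea outside the pipeline, here is a minimal sketch of the gap-filling approach, pieced together from the two diff fragments shown in the conversation. It is not the PR's implementation: fill_gaps_sketch and the t_air column are hypothetical, the values are made up, and it operates on a pandas DataFrame instead of an xarray dataset.

```python
import pandas as pd


def fill_gaps_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Reindex onto a regular time grid so missing timestamps become NaN rows."""
    # Most frequent spacing between consecutive timestamps.
    time_diffs = pd.Series(df.index).diff().dropna()
    common_diff = pd.Timedelta(time_diffs.mode()[0])

    # Determine the gap-filled index.
    full_time_range = pd.date_range(
        start=df.index.min(), end=df.index.max(), freq=common_diff
    )
    # Timestamps absent from the original index are filled with NaN.
    return df.reindex(full_time_range)


# Hourly data with a two-hour maintenance gap (11:00 and 12:00 are missing).
before = pd.date_range("2024-06-01T08:00", periods=3, freq="60min")
after = pd.date_range("2024-06-01T13:00", periods=3, freq="60min")
df = pd.DataFrame(
    {"t_air": [-5.0, -4.8, -4.5, -3.9, -3.7, -3.6]},
    index=before.append(after),
)

filled = fill_gaps_sketch(df)
print(filled)
```

This works for the short maintenance gaps described above. It shares the limitations raised in the review: the grid is anchored at the first timestamp, and the step is whatever spacing is most frequent in the data.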