Gap filling with NaN values added to Level 2 #283

Open
wants to merge 1 commit into
base: develop

PennyHow
Member

I have added a gap-filling step so that Level 2 data no longer has missing timestamps. Gaps are now filled with NaN values instead. The step does the following:

  • Determine the datetime range of the dataset
  • Determine the most common time interval (e.g. 10-minute or hourly)
  • Generate an index with no gaps
  • Reindex the dataset to the new index
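
The four steps above can be sketched roughly as follows. This is a minimal illustration only, not the code in this PR: it uses a pandas DataFrame as a stand-in for the xarray Dataset, and the function name `fill_gaps_sketch` and the column `t_u` are placeholders chosen for the example.

```python
import pandas as pd


def fill_gaps_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Reindex df onto a gap-free time index; missing rows become NaN.

    Hypothetical helper for illustration, assuming a sorted DatetimeIndex.
    """
    # 1. Datetime range of the dataset
    min_date, max_date = df.index.min(), df.index.max()
    # 2. Most common interval between consecutive timestamps
    time_diffs = df.index.to_series().diff().dropna()
    common_diff = time_diffs.mode()[0]
    # 3. Index with no gaps
    full_index = pd.date_range(start=min_date, end=max_date, freq=common_diff)
    # 4. Reindex; timestamps absent from df are filled with NaN
    return df.reindex(full_index)


# Hourly data with a maintenance gap (03:00 to 05:00 missing)
times = pd.to_datetime(
    ["2024-06-01 00:00", "2024-06-01 01:00", "2024-06-01 02:00",
     "2024-06-01 06:00", "2024-06-01 07:00"]
)
df = pd.DataFrame({"t_u": [1.0, 1.5, 2.0, 4.0, 4.5]}, index=times)
filled = fill_gaps_sketch(df)
```

In this toy case `filled` has eight hourly rows, three of which hold NaN for the period the station was offline.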

I found that gaps occur when stations are visited: the station is offline while maintenance is carried out, usually for only a couple of hours. These gaps were not represented as NaN values in the Level 2 dataset; the time index simply jumped from the hour the station went offline to the hour it came back online.

I am open to suggestions for where this functionality should go. For now, it is in the L1toL2 processing as a step at the end before the Level 2 dataset is returned. However, another option could be for this to go in the write function.

PennyHow requested a review from ladsmund on August 12, 2024, 13:14
common_diff = pd.Timedelta(pd.Series(time_diffs).mode()[0])

# Determine gap filled index
full_time_range = pd.date_range(start=min_date,
Contributor

ladsmund, Aug 20, 2024

How will it behave if there is a relative time shift in the datetime indexes?

Let's say we have a time series where the indices come from two periods, 01 and 02:

period_01_indexes = [2023-01-03T10:00,  2023-01-03T11:00, ....,2023-01-05T02:00]
period_02_indexes = [2023-06-01T14:10,  2023-06-01T15:10, ....,2024-07-12T18:10]

Here there is a gap between 2023-01-05T02:00 and 2023-06-01T14:10. Both periods have hourly sample rates, but the second has an offset of 10 minutes.

The index generated by pd.date_range will ignore the offset in the second period.

  • How will xarray.Dataset.reindex behave when the indexes are slightly off?
  • Is it an irrelevant case? Why?
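
To make the question concrete, here is a small sketch of the offset scenario. It uses a pandas Series as a stand-in for the Dataset, with a shortened second period. With the default exact label matching, none of the offset timestamps line up with the generated hourly grid. The `method="nearest"` and `tolerance` arguments shown afterwards are one possible way to snap slightly-off timestamps onto the grid; `xarray.Dataset.reindex` accepts arguments of the same names, but whether relabelling samples by 10 minutes is acceptable here is an open question, not something this PR does.

```python
import pandas as pd

hour = pd.Timedelta(hours=1)

# Two hourly periods; the second is shifted by 10 minutes
period_01 = pd.date_range("2023-01-03 10:00", "2023-01-05 02:00", freq=hour)
period_02 = pd.date_range("2023-06-01 14:10", "2023-06-02 18:10", freq=hour)
index = period_01.append(period_02)
s = pd.Series(1.0, index=index)

# Gap-free hourly grid anchored on the first timestamp (on the hour)
full_index = pd.date_range(index.min(), index.max(), freq=hour)

# Exact matching (the default): every period_02 value is lost
exact = s.reindex(full_index)

# Nearest matching within 15 minutes: period_02 values land on the grid,
# but are relabelled from HH:10 to HH:00
near = s.reindex(full_index, method="nearest", tolerance=pd.Timedelta(minutes=15))

n_kept_exact = int(exact.notna().sum())  # only the period_01 samples
n_kept_near = int(near.notna().sum())    # period_01 and period_02 samples
```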

Member

I do think that it is a relevant issue.

We have cases where daily winter transmissions are taken as isolated hourly values and surrounded by NaN in the resampling process:
#244

We should have better handling of mixed sample rates, potentially with a time_bnds variable, which I have seen in many other CF-compliant datasets.

@@ -217,6 +217,7 @@ def toL2(


ds = clip_values(ds, vars_df)
ds = fill_gaps(ds)
Contributor

Are the NaN values still present in the output files, or are they removed later in the pipeline?

write.py#L166 Lcsv = Lx.to_dataframe().dropna(how="all")

write.py#L471 df = df.dropna(how="all")
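
A quick sketch of why this matters, again with a pandas DataFrame as a stand-in and placeholder column names (`t_u`, `p_u`): rows inserted by gap filling are NaN in every column, so a later `dropna(how="all")` removes exactly those rows, while rows with at least one valid value survive.

```python
import numpy as np
import pandas as pd

times = pd.to_datetime(["2024-06-01 00:00", "2024-06-01 01:00", "2024-06-01 04:00"])
df = pd.DataFrame(
    {"t_u": [1.0, np.nan, 3.0], "p_u": [800.0, 801.0, 802.0]},
    index=times,
)

# Gap filling: 02:00 and 03:00 are added as all-NaN rows
full_index = pd.date_range(times.min(), times.max(), freq=pd.Timedelta(hours=1))
filled = df.reindex(full_index)

# The kind of call made at write time: all-NaN rows are dropped again,
# while the partially valid 01:00 row is kept
written = filled.dropna(how="all")
```

If the write step keeps these calls, the gap-filled timestamps would not reach the output files in this toy case.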

Member

I also think that it should be handled in the write step rather than in the processing.

Actually, I now remember that resample_dataset is called when writing the L2 data:

# Write out level 2
if outpath is not None:
    if not os.path.isdir(outpath):
        os.mkdir(outpath)
    if aws.L2.attrs['format'] == 'raw':
        prepare_and_write(aws.L2, outpath, aws.vars, aws.meta, '10min')
    prepare_and_write(aws.L2, outpath, aws.vars, aws.meta, '60min')

I thought this would fill the gaps within L2_raw and L2_tx data with NaN but apparently it doesn't!
So maybe it could be fixed there?

I can see that I did not make another resample after joining the raw and tx L2 data:

# Resample to hourly, daily and monthly datasets and write to file
prepare_and_write(all_ds, outpath, variables, metadata, resample = False)

So there may be some gaps between the raw and tx data. For instance, for a station that failed in Jan 2024 and was visited in Jun 2024, we'll have raw data until Jan 2024 and tx data from Jun 2024.

Also note that we use the attribute aws.L2.attrs['format'] to determine whether a 10 min resample is needed on top of an hourly resample. aws.L2.attrs['format'] is inherited from aws.L1A.attrs['format'], which is inherited from aws.L1[-1].attrs['format']: the format of the last logger file. If that last file is "STM", then no 10 min data is produced, even though there was 10 min data in older logger files.

The use of .mode is also difficult to control: when both raw and STM data are present, the result depends on the number of occurrences of each sample rate.
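
As a small illustration of that dependence (a sketch with made-up timestamps, and `most_common_interval` as a hypothetical helper mirroring the `.mode()` call in the diff): the same one-hour block of 10-minute data yields a different "most common" interval depending on how much hourly data follows it.

```python
import pandas as pd


def most_common_interval(index: pd.DatetimeIndex) -> pd.Timedelta:
    """Most frequent difference between consecutive timestamps."""
    return index.to_series().diff().dropna().mode()[0]


# One hour of 10-minute samples (00:00 to 01:00)
ten_min = pd.date_range("2024-01-01 00:00", periods=7, freq=pd.Timedelta(minutes=10))

# Followed by either 3 or 24 hourly samples starting at 02:00
hourly_short = pd.date_range("2024-01-01 02:00", periods=3, freq=pd.Timedelta(hours=1))
hourly_long = pd.date_range("2024-01-01 02:00", periods=24, freq=pd.Timedelta(hours=1))

# Six 10-minute steps outnumber three hourly steps
interval_a = most_common_interval(ten_min.append(hourly_short))

# Twenty-four hourly steps outnumber six 10-minute steps
interval_b = most_common_interval(ten_min.append(hourly_long))
```

Here `interval_a` comes out as 10 minutes and `interval_b` as one hour, so the generated index (and the number of NaN rows) changes with the mix of data rather than with any explicit setting.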

Note that these gaps do not exist in the level 3 files, potentially because there is a resample in join_l3 after we merge data covering different periods:

v = pypromice.resources.load_variables(variables)
m = pypromice.resources.load_metadata(metadata)
if outpath is not None:
    prepare_and_write(l3_merged, outpath, v, m, "60min")
    prepare_and_write(l3_merged, outpath, v, m, "1D")
    prepare_and_write(l3_merged, outpath, v, m, "M")
