Gap filling with NaN values added to Level 2 #283

Open
wants to merge 1 commit into
base: develop
30 changes: 30 additions & 0 deletions src/pypromice/process/L1toL2.py
@@ -217,6 +217,7 @@ def toL2(


    ds = clip_values(ds, vars_df)
    ds = fill_gaps(ds)
Contributor

Are the nan values still available in the output files or are they removed later in the pipeline?

write.py#L166 Lcsv = Lx.to_dataframe().dropna(how="all")

write.py#L471 df = df.dropna(how="all")
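
The question above can be checked on a toy example. The sketch below uses a hypothetical three-sample frame (the column names `t_u` and `p_u` and all values are made up for illustration) and applies the `dropna(how="all")` pattern quoted from write.py after a reindex onto a regular grid. Under these assumptions, the rows added by gap filling contain only NaN and are dropped again, while rows with at least one valid value survive.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly record with a gap: 12:00 and 13:00 are missing
times = pd.to_datetime(["2024-01-01 10:00", "2024-01-01 11:00", "2024-01-01 14:00"])
df = pd.DataFrame(
    {"t_u": [-5.0, -4.5, -3.0], "p_u": [850.0, np.nan, 851.0]},
    index=times,
)

# Gap filling in the spirit of fill_gaps: reindex onto a regular hourly grid
full_index = pd.date_range(times.min(), times.max(), freq="60min")
filled = df.reindex(full_index)

# Call pattern quoted from write.py: drop rows where every column is NaN
written = filled.dropna(how="all")

print(len(filled), len(written))
```

If the NaN rows are meant to reach the output files, the drop in the write step would have to be skipped or the reindex applied after it; which of the two fits the pipeline is not something this sketch can decide.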

Member

I also think that it should be handled in the write step rather than in the processing.

Actually I now remember that the resample_dataset is called when writing the L2 data:

# Write out level 2
if outpath is not None:
    if not os.path.isdir(outpath):
        os.mkdir(outpath)
    if aws.L2.attrs['format'] == 'raw':
        prepare_and_write(aws.L2, outpath, aws.vars, aws.meta, '10min')
    prepare_and_write(aws.L2, outpath, aws.vars, aws.meta, '60min')

I thought this would fill the gaps within L2_raw and L2_tx data with NaN but apparently it doesn't!
So maybe it could be fixed there?

I can see that I did not make another resample after joining the raw and tx L2 data:

# Resample to hourly, daily and monthly datasets and write to file
prepare_and_write(all_ds, outpath, variables, metadata, resample = False)

So there may be some gaps between the raw and tx data. For instance, for a station that failed in Jan 2024 and was visited in Jun 2024, we'll have raw data until Jan 2024 and tx data from Jun 2024.

Also note that we use the attribute aws.L2.attrs['format'] to determine whether a 10 min resample is needed on top of an hourly resample. aws.L2.attrs['format'] is inherited from aws.L1A.attrs['format'], which is inherited from aws.L1[-1].attrs['format']: the format of the last logger file. If that last file is an "STM" file, then no 10 min data is produced, even though there has been 10 min data in older logger files.

The use of .mode is also difficult to control: when both raw and STM data are present, the result depends on the number of occurrences of each sample rate.
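
A small sketch of this sensitivity, assuming a made-up record of three hours of hourly data followed by one hour of 10 min data. The helper `common_step` is hypothetical and only mirrors the `np.diff` plus `.mode()` logic of fill_gaps. The mode counts samples, not elapsed time, so the 10 min step wins here even though most of the record is hourly, and dropping a few samples flips the result.

```python
import numpy as np
import pandas as pd

# Hypothetical record: 00:00-03:00 hourly, then 03:10-04:00 every 10 minutes
hourly = pd.date_range("2024-01-01 00:00", periods=4, freq="60min")
ten_min = pd.date_range("2024-01-01 03:10", periods=6, freq="10min")
time = hourly.append(ten_min)


def common_step(index):
    """Most frequent spacing between consecutive timestamps (hypothetical helper)."""
    diffs = pd.Series(np.diff(index.values))
    return pd.Timedelta(diffs.mode()[0])


# Three 60 min steps against six 10 min steps: the 10 min step is the mode
print(common_step(time))

# Keeping only the first six samples leaves three 60 min and two 10 min steps
print(common_step(time[:6]))
```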

Note that these gaps do not exist in the level 3 files, potentially because there is a resample in join_l3 after we merge data covering different periods:

v = pypromice.resources.load_variables(variables)
m = pypromice.resources.load_metadata(metadata)
if outpath is not None:
    prepare_and_write(l3_merged, outpath, v, m, "60min")
    prepare_and_write(l3_merged, outpath, v, m, "1D")
    prepare_and_write(l3_merged, outpath, v, m, "M")

    return ds


@@ -770,6 +771,35 @@ def calcCorrectionFactor(Declination_rad, phi_sensor_rad, theta_sensor_rad,

    return CorFac_all

def fill_gaps(ds):
    '''Fill data gaps with nan values

    Parameters
    ----------
    ds : xarray.Dataset
        Data set to gap fill

    Returns
    -------
    ds_filled : xarray.Dataset
        Gap-filled dataset
    '''
    # Determine time range of dataset
    min_date = ds.to_dataframe().index.min()
    max_date = ds.to_dataframe().index.max()

    # Determine common time interval
    time_diffs = np.diff(ds['time'].values)
    common_diff = pd.Timedelta(pd.Series(time_diffs).mode()[0])

    # Determine gap filled index
    full_time_range = pd.date_range(start=min_date,
Contributor

@ladsmund, Aug 20, 2024

How will it behave if there is a relative time shift in the date time indexes?

Let's say we have a time series where the indices come from two periods, 01 and 02:

period_01_indexes = [2023-01-03T10:00,  2023-01-03T11:00, ....,2023-01-05T02:00]
period_02_indexes = [2023-06-01T14:10,  2023-06-01T15:10, ....,2024-07-12T18:10]

In this case, there will be a gap between 2023-01-05T02:00 and 2023-06-01T14:10. Both periods have hourly sample rates but the second part has an offset of 10 minutes.

In this case, the generated indexes from pd.date_range will ignore the offset in the second period.

  • How will xarray.Dataset.reindex behave when the indexes are slightly off?
  • Is it an irrelevant case? Why?
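
The scenario above can be tried on a shortened, hypothetical version of the two periods (the variable name `t_u` and the values are made up). The sketch assumes the grid is anchored on the first timestamp, as `pd.date_range` does with a `start` argument. With the default exact matching, `reindex` keeps only timestamps that fall exactly on the grid, so every sample of the offset period is replaced by NaN. Passing `method="nearest"` with a `tolerance` is one possible way to keep them, at the cost of re-stamping those samples onto the grid.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two hypothetical hourly periods; the second is offset by 10 minutes
period_01 = pd.date_range("2023-01-03 10:00", periods=3, freq="60min")
period_02 = pd.date_range("2023-01-03 15:10", periods=3, freq="60min")
time = period_01.append(period_02)
ds = xr.Dataset({"t_u": ("time", np.arange(6.0))}, coords={"time": time})

# Regular hourly grid anchored on the first timestamp (10:00 ... 17:00)
grid = pd.date_range(time.min(), time.max(), freq="60min")

# Exact matching: offset timestamps are not on the grid and are lost
exact = ds.reindex({"time": grid}, fill_value=np.nan)

# Nearest matching within 15 minutes: offset samples are moved onto the grid
nearest = ds.reindex(
    {"time": grid}, method="nearest", tolerance=pd.Timedelta("15min")
)

print(int(exact["t_u"].notnull().sum()), int(nearest["t_u"].notnull().sum()))
```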

Member

I do think that it is a relevant issue.

We have cases where daily winter transmissions are taken as isolated hourly values and surrounded by NaN in the resampling process:
#244

We should have better handling of mixed sample rates, potentially with a time_bnds variable, which I have seen in many other CF-compliant datasets.
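
As a rough illustration of the time_bnds idea, the sketch below attaches CF-style cell bounds to a hypothetical dataset that mixes two hourly samples with one daily sample. It assumes that each timestamp marks the end of its averaging window; the variable name `t_u`, the dimension name `nv` and all values are made up. With explicit bounds, the length of each sample's window can be read from the data instead of being inferred from the spacing of the time index.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical mixed-rate record: two hourly samples, then one daily sample
time = pd.to_datetime(["2024-01-01 01:00", "2024-01-01 02:00", "2024-01-02 00:00"])
width = pd.to_timedelta(["1h", "1h", "1D"])

# Assumption: each timestamp is the END of its averaging window
bounds = np.stack([(time - width).values, time.values], axis=1)

ds = xr.Dataset(
    {
        "t_u": ("time", [-5.0, -4.8, -6.1]),
        "time_bnds": (("time", "nv"), bounds),
    },
    # CF convention: the coordinate names its bounds variable via "bounds"
    coords={"time": ("time", time, {"bounds": "time_bnds"})},
)

# Window length of each sample, recovered from the bounds
window = ds["time_bnds"].isel(nv=1) - ds["time_bnds"].isel(nv=0)
print(window.values)
```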

                                    end=max_date,
                                    freq=common_diff)

    # Apply gap-filled index to dataset
    ds_filled = ds.reindex({'time': full_time_range}, fill_value=np.nan)
    return ds_filled

def _checkSunPos(ds, OKalbedos, sundown, sunonlowerdome, TOA_crit_nopass):
    '''Check sun position