
Pyaerocom's method for time colocation can be inadequate for some edge cases #762

Closed
lewisblake opened this issue Oct 27, 2022 · 3 comments

@lewisblake

Story time.

Previously, @dulte raised the issue of "Missing model data is interpolated when colocate_time=True " (#736) and tried to address it through a masking approach in #755.

I offered to help with this, which led me to _colocate_site_data_helper_timecol and some discoveries.

  1. In order to colocate in time, pyaerocom resamples both series to the lowest (coarsest) temporal resolution, whose time grid may also be offset between the model and the observations.
  2. The DataFrames from this resampling are then concatenated using the keys from both the ref (aka observations) and data (aka model).
  3. The concatenated DataFrame, called merged, then has its values interpolated according to the observation's index obs_index (see the sketch after this list).
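
As a minimal illustration of those three steps, here is a pandas-only sketch, not pyaerocom's actual implementation; the series contents, the variable names (model, obs, obs_index), and the 13-day frequency are invented to mirror the test case below.

```python
import pandas as pd

# Illustrative stand-ins: daily "model" data starting 2010-01-03 and
# 13-daily "obs" starting 2010-01-01 (constant values, purely for shape).
model = pd.Series(10.0, index=pd.date_range("2010-01-03", "2010-12-31", freq="D"))
obs = pd.Series(10.0, index=pd.date_range("2010-01-01", "2010-12-31", freq="13D"))

# Step 1: resample the model to the coarsest resolution (13 days). The new
# bins start at the model's own first timestamp, so the two 13-day grids
# end up offset by two days.
model_13d = model.resample("13D").mean()

# Step 2: concatenate on the union of both (offset) indices.
merged = pd.concat([obs, model_13d], axis=1, keys=["ref", "data"])

# Step 3: interpolate along the index and reindex onto the observation times.
obs_index = obs.index
merged = merged.interpolate(method="index").reindex(obs_index)
print(merged.head())
```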

Here is the practical meaning of this, based on a test, test__colocate_site_data_helper_timecol, using the fake station data S4 and S3 from the parameterization. If you are interested in recreating these results, you can do so by running pytest on this specific function.

In this example, we are considering concpm10. The model output stat_data is at a daily resolution starting 2010-01-03:

stat_data
StationData: {'dtime': array(['2010-01-03T00:00:00.000000000', '2010-01-04T00:00:00.000000000',
       '2010-01-05T00:00:00.000000000', '2010-01-06T00:00:00.000000000',
       '2010-01-07T00:00:00.000000000', '2010-01-08T00:00:00.000000000',
       '2010-01-09T00:00:00.000000000', '2010-01-10T00:00:00.000000000',
       '2010-01-11T00:00:00.000000000', '2010-01-12T00:00:00.000000000',
       '2010-01-13T00:00:00.000000000', '2010-01-14T00:00:00.000000000',
...

The observations, stat_data_ref, are at a somewhat unusual (but by no means out of the question) time resolution, "ts_type": "13daily", starting 2010-01-01:

stat_data_ref
StationData: {'dtime': array(['2010-01-01T00:00:00.000000000', '2010-01-14T00:00:00.000000000',
       '2010-01-27T00:00:00.000000000', '2010-02-09T00:00:00.000000000',
       '2010-02-22T00:00:00.000000000', '2010-03-07T00:00:00.000000000',
       '2010-03-20T00:00:00.000000000', '2010-04-02T00:00:00.000000000',
       '2010-04-15T00:00:00.000000000', '2010-04-28T00:00:00.000000000',
...

After resampling to the lowest (coarsest) resolution, the model output stat_data, which was available at a daily frequency, is now at a 13-day ("biweekly") frequency.

stat_data
StationData: {'dtime': array(['2010-01-03T00:00:00.000000000', '2010-01-16T00:00:00.000000000',
       '2010-01-29T00:00:00.000000000', '2010-02-11T00:00:00.000000000',
       '2010-02-24T00:00:00.000000000', '2010-03-09T00:00:00.000000000',
       '2010-03-22T00:00:00.000000000', '2010-04-04T00:00:00.000000000',
       '2010-04-17T00:00:00.000000000', '2010-04-30T00:00:00.000000000',
       '2010-05-13T00:00:00.000000000', '2010-05-26T00:00:00.000000000',

The indices where the observations are null are stored in obs_isnan, and the obs_idx is also stored. The two DataFrames are then concatenated into merged. Note the differences in the time steps.

merged
             ref       data
2010-01-01  10.0        NaN
2010-01-03   NaN   6.923077
2010-01-14   NaN        NaN
2010-01-16   NaN  10.000000
2010-01-27  10.0        NaN
...          ...        ...
2011-10-28   NaN  10.000000
2011-11-10   NaN  10.000000
2011-11-23   NaN  10.000000
2011-12-06   NaN  10.000000
2011-12-19   NaN  10.000000

The merged DataFrame is then interpolated using the "index" method from Pandas and reindexed according to the observation's index.

merged
             ref       data
2010-01-01  10.0        NaN
2010-01-14  10.0   9.526627
2010-01-27  10.0  10.000000
2010-02-09  10.0  10.000000
2010-02-22  10.0  10.000000
2010-03-07  10.0  10.000000
2010-03-20  10.0  10.000000
2010-04-02  10.0  10.000000
...

Note that the model output data still has a NaN in it (the point of the original issue). Also, due to resampling the model to the lowest resolution, places where we previously had data are now filled with interpolated values. This is the so-called "phantom model data" from #736.

Moreover, the index-based interpolation scheme fails to consider what the values actually represent. For example, data reported at a monthly resolution might represent the mean concpm10 over that month, but index-based interpolation will simply return a value linearly interpolated between the neighboring valid entries, weighted by position in the index. When dealing with daily, weekly, and monthly data, this is less of an issue. However, for data which are collected over non-standard intervals and reported at non-standard frequencies (e.g., EBAS, VOCs), this approach can be highly sub-optimal.
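
To make that concrete, here is a hypothetical worked example (the values are invented): two monthly means bracket a mid-month timestamp, and index-based interpolation returns a calendar-weighted blend that is not the mean of any period.

```python
import pandas as pd

# Invented monthly means: January averages 10, February averages 20.
monthly = pd.Series([10.0, 20.0],
                    index=pd.to_datetime(["2010-01-01", "2010-02-01"]))

# Asking for a value at January 16 via index-based interpolation yields
# ~14.84 (15/31 of the way between the stamps), which is neither January's
# mean nor any quantity the monthly averages define at that date.
with_midpoint = monthly.reindex(
    monthly.index.union([pd.Timestamp("2010-01-16")])
)
print(with_midpoint.interpolate(method="index"))
```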

@dulte and I brought this up with @jgriesfeller, and he suggested that the best way to proceed is to resample to the finest temporal resolution instead, thereby reducing but not eliminating the errors. A scientific discussion should probably motivate the programming solution in this case, and implementing that solution will likely be challenging.
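
One possible reading of that suggestion, sketched under assumptions (the ffill and its 12-day limit presume a 13-daily value represents its whole reporting period, which is exactly the kind of question the scientific discussion would need to settle):

```python
import pandas as pd

# Same illustrative series as in the sketch above.
model = pd.Series(10.0, index=pd.date_range("2010-01-03", "2010-12-31", freq="D"))
obs = pd.Series(10.0, index=pd.date_range("2010-01-01", "2010-12-31", freq="13D"))

# Broadcast each 13-daily observation across the period it covers instead of
# coarsening the daily model, so no model values are invented by resampling.
obs_daily = obs.resample("D").ffill(limit=12)
merged_fine = pd.concat([obs_daily, model], axis=1, keys=["ref", "data"])
```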

@lewisblake added the "invalid" label on Oct 27, 2022

@lewisblake

Note: In colocate_gridded_ungridded, line 727, the model data is temporally cropped. But what about the observations?

@lewisblake commented Nov 7, 2022

Observations are also cropped, assuming a climatology reference is not used, starting on lines 748 and 749. There, obs_start and obs_stop are defined in terms of the model start and stop. These are passed to data_ref.to_station_data_all(), which applies them to each of the StationData objects in the UngriddedData object data_ref (I followed this through the stack). Apart from the tests, _colocate_site_data_helper_timecol is currently only called later on, in colocation.py::colocate_gridded_ungridded. So the fix @avaldebe mentioned in the meeting today, ignoring observations outside the model time domain, is already implemented, which is equivalent to accounting for the time delta in this case. Although test__colocate_site_data_helper_timecol covers a case in which the obs and model have a different start and stop, this is not actually a concern in the code. Perhaps _colocate_site_data_helper_timecol could be rewritten to raise an Exception if the start and stop differ between model and obs, but that test would need to be rewritten, and this does actually occur in practice.
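
For reference, the cropping behavior described above amounts to something like the following pandas-level sketch (not pyaerocom's actual call chain; the series and window bounds are illustrative):

```python
import pandas as pd

# Hypothetical observations extending beyond the model time domain.
obs = pd.Series(10.0, index=pd.date_range("2009-12-01", "2011-01-31", freq="13D"))
obs_start, obs_stop = pd.Timestamp("2010-01-03"), pd.Timestamp("2010-12-31")

# Observations outside [obs_start, obs_stop] (defined from the model start
# and stop) are dropped before colocation.
obs_cropped = obs.loc[obs_start:obs_stop]
```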
