
open_datasets fails to open GRIB messages of same parameter with different forecastTime values, silently skipping them #344

Open
Metamess opened this issue Jul 23, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@Metamess
Contributor

What happened?

If a GRIB file contains messages for a parameter (like 'Total precipitation' or 'tp') expressed as an average ('stepType' = 'avg'), where one message describes the average over the preceding hour and another describes the average since the reference time (t=0, model start, start of prediction, etc.), then cfgrib's open_datasets function is unable to distinguish between these two messages, and consequently only includes one (the first). The second message will not be present in the result, without any hiccup or indication to the user that the data returned is not, in fact, all the data from the GRIB file.

NOTE: This would happen for any stepType that describes some form of time interval (average, accumulation, maximum, etc.). It would also ignore any number of messages beyond the first, if more than one is present in the GRIB file.

I have also identified the cause of this behaviour, and potentially a (start for a) fix.

Within cfgrib, when opening a GRIB file, the enforce_unique_attributes function (in dataset.py) is used as the first step in build_variable_components to ensure that the resulting dataset is a valid hypercube. The error it raises when the check fails is used by raw_open_datasets (in xarray_store.py) to keep refining a set of filter_by_keys values until the entire GRIB file can be read into hypercubes without conflicts.

Inside the GRIB message, the time interval of the data is encoded via 'forecast time' (octets 19-22 in Section 4 of the GRIB message, called 'forecastTime' by eccodes). For a message, say, 16 hours ahead of the reference time: if the stepType is 'instant', forecastTime would be 16; if the stepType is 'avg' and the data describes the average over the preceding hour, forecastTime would be 15; and if the data describes the average since the reference time, forecastTime would be 0.
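To make the three cases concrete, here they are as plain Python dicts (illustrative only, not real eccodes output — the values follow the paragraph above):

```python
# Illustrative header values for a message valid 16 hours after the
# reference time, depending on stepType (per the description above).
messages = [
    {"stepType": "instant", "forecastTime": 16},  # value at t=16
    {"stepType": "avg", "forecastTime": 15},      # average over t=15..16
    {"stepType": "avg", "forecastTime": 0},       # average over t=0..16
]

# The two 'avg' messages agree on every key except forecastTime:
# only that field tells the two averaging intervals apart.
avg_times = [m["forecastTime"] for m in messages if m["stepType"] == "avg"]
assert avg_times == [15, 0]
```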

The problem is that the set of attribute keys provided to enforce_unique_attributes (DATA_ATTRIBUTES_KEYS) does not include this attribute, or any derived attribute (stepRange for example). If you add "forecastTime" to the list DATA_ATTRIBUTES_KEYS, the messages are correctly distinguished and all present in the resulting datasets.

While it is possible to supply read_keys as a kwarg to open_datasets, these keys only come in with the extra_keys in build_variable_components and are not used to enforce unique attributes. I have tried this, but it does not result in the 'lost' messages appearing in the output datasets.

You can use backend_kwargs={"filter_by_keys": {"forecastTime": <some_value>}} to get the separate messages, but that requires knowing all the possible values ahead of time, and even knowing that this problem occurs in the first place. It is my understanding that the point of the open_datasets() function is to be able to fully read in a GRIB file without such prior knowledge. As it stands, you simply don't get the data, and you wouldn't know you are missing some of the GRIB messages until you fully compare the output datasets to the input GRIB file.
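The limitation of the filter_by_keys workaround can be shown with a small stand-in for the filtering step (a simulation, not cfgrib's code — cfgrib applies the same kind of match against the message index):

```python
# Mimic the effect of filter_by_keys: keep only messages matching
# every key/value pair in the criteria.
def filter_by_keys(messages, criteria):
    return [m for m in messages
            if all(m.get(k) == v for k, v in criteria.items())]

messages = [
    {"shortName": "tp", "forecastTime": 0},   # average since t=0
    {"shortName": "tp", "forecastTime": 9},   # average over the last hour
]

# Each call recovers exactly one message, but only because we already
# knew the two forecastTime values (0 and 9) present in the file.
assert filter_by_keys(messages, {"forecastTime": 0}) == [messages[0]]
assert filter_by_keys(messages, {"forecastTime": 9}) == [messages[1]]
```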

The reason I am unsure if adding 'forecastTime' to DATA_ATTRIBUTES_KEYS is a desirable fix, is that it results in potentially undesirable behaviour when opening GRIB files containing messages spanning multiple timesteps. I believe that the varying values of the forecastTime attribute would force what is effectively the same parameter into different datasets. That might mean a different solution is required, or that some more work is required to prevent this from happening when it is not desired. Perhaps different attributes like lengthOfTimeRange can be of help.

What are the steps to reproduce the bug?

  • Get a GRIB file with multiple messages for the same parameter and time, but with differing time intervals.
    • I recommend a GRIB file from NCEP. One can be downloaded with ease from https://nomads.ncep.noaa.gov/gribfilter.php?ds=gfs_0p25_1hr . Make sure to select a file some time past t=0, say 10 hours ahead (the file ending with f010). Select 'ACPCP' or 'APCP' as Parameter, leave Levels at 'All' ('surface' is the only provided level for these parameters), and enter some small subregion to save data. NCEP provides these two parameters as averages both since t=0 and since the most recent 6-hour interval. This means that timestep 10 will have an average over the past 10 hours and an average over the past 4 hours, i.e. since t=6.
  • Verify using a tool like grib_ls that the GRIB file includes 2 messages for these parameters.
  • Attempt to open the file with cfgrib.open_datasets().
  • Observe that there is only one entry per parameter.

Version

0.9.10.4

Platform (OS and architecture)

WSL2 Ubuntu 22.04.2 LTS

Relevant log output

No response

Accompanying data

No response

Organisation

No response

@Metamess Metamess added the bug Something isn't working label Jul 23, 2023
@blaylockbk
Contributor

This is related to #187 and #321. One potential solution is using xarray Datatree as proposed in #327 and xarray-contrib/datatree#195. There is a WIP here: pydata/xarray#7437

@Metamess
Contributor Author

Thanks for the response @blaylockbk ! (and apologies for the delayed reply). It was quite a read, but that datatree looks like a very interesting project! It could indeed solve some of the issues that currently exist with mapping GRIBs to Datasets. I will definitely have a go at incorporating it in some of my scripts!

As for the issue that cfgrib silently skips GRIB messages if multiple messages have the same values for the keys the index differentiates on: Do you have a suggestion on what the desired behavior should be? I was thinking maybe a warning could be logged in these cases, to at least inform the user that the resulting data is not all of the data in the GRIB file that was opened?

@blaylockbk
Contributor

Do you have a suggestion on what the desired behavior should be? I was thinking maybe a warning could be logged in these cases, to at least inform the user that the resulting data is not all of the data in the GRIB file that was opened?

Yes, it would be very helpful to be told that cfgrib skipped something. My preference is that cfgrib would read all the data by default, just so a user doesn't misunderstand what's in the file; I think "open_dataset" should in fact read all the available data. Perhaps the data could be stacked along a new dimension or returned as a datatree.

@Metamess
Contributor Author

Metamess commented Aug 17, 2023

I will try and see if I can make a PR that makes cfgrib log a warning message when a GRIB message is skipped. Potentially that could even use the errors parameter, which is set to "warn" by default but can be set to "raise" to raise an exception instead; I'm just not fully certain that these cases would warrant an exception. Perhaps @iainrussell could provide some input on this?
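A minimal sketch of what that warn-or-raise behaviour could look like (the helper name and message text are hypothetical, modeled on cfgrib's existing errors="warn"/"raise" parameter):

```python
import logging

logger = logging.getLogger(__name__)

def handle_skipped_message(description, errors="warn"):
    """Hypothetical helper: report a GRIB message that would otherwise
    be silently dropped because it duplicates an existing index key."""
    note = "GRIB message skipped (duplicate index key): %s" % description
    if errors == "raise":
        raise ValueError(note)
    logger.warning(note)

# errors='warn' just logs; errors='raise' surfaces the problem instead.
handle_skipped_message("tp, stepType=avg, forecastTime=0")
try:
    handle_skipped_message("tp, stepType=avg, forecastTime=0", errors="raise")
    raised = False
except ValueError:
    raised = True
assert raised
```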

Returning the data as a datatree could be a full solution, but that would require a lot more work, especially since as of right now cfgrib does not use datatrees at all. While I would welcome it, I don't think I can spare the time personally, and I also don't know what cfgrib's maintainers' opinions are on it.

My preference is that cfgrib would read all the data by default

I have tried to dive into the issue, and I think I have to conclude that it is simply not possible. I will write out my findings below for future reference; maybe it can help someone (or someone can point out a mistake!). TL;DR: I don't think it can be done, and the user will always need to work around the issue by supplying specifics through a combination of the read_keys, extra_coords and filter_by_keys backend_kwargs.

One other thing I can conclude, though, is that there is currently no mechanism through which the user can influence which fields are used by enforce_unique_attributes. This would be required for cfgrib to automatically differentiate between such messages. (While a combination of read_keys and filter_by_keys (as noted in #187) could be used, this requires the user to be able to exhaustively supply all possible values for the key.) Perhaps another backend_kwarg could be added, say forced_unique_keys. These keys would behave similarly to those passed to read_keys, but additionally prevent messages from being combined into a Dataset. @iainrussell would you welcome a PR adding such functionality?


Consider an example with 2 GRIB messages per timestep for the same variable "acpcp" (accumulated precipitation), but different accumulation 'schemes': one since t=0 (let's call it 'total'), and one since t=t-1 ('hourly').

  • The ideal outcome would be 1 Dataset with 2 variables. But this would require the two variables to have different names, and since the names are derived from the parameter encoding, which is the same for both, this can't be done. So then we would want 2 Datasets, each with one variable.
  • To distinguish between the two, we need to choose a field (that will be provided to the internal call to enforce_unique_attributes) that differs between the two variables, but is the same for subsequent timesteps of the same variable. There is no such field in the GRIB file. Firstly, there is no field that inherently differentiates between the two accumulation schemes. Secondly, fields like forecastTime, startStep, endStep, stepRange and lengthOfTimeRange all inform about the time interpretation of this one message, but the actual scheme behind the parameter over multiple messages cannot be inferred without outside knowledge. In fact, at t=1 the GRIB messages are bit-for-bit identical, as both schemes provide an accumulation at t=1 since t=0.
    • startStep and forecastTime are always t=0 for 'total', but always differ for 'hourly'. This would mean 'total' would end up in 1 Dataset with a time dimension, but 'hourly' would get a Dataset per timestep, as the mechanic that allows us to not combine the messages is, well, not allowing us to combine the messages.
    • lengthOfTimeRange would be the same for 'hourly' but different for 'total', leading to the same issue.
    • stepRange is going to be different at every timestep for both.
    • endStep is always going to be the same for both.
  • None of the above fields should become a dimension. The 'scheme' could be, but that is not discernible from the message.
  • The most promising idea would be to instead use a field like stepRange as a secondary coordinate for the time dimension (step). This can actually already be done by supplying backend_kwargs={"extra_coords": {"stepRange": "step"}} to open_datasets(). But this will cause cfgrib to raise a ValueError, because there is no defined behavior on what to do if there are two distinct values for a coordinate. And in fact, I don't think there could be well defined behavior that would work for any generalized case. And even if you deal with this by splitting the messages with differing values for the coordinate into separate variables, you end up back with the issue of differentiating based on a field, and get a Dataset per message. As far as I can see, there can be no generalized way to figure out which coordinate for t=n should be combined with a coordinate for t=n+1.
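The field-by-field argument above can be checked with a toy timeline for the two schemes (values follow the example: endStep is the valid time for both, and the header values are illustrative, not read from a real file):

```python
# Toy per-timestep header values for the two accumulation schemes.
# 'total'  accumulates since t=0; 'hourly' accumulates over the last hour.
total  = [{"startStep": 0,     "endStep": t, "lengthOfTimeRange": t} for t in range(1, 5)]
hourly = [{"startStep": t - 1, "endStep": t, "lengthOfTimeRange": 1} for t in range(1, 5)]

# startStep: constant for 'total', but varies per timestep for 'hourly'.
assert len({m["startStep"] for m in total}) == 1
assert len({m["startStep"] for m in hourly}) > 1
# lengthOfTimeRange: constant for 'hourly', but varies for 'total'.
assert len({m["lengthOfTimeRange"] for m in hourly}) == 1
assert len({m["lengthOfTimeRange"] for m in total}) > 1
# endStep: identical between the two schemes at every timestep.
assert [m["endStep"] for m in total] == [m["endStep"] for m in hourly]
# At t=1 the two messages are indistinguishable (both accumulate t=0..1),
# mirroring the bit-for-bit identical GRIB messages mentioned above.
assert total[0] == hourly[0]
```

No field is constant within one scheme while also differing between the schemes, which is what makes a generic, automatic split impossible.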
