open_datasets fails to open GRIB messages of same parameter with different forecastTime values, silently skipping them #344
Comments
This is related to #187 and #321. One potential solution is using xarray Datatree as proposed in #327 and xarray-contrib/datatree#195. There is a WIP here: pydata/xarray#7437
Thanks for the response @blaylockbk! (and apologies for the delayed reply). It was quite a read, but datatree looks like a very interesting project! It could indeed solve some of the issues that currently exist with mapping GRIBs to Datasets. I will definitely have a go at incorporating it in some of my scripts! As for the issue that cfgrib silently skips GRIB messages if multiple messages have the same values for the keys the index differentiates on: do you have a suggestion for what the desired behavior should be? I was thinking a warning could be logged in these cases, to at least inform the user that the resulting data is not all of the data in the GRIB file that was opened.
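A minimal sketch of the warning idea, not cfgrib's actual code: it assumes GRIB messages can be stood in for by plain dicts and that an index keeps only the first message per tuple of index-key values. The function `index_messages`, the logger name, and the sample key values are hypothetical and only serve the illustration.

```python
import logging

logger = logging.getLogger("grib_index_sketch")  # hypothetical logger name


def index_messages(messages, index_keys):
    """Keep the first message per index-key tuple and warn about the rest.

    `messages` is a list of dicts standing in for GRIB messages; this is an
    illustration only and does not call cfgrib or eccodes.
    """
    index = {}
    skipped = []
    for number, message in enumerate(messages, start=1):
        key = tuple(message.get(name) for name in index_keys)
        if key in index:
            # A later message maps onto an already-used key: report it
            # instead of dropping it silently.
            skipped.append(number)
            logger.warning(
                "skipping GRIB message %d: index keys %s duplicate an earlier message",
                number,
                dict(zip(index_keys, key)),
            )
        else:
            index[key] = message
    return index, skipped


# Two made-up messages that differ only in forecastTime.
messages = [
    {"shortName": "acpcp", "stepType": "accum", "step": 10, "forecastTime": 0},
    {"shortName": "acpcp", "stepType": "accum", "step": 10, "forecastTime": 6},
]

# Without forecastTime among the index keys, the second message is skipped.
kept, skipped = index_messages(messages, ["shortName", "stepType", "step"])

# With forecastTime included, both messages are kept.
kept_all, skipped_all = index_messages(
    messages, ["shortName", "stepType", "step", "forecastTime"]
)
```

In this toy setup the user at least sees one log line per dropped message, which is the behaviour the warning proposal asks for.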
Yes, it would be very helpful to be told that cfgrib skipped something. My preference is that cfgrib would read all the data by default just so a user doesn't misunderstand what's in the file; I think "open_dataset" should in fact read all the available data. Perhaps the data could be stacked along a new dimension or returned as a datatree.
I will try and see if I can make a PR that makes cfgrib log a warning message when a GRIB message is skipped. Potentially that could even use the Returning the data as a datatree could be a full solution, but that would require a lot more work, especially since as of right now cfgrib does not use datatrees at all. While I would welcome it, I don't think I can spare the time personally, and I also don't know what cfgrib's maintainers' opinions are on it.
I have tried to dive into the issue, and I think I have to conclude that it is simply not possible. I will try to write out my findings below for future reference, and maybe it can help someone (or someone points out a mistake!). TL;DR: I don't think it can be done, and the user will always need to work around the issue by supplying specifics through a combination of the One other thing I can conclude though is that there is currently no mechanism through which the user can influence what fields are used by Consider an example with 2 GRIB messages per timestep for the same variable "acpcp" (accumulated precipitation), but different accumulation 'schemes': one since t=0 (let's call it 'total'), and one since t=t-1 ('hourly').
What happened?
If a GRIB file has messages for a parameter (like 'Total precipitation' or 'tp') which is expressed as an average ('stepType' = 'avg'), and one message describes the average of the preceding hour, and one describes the average since the reference time (t=0, model start, start of prediction, etc.), then cfgrib's open_datasets function is unable to recognize the difference between these two messages, and consequently only includes one (the first). The second message will not be present in the result, without any hiccup or indication to the user that the data returned is not, in fact, all the data from the GRIB file.
NOTE: This would happen with any stepType that describes some form of time interval (average, accumulation, maximum, etc.). It would also ignore any number of messages past the first, if more than one is present in the GRIB file.
I have also identified the cause of this behaviour, and potentially a (start for a) fix.
Within cfgrib, when opening a GRIB file, the `enforce_unique_attributes` function (in `dataset.py`) is used as the first step in `build_variable_components` to ensure that the resulting dataset is a valid hypercube. The error raised when it is not is used by `raw_open_datasets` (in `xarray_store.py`) to keep refining a set of `filter_by_keys` values until the entire GRIB file can be read into hypercubes without conflicts.
Inside the GRIB message, the time interval of the data is encoded via 'forecast time' (octets 19-22 in Section 4 of the GRIB message, called 'forecastTime' by eccodes). For a message, say, 16 hours ahead of the reference time, if the stepType is 'instant', forecastTime would be 16. If the stepType is 'avg' and the data describes the average over the preceding hour, forecastTime would be 15. And if the data describes the average since the reference time, forecastTime would be 0.
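The arithmetic in the paragraph above can be written out as a small helper. This is only a sketch of the relationship as described here, assuming all times are in hours; the function `forecast_time` and the set of interval step types are hypothetical names for illustration and are not part of cfgrib or eccodes.

```python
# Step types treated as "interval" types in this sketch (illustrative subset).
INTERVAL_STEP_TYPES = {"avg", "accum", "max", "min"}


def forecast_time(hours_ahead, step_type, interval_hours=None):
    """Return the forecastTime (in hours) for a message `hours_ahead` of the
    reference time.

    For 'instant' data, forecastTime is the lead time itself. For interval
    data, it is the start of the interval: `interval_hours` before the lead
    time, or 0 when the interval starts at the reference time
    (`interval_hours=None`).
    """
    if step_type == "instant":
        return hours_ahead
    if step_type not in INTERVAL_STEP_TYPES:
        raise ValueError(f"unsupported step type in this sketch: {step_type!r}")
    if interval_hours is None:
        return 0
    if not 0 < interval_hours <= hours_ahead:
        raise ValueError("interval must be positive and not exceed the lead time")
    return hours_ahead - interval_hours


print(forecast_time(16, "instant"))  # 16
print(forecast_time(16, "avg", 1))   # 15 (average over the preceding hour)
print(forecast_time(16, "avg"))      # 0  (average since the reference time)
```

The last two calls describe the same parameter at the same lead time, and they differ only in forecastTime, which is exactly the value that is not used to tell the messages apart.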
The problem is that the set of attribute keys provided to `enforce_unique_attributes` (`DATA_ATTRIBUTES_KEYS`) does not include this attribute, or any derived attribute (`stepRange`, for example). If you add "forecastTime" to the list `DATA_ATTRIBUTES_KEYS`, the messages are correctly distinguished and all present in the resulting datasets.
While it is possible to supply `read_keys` as a kwarg to open_datasets, these only come in with the `extra_keys` in `build_variable_components`, and are not used to enforce unique attributes. I have tried this, but it does not result in getting the 'lost' messages in the output datasets.
You can use `backend_kwargs={"filter_by_keys": {"forecastTime": <some_value>}}` to get the separate messages, but that requires that you know all the possible values ahead of time, and that you even know that this problem occurs. It is my understanding that the point of the `open_datasets()` function is to be able to fully read in a GRIB file without knowing this. As it stands, you simply don't get the data, and you wouldn't know you are missing some of the GRIB messages until you fully compare the output datasets to the input GRIB file.
The reason I am unsure whether adding 'forecastTime' to `DATA_ATTRIBUTES_KEYS` is a desirable fix is that it results in potentially undesirable behaviour when opening GRIB files containing messages spanning multiple timesteps. I believe that the varying values of the forecastTime attribute would force what is effectively the same parameter into different datasets. That might mean a different solution is required, or that some more work is required to prevent this from happening when it is not desired. Perhaps different attributes like lengthOfTimeRange can be of help.
What are the steps to reproduce the bug?
`f010`. Select 'ACPCP' or 'APCP' as Parameter, leave Levels set to 'All' ('surface' is the only provided level for these parameters), and enter some small subregion to save data. NCEP provides these two parameters as averages both since t=0 and since the most recent 6-hour interval. This means that timestep 10 will have an average over the past 10 hours and an average over the past 4 hours, i.e. since t=6.
`grib_ls` that the GRIB file includes 2 messages for these parameters
`cfgrib.open_datasets()`
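For a file produced by these steps, the `filter_by_keys` workaround mentioned earlier could be sketched as below. It assumes the distinct forecastTime values are already known (for example, read off the `grib_ls` output), which is the very limitation described above. The helper names and the file path in the usage note are hypothetical; only `cfgrib.open_datasets` with `backend_kwargs` comes from the issue text.

```python
def filter_kwargs_per_forecast_time(forecast_times):
    """Build one backend_kwargs dict per distinct forecastTime value."""
    return [
        {"filter_by_keys": {"forecastTime": value}}
        for value in sorted(set(forecast_times))
    ]


def open_all_forecast_times(path, forecast_times):
    """Open `path` once per forecastTime value and collect all datasets."""
    import cfgrib  # imported lazily so the helper above works without cfgrib

    datasets = []
    for backend_kwargs in filter_kwargs_per_forecast_time(forecast_times):
        datasets.extend(cfgrib.open_datasets(path, backend_kwargs=backend_kwargs))
    return datasets


# For the f010 file described above, the two messages per parameter would have
# forecastTime 0 (since t=0) and 6 (since t=6); duplicates are collapsed.
kwargs_list = filter_kwargs_per_forecast_time([0, 6, 0])
```

A call such as `open_all_forecast_times("subset.grib2", [0, 6])` (with a placeholder path) would then return the datasets for both intervals, at the cost of reading the file once per value.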
Version
0.9.10.4
Platform (OS and architecture)
WSL2 Ubuntu 22.04.2 LTS
Relevant log output
No response
Accompanying data
No response
Organisation
No response