Error using CESM variable names with data catalog #719
Comments
Hi @bitterbark, in my example CESM catalog generation, the standard_name was filled out; your csv does not have it. Can you double-check that your config yaml header list and your call to the catalog builder match what I passed along to you? Ref NOAA-GFDL/CatalogBuilder#89. @wrongkindofdoctor, could the missing standard_name be the root cause of the MDTF errors @bitterbark is seeing? |
It makes sense that missing standard_name would cause the problem. However, I'm not sure where it would get set. I see standard_name in the config yaml header list
As for the call to the catalog builder, do you mean this?
I haven't changed anything other than in catalogbuilder/intakebuilder/dat/gfdlcmipfreq.yaml
And using the cesm-template.yaml file you sent. |
@bitterbark the slow==True option is missing in your call. This is important for the builder to look for the standard name or long name in the header of the netcdf.
|
Thanks, looks like we're a step closer. The csv file now has the standard_name in it:
Unfortunately the MDTF error remains the same. Is the standard_name supposed to be enough? |
One aspect of the problem: src/preprocessor.py:query_catalog()
Vs. the catalog is reading the long_name from the variable and assigning standard name in the catalog based on that:
Is there something in place here that replaces the old variable-name translation files that we would have just specified which variable is which? |
The name translation seems to be happening okay: |
The catalog looks like it loaded fine, this is it printed as the MDTF runs. Update: if I query the catalog with just {"variable_id": "FLUT"} it matches the two pertinent records.
Whereas if you look at the catalog (html link at top of this post), the standard_name in the catalog is Upwelling_longwave_flux_at_top_of_model (from the long_name in the file). However, if I take standard_name out of the query, it still doesn't match. |
@bitterbark troubleshooting seems to be headed in the right direction. The catalog builder likely uses the long name because there was no standard name in the header. Does that sound about right? The other minor thing is that I see grid_label.1 as a column. It is likely coming from the catalog builder yaml configuration file, where grid_label is specified twice in the header. Just remove one of the occurrences. |
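A repeated header entry like the duplicated grid_label can be caught before building by counting the names in the header list. This is a minimal, stdlib-only sketch; the `headerlist` values and the helper name `duplicate_headers` are illustrative assumptions, not the actual contents of the template file or part of the catalog builder.

```python
from collections import Counter


def duplicate_headers(headerlist):
    """Return header names that occur more than once, in first-seen order."""
    counts = Counter(headerlist)
    return [name for name in dict.fromkeys(headerlist) if counts[name] > 1]


# Hypothetical header list with grid_label specified twice
headerlist = [
    "activity_id", "variable_id", "frequency", "grid_label",
    "standard_name", "grid_label", "path",
]

print(duplicate_headers(headerlist))
```

An empty list means every column name in the header is unique.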
It looks like the standard_name is causing the problem. The catalog builder for CESM files only has access to the long_name, which is not the same as the GFDL/CMIP standard_names. So when the MDTF searches the catalog, despite finding a match for the variable name thanks to the translation files, it fails because it requires that both the variable name and the standard_name match. Leaving off the standard_name still causes a failure because an empty string does not match the expected standard_name, either.

I understand there is a desire for everything in the catalog to be perfect and complete, but in this case the variable_id is already present and matches what MDTF knows from the variable name translation files, so having to match the standard_name as well as the variable name is redundant. The point of having the standard_name, I believe, is to deal with missing variable_ids, so it should not be necessary here.

It may seem desirable from MDTF's perspective to put the onus on the catalog building, but in order to make a catalog with a standard_name that matches the MDTF/CMIP requirements, CESM users are going to need a separate translation tool/lookup table for standard_name, adding duplication and complexity that seem contrary to the design of the MDTF to handle variable names from different models. If we want to go that route, it makes sense to put all of the variable translation methods in the catalog builder and just say the MDTF only runs on CMIP variable names. At least that way we won't have two different kinds of information required in two different places.

Alternatively, and potentially easier, could we keep the variable translation as is and query the catalog with the minimum necessary information: the variable_id and frequency (adding case name/path if we are supporting a catalog for multiple run names)? If there is agreement on this, I can put it in. Thanks for your opinions! Here is the testing process to prove this is the problem:
Testing a few simpler queries found that it works with just variable_id and frequency:
But adding the standard_name that MDTF is asking for (which doesn't match the long_name available in the CESM files) goes back to 0 results:
Note that leaving off the standard_name still causes a failure (I'm assuming the catalog search is not matching the empty string).
|
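The matching behaviour described in the comment above can be reproduced with a toy catalog. This is only a sketch of exact-match, logical-AND search over columns; it does not call intake-ESM or the MDTF preprocessor, whose internals may differ. The `search` helper and the frequency values are assumptions; the variable name and the long_name-derived standard_name come from the thread, and `toa_outgoing_longwave_flux` is the CF standard name associated with CMIP `rlut`.

```python
# Toy catalog rows. The frequency values are assumed placeholders.
catalog = [
    {"variable_id": "FLUT", "frequency": "day",
     "standard_name": "Upwelling_longwave_flux_at_top_of_model"},
    {"variable_id": "FLUT", "frequency": "mon", "standard_name": ""},
]


def search(rows, **query):
    """Return rows whose columns equal every key/value in the query (logical AND)."""
    return [row for row in rows
            if all(row.get(key) == value for key, value in query.items())]


# Matching on variable_id (and frequency) finds the records ...
by_id = search(catalog, variable_id="FLUT")
by_id_and_freq = search(catalog, variable_id="FLUT", frequency="day")

# ... but also requiring the CMIP standard_name matches neither the
# long_name-derived value nor the empty string.
with_cmip_name = search(catalog, variable_id="FLUT",
                        standard_name="toa_outgoing_longwave_flux")

print(len(by_id), len(by_id_and_freq), len(with_cmip_name))
```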
@bitterbark While the variable id and frequency may be sufficient to return the correct data from a small test catalog that only contains the subset of data for a few PODs in the same convention, the standard_name, whether defined in the file metadata or derived from one of the field table entries before/after the catalog is generated, is a minimum requirement for the query. If long_names are equivalent to standard_names in the CESM, meaning that they are defined the same way in each simulation and model version (e.g., not 'toa outgoing longwave radiation flux' in one dataset, and 'outgoing longwave radiation at the top of the atmosphere' in another), then they are viable alternatives for a CESM POD<->CESM data query. Variable IDs in the PODs may or may not match the equivalent model output variables even if the POD and data conventions match, and ids usually differ among the conventions (e.g., the framework doesn't "know" that the CESM FLUT maps to CMIP rlut just because the frequencies match). Furthermore, if you were to query catalogs that include data from multiple simulations using just the variable id and frequency, the ESM datastore returned would be very large and we'd have to rely on information from the field tables to isolate the targeted files during preprocessing instead of taking advantage of the intake-ESM APIs. This additional step would potentially reduce performance, especially if the files in the initial datastore had to be read into xarray to get the information. I understand that generating the CESM catalog has been difficult, but now that we have a better grasp of metadata in the CESM files, we can develop a method for the builders to populate necessary data from other sources. |
@wrongkindofdoctor Yeah, standard_name (long_name) can't be optional :( But this is also the only metadata that required us to open the netcdf files :/, hence the process is no longer completely lightweight when supporting the MDTF use case, especially if the variable names do not conform to a particular mapping. For GFDL, I made use of lookup tables (e.g. https://github.com/NOAA-GFDL/MDTF-diagnostics/blob/main/data/gfdl-cmor-tables/gfdl_to_cmip5_vars.csv) (and this means our variable names are not all standardized either), and it is not a cure-all for all the GFDL simulations, especially the new ones (hence we look at the header, standard_name or long_name, as a fallback). While I agree standard_name is important, I am not convinced (as a user) that the catalog is an efficient and usable spot for the standard_name, though it's logical. Or, at least, like @wrongkindofdoctor says, we need to collaboratively develop a standard way to populate this information in the catalogs. [1] Lookup tables, if they are not changing all the time for GFDL and CESM, are an option; either MDTF or the builder can use them. We have hooks in the catalog builder to use lookup tables already.
Not sure how maintainable option 2 is, if it's not user configurable in the POD settings. The latter does not solve the builders' performance problem or their reliance on the netcdf header, though. And I think the catalog building for CESM (with the catalogbuilder conda package) was not tedious; it could use more generality with contributions but has been functional for the purpose. It was surprisingly straightforward. Feedback @bitterbark? |
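The lookup-then-header fallback described above can be sketched as a small resolver. This is an illustration under stated assumptions, not the catalog builder's actual code: the function name `resolve_standard_name`, the one-entry `lookup` table, and the `header` attribute dictionary are all hypothetical, and a real builder would read the attributes from the netcdf file.

```python
def resolve_standard_name(variable_id, attrs, lookup):
    """Pick a standard_name: lookup table first, then header standard_name, then long_name."""
    if variable_id in lookup:
        return lookup[variable_id]
    for key in ("standard_name", "long_name"):
        value = attrs.get(key, "").strip()
        if value:
            return value
    return ""


# Hypothetical lookup table mapping a native name to the CF name used for CMIP rlut
lookup = {"FLUT": "toa_outgoing_longwave_flux"}

# Hypothetical header attributes for a file that carries only a long_name
header = {"long_name": "Upwelling longwave flux at top of model"}

print(resolve_standard_name("FLUT", header, lookup))
print(resolve_standard_name("SOME_OTHER_VAR", header, lookup))
```

With this ordering, the netcdf header only needs to be opened for variables the lookup table does not cover, which is where the performance cost discussed above comes from.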
Thank you for the discussion so far. Please don't feel obligated to respond until you are back at work after the holidays! I would like to step back for a minute to look at the bigger picture.
For historical context, the previous versions of MDTF found variables by using the
But above, Jess said the 'framework doesn't "know" that the CESM FLUT maps to CMIP rlut'. Since this is exactly what the fieldlist_*jsonc files tell the framework, are these files no longer being used? In the case I'm currently running, the framework is looking for FLUT so somehow the name is being translated. (The failure occurs because it is also looking for standard_name, which is not being translated to CESM long_name). Am I missing some other place where this is happening? In previous versions, the fieldlist was the only place where variable names were translated, and this was sufficient, along with case/run name and frequency, to find data. Whether or not the fieldlist files are still being However, I also understand from the above that there are at least two complications to having a simple name translation table by convention:
Addressing (1), would having the convention as part of the catalog entry be sufficient for the framework to translate either type of name? I realize this query might have to happen in two stages (first convention, then subsetting by translated name). And (2), in CESM, neither the variable name nor the long_name usually changes between model runs, although there are rare changes between major model versions. Would it work to have versioned convention codes? If the variable names change too often to make this feasible, it seems to me that the catalogs would be the right place for the translation, since presumably the person making the catalog would be more familiar with the model run than someone running the framework later. The deeper I get in this, the more I think it makes sense to have the conversions done in the catalog builder, likely by translating to a truly standard standard_name, and then the framework doesn't have to worry about translation at all. Since users have to build catalogs anyway, and we are going to supply the builder, this would suffice to maintain the original goal of the MDTF running on native GFDL and CESM files. Please feel free to educate me as to why this wouldn't work. Thanks again for all your work on this! |
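The two-stage idea (first subset by convention, then match the translated name) might look roughly like the sketch below. Everything here is hypothetical: the `convention` column, the `FIELDLISTS` tables, and `two_stage_query` are stand-ins for illustration and are not the MDTF fieldlist format or an intake-ESM call. The frequency values are assumed.

```python
# Hypothetical per-convention translation tables, in the spirit of the fieldlist files
FIELDLISTS = {
    "CESM": {"toa_outgoing_longwave_flux": "FLUT"},
    "CMIP": {"toa_outgoing_longwave_flux": "rlut"},
}

# Toy catalog with an assumed "convention" column
catalog = [
    {"convention": "CESM", "variable_id": "FLUT", "frequency": "day"},
    {"convention": "CMIP", "variable_id": "rlut", "frequency": "day"},
    {"convention": "CESM", "variable_id": "FLUT", "frequency": "mon"},
]


def two_stage_query(rows, convention, standard_name, frequency):
    """Stage 1: subset by convention. Stage 2: match the translated variable name."""
    stage_one = [row for row in rows if row["convention"] == convention]
    variable_id = FIELDLISTS.get(convention, {}).get(standard_name)
    if variable_id is None:
        return []  # no translation known for this convention
    return [row for row in stage_one
            if row["variable_id"] == variable_id and row["frequency"] == frequency]


print(two_stage_query(catalog, "CESM", "toa_outgoing_longwave_flux", "day"))
```

A versioned convention code (for example a separate table per major model version) would slot in as additional keys of the translation table.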
@bitterbark I have added a custom cesm parser tested with the QBOi.EXP1.AMIP.001 dataset to the mdtf builder as an option. It combines information in the CESM fieldlists with metadata from the files using the ecgtools APIs to create and validate the intake-catalogs. I added missing data entries and defined standard_name attributes using cmip definitions that the mdtf catalog builder uses to populate the catalog. You can update your main branch and try it out by modifying the example template file, setting the convention to 'cesm'. You may need to add variables to the CESM fieldlist--use cmip standard_names if possible, otherwise copy the long_name information to the standard_name. Note that this is not a replacement for the GFDL builder--just an alternative tool that Jacob and I can easily modify for limited testing with the MDTF-diagnostics package while the GFDL builder is refined to meet the requirements for multiple workflows. |
@bitterbark As for where to do the standard_name conversions, the builder itself may be the appropriate place if we create a protocol for standard_names and other required metadata. @aradhakrishnanGFDL and I had a conversation about how to populate missing standard_name information for GFDL data, and I advised using CMIP equivalent names if available, then long_names if not. However, the long_name substitute isn't ideal for translation. Ideally, we would work with each modeling group to define and document official standard_names (and other metadata) for model output variables that need it, but that is not a quick task, as I anticipate needing science board approval to invest the resources at GFDL. Alternatively, our workflow team, you, and other interested CESM devs/users involved with the MDTF can define whatever metadata we need for the translations, with an eye toward updating the CVS/field tables and catalogs if/when model developers create the metadata themselves. |
Bug Severity
Describe the bug
I have made a catalog of CESM data using a modification of Aparna's CatalogBuilder
but the MDTF gives an error when I try to use it that seems to show variable names are not being translated. I don't know if this is supposed to happen in the catalog building process or the preprocessor.
Full output
Although it says no found in the json file, is in the file, which is CESM's rlut variable
Steps To Reproduce
I'm attaching the data catalog specification (json) b.e23_alpha16g.BLT1850.ne30_t232.082b.json
and catalog (csv) b.e23_alpha16g.BLT1850.ne30_t232.082b.csv
the input to the MDTF input_test.yml.txt
Environment
Describe the system environment:
Updated just now to latest on MDTF 5b0d16c
Log information and/or terminal output
(see above)
Thanks for any ideas!