Error using CESM variable names with data catalog #719

Open
1 task done
bitterbark opened this issue Dec 10, 2024 · 15 comments

@bitterbark
Collaborator

Bug Severity

  • 2 = Major problem that affects overall functionality, but that does not occur for all users (e.g., problems installing the framework with a specific Conda version, a framework option that causes one or more PODs to fail, or missing/incompatible Python modules).

Describe the bug
I have made a catalog of CESM data using a modification of Aparna's CatalogBuilder
but the MDTF gives an error when I try to use it that seems to show variable names are not being translated. I don't know if this is supposed to happen in the catalog building process or the preprocessor.

File "/glade/u/home/bundy/mdtf/MDTF_current/MDTF-diagnostics.main/src/preprocessor.py", line 1080, in query_catalog
raise util.DataRequestError(
src.util.exceptions.DataRequestError: Unable to find match or alternate for rlut for case b.e23_alpha16g.BLT1850.ne30_t232.082b in /glade/u/home/bundy/mdtf/catalogs/b.e23_alpha16g.BLT1850.ne30_t232.082b.json

Full output
Although it says no match was found in the json file, FLUT is in the file, which is CESM's rlut variable.

Steps To Reproduce
I'm attaching the data catalog specification (json) b.e23_alpha16g.BLT1850.ne30_t232.082b.json
and catalog (csv) b.e23_alpha16g.BLT1850.ne30_t232.082b.csv
the input to the MDTF input_test.yml.txt

Environment
Describe the system environment:

  • branch name and link: [e.g., develop, https://github.com/NOAA-GFDL/MDTF-diagnostics/tree/develop]
    Updated just now to latest on MDTF 5b0d16c

Log information and/or terminal output
(see above)

Thanks for any ideas!

@wrongkindofdoctor wrongkindofdoctor added the data catalogs Issues related to intake esm data catalogs label Dec 11, 2024
@aradhakrishnanGFDL
Collaborator

Hi @bitterbark in my example CESM catalog generation, the standard_name was filled out. Your csv does not have it. Can you double check your config yaml header list and the call to the catalog builder if it looks like what I passed along to you?

Ref NOAA-GFDL/CatalogBuilder#89

@wrongkindofdoctor, could the missing standard_name be the root cause of the MDTF errors @bitterbark is seeing?

@bitterbark
Collaborator Author

It makes sense that missing standard_name would cause the problem. However, I'm not sure where it would get set.

I see standard_name in the config yaml header list

headerlist: ["activity_id", "institution_id", "source_id", "experiment_id",
"frequency", "realm", "table_id",
"member_id", "grid_label", "variable_id",
"time_range", "chunk_freq","grid_label","platform","dimensions","cell_methods","standard_name","path"]

As for the call to the catalog builder, do you mean this?

csv, json = gen_intake_gfdl.create_catalog(input_path=input_path,output_path=output_path,config=configyaml)
return(csv,json)

I haven't changed anything other than in catalogbuilder/intakebuilder/dat/gfdlcmipfreq.yaml

+h2:

  • frequency: day

And using the cesm-template.yaml file you sent.
I'm not sure where else to look.

@aradhakrishnanGFDL
Collaborator

@bitterbark the slow=True option is missing in your call. This is important for the builder to look for the standard name or long name in the header of the netcdf.

csv, json = gen_intake_gfdl.create_catalog(input_path=input_path,output_path=output_path,verbose=True,config=configyaml,slow=True)
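
To illustrate what the header lookup amounts to, here is a minimal sketch of the fallback described in this thread: prefer the variable's standard_name attribute, otherwise use its long_name with spaces replaced by underscores. The function name `derive_standard_name` and the spaced form of the long_name are assumptions for illustration only; this is not the catalog builder's actual code.

```python
def derive_standard_name(attrs):
    """Pick a catalog standard_name from a variable's netCDF attributes.

    Hypothetical helper: prefers standard_name, otherwise falls back to
    long_name with whitespace replaced by underscores. Returns "" if
    neither attribute is present.
    """
    standard_name = attrs.get("standard_name", "").strip()
    if standard_name:
        return standard_name
    long_name = attrs.get("long_name", "").strip()
    return "_".join(long_name.split())


# Assumed attributes for the CESM FLUT variable (no standard_name set).
flut_attrs = {"long_name": "Upwelling longwave flux at top of model"}
print(derive_standard_name(flut_attrs))
```

With these assumed attributes the result is the long_name-derived string seen in the catalog row later in this thread, not the CMIP standard_name.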

@bitterbark
Collaborator Author

Thanks, looks like we're a step closer. The csv file now has the standard_name in it:

,,cam,b.e23_alpha16g.BLT1850.ne30_t232.082b,mon,atmos,,,,FLUT,002401-002912,,,,,ts,Upwelling_longwave_flux_at_top_of_model,/glade/campaign/cgd/amp/bundy/b.e23_alpha16g.BLT1850.ne30_t232.082b/20240314bl/ts/b.e23_alpha16g.BLT1850.ne30_t232.082b.cam.h0.FLUT.002401-002912.nc

Unfortunately the MDTF error remains the same. Is the standard_name supposed to be enough?

@bitterbark
Collaborator Author

One aspect of the problem: src/preprocessor.py:query_catalog()
seems to expect a different standard name than is being provided in the CESM file

rlut standard_name='toa_outgoing_longwave_flux'

Vs. the catalog is reading the long_name from the variable and assigning standard name in the catalog based on that:

Upwelling_longwave_flux_at_top_of_model

Is there something in place here that replaces the old variable-name translation files, where we would have just specified which variable is which?

@bitterbark
Collaborator Author

The name translation seems to be happening okay:
L1019: rlut standard_name='toa_outgoing_longwave_flux' var.translation.convention='CESM'
L1023: after translation FLUT standard_name='toa_outgoing_longwave_flux'

@bitterbark
Collaborator Author

bitterbark commented Dec 17, 2024

The catalog looks like it loaded fine, this is it printed as the MDTF runs.
I'm still trying to figure out why the search doesn't return anything.

Update: if I query the catalog with just {"variable_id": "FLUT"} it matches the two pertinent records.
It looks like the mismatch in the standard names might be causing the problem, that's what I see to be different.

case_d.query={
'frequency': 'day',
'path': [re.compile('(b.e23_alpha16g.BLT1850.ne30_t232.082b)')],
'standard_name': 'toa_outgoing_longwave_flux',
'realm': 'atmos*',
'variable_id': 'FLUT'}

Whereas if you look at the catalog (html link at top of this post), the standard_name in the catalog is Upwelling_longwave_flux_at_top_of_model (from the long_name in the file)

However, if I take standard_name out of the query, it still doesn't match.

@aradhakrishnanGFDL
Collaborator

@bitterbark troubleshooting seems to be headed in the right direction. The catalog builder likely uses the long name because there was no standard name in the header. Does that sound about right?

The other minor thing is that I see a grid_label.1 as a column. It is likely coming from the catalog builder yaml configuration file, where grid_label is specified twice in the header list. Just remove one of the occurrences.
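
A quick way to check a header list for repeats before building the catalog is sketched below. The helper names are hypothetical and not part of the catalog builder; the list is the headerlist quoted earlier in this thread.

```python
from collections import Counter


def find_duplicate_headers(headerlist):
    """Return the header names that appear more than once, in first-seen order."""
    counts = Counter(headerlist)
    return [name for name, n in counts.items() if n > 1]


def dedupe_headers(headerlist):
    """Drop repeated header names while keeping the original order."""
    return list(dict.fromkeys(headerlist))


headerlist = ["activity_id", "institution_id", "source_id", "experiment_id",
              "frequency", "realm", "table_id",
              "member_id", "grid_label", "variable_id",
              "time_range", "chunk_freq", "grid_label", "platform", "dimensions",
              "cell_methods", "standard_name", "path"]

print(find_duplicate_headers(headerlist))
print(len(headerlist), len(dedupe_headers(headerlist)))
```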

@bitterbark
Collaborator Author

It looks like the standard_name is causing the problem. The catalog builder for CESM files only has access to the long_name, which is not the same as the GFDL/CMIP standard_names. So when the MDTF searches the catalog, despite finding a match for the variable name thanks to the translation files, it fails because it requires that both the variable name and the standard_name match. Leaving off the standard_name still causes a failure because an empty string does not match the expected standard_name, either.

I understand there is a desire for everything in the catalog to be perfect and complete, but in this case the variable_id is already present and matches what MDTF knows from the variable name translation files. So having to match the standard_name as well as the variable_name is redundant. The point of having the standard_name, I believe, is to deal with missing variable_ids, so it should not be necessary here. It may seem desirable from MDTF's perspective to put the onus on the catalog building, but in order to make a catalog with a standard_name that matches the MDTF/CMIP requirements, CESM users are going to need a separate translation tool/lookup table for standard_name, adding duplication and complexity that seem contrary to the design of the MDTF to handle variable names from different models.

If we want to go that route, it makes sense to put all of the variable translation methods in the catalog builder and just say the MDTF only runs on CMIP variable names. At least that way we won't have two different kinds of information required in two different places.

Alternatively, and potentially easier, could we keep the variable translation as is and query the catalog with the minimum necessary information: the variable_id & frequency (Adding case name/path if we are supporting a catalog for multiple run names).
Then if no files are found, we could use the standard_name as a backup method, like the alternate names are currently used.
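
A rough sketch of that two-step lookup follows. The `search` function is a simplified stand-in that requires every queried column to match exactly (the real intake-esm search also supports wildcards and regular expressions), and the catalog rows, paths, and the `olr_native` variable name are made up for illustration.

```python
def search(rows, **query):
    """Return rows where every queried column equals the requested value."""
    return [row for row in rows
            if all(row.get(key) == value for key, value in query.items())]


def find_variable(rows, variable_id, frequency, standard_name):
    """Query by variable_id and frequency first; use standard_name only as a backup."""
    hits = search(rows, variable_id=variable_id, frequency=frequency)
    if hits:
        return hits
    return search(rows, standard_name=standard_name, frequency=frequency)


# Hypothetical two-row catalog; paths are placeholders.
catalog = [
    {"variable_id": "FLUT", "frequency": "day",
     "standard_name": "Upwelling_longwave_flux_at_top_of_model",
     "path": "/hypothetical/FLUT.day.nc"},
    {"variable_id": "olr_native", "frequency": "day",
     "standard_name": "toa_outgoing_longwave_flux",
     "path": "/hypothetical/olr.day.nc"},
]

# Requiring all three fields at once finds nothing for FLUT ...
strict = search(catalog, variable_id="FLUT", frequency="day",
                standard_name="toa_outgoing_longwave_flux")
# ... while the minimal query succeeds, and standard_name still rescues
# a variable whose id is not in the catalog.
by_id = find_variable(catalog, "FLUT", "day", "toa_outgoing_longwave_flux")
by_name = find_variable(catalog, "rlut", "day", "toa_outgoing_longwave_flux")
print(len(strict), len(by_id), len(by_name))
```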

If there is agreement on this, I can put it in.

Thanks for your opinions!

Here is the testing process to prove this is the problem:
in src/preprocessor.py, the query of the catalog looks like this:

case_d.query={'frequency': 'day', 'path': [re.compile('(b.e23_alpha16g.BLT1850.ne30_t232.082b)')], 'standard_name': 'toa_outgoing_longwave_flux', 'realm': 'atmos*', 'variable_id': 'FLUT'}
<esm_catalog_ESM4 catalog with 0 dataset(s) from 0 asset(s)>

Testing a few simpler queries found that it works with just variable_id and frequency:

Results from query temp_query={'variable_id': 'FLUT', 'frequency': 'day'}
<esm_catalog_ESM4 catalog with 1 dataset(s) from 1 asset(s)>

But adding the standard_name that MDTF is asking for (which doesn't match the long_name available in the CESM files) goes back to 0 results:

Results from query temp_query={'variable_id': 'FLUT', 'frequency': 'day', 'standard_name': 'toa_outgoing_longwave_flux'}
<esm_catalog_ESM4 catalog with 0 dataset(s) from 0 asset(s)>

Note that leaving off the standard_name still causes a failure (I'm assuming the catalog search is not matching the empty string).

Results from query temp_query={'variable_id': 'FLUT', 'frequency': 'day', 'standard_name': ''}
<esm_catalog_ESM4 catalog with 0 dataset(s) from 0 asset(s)>

@wrongkindofdoctor
Collaborator

wrongkindofdoctor commented Dec 20, 2024

@bitterbark While the variable id and frequency may be sufficient to return the correct data from a small test catalog that only contains the subset of data for a few PODs in the same convention, the standard_name, whether defined in the file metadata or derived from one of the field table entries before/after the catalog is generated, is a minimum requirement for the query. If long_names are equivalent to standard_names in the CESM, meaning that they are defined the same way in each simulation and model version (e.g., not 'toa outgoing longwave radiation flux' in one dataset, and 'outgoing longwave radiation at the top of the atmosphere' in another), then they are viable alternatives for a CESM POD<->CESM data query.

Variable IDs in the PODs may or may not match the equivalent model output variables even if the POD and data conventions match, and ids usually differ among the conventions (e.g., framework doesn't "know" that the CESM FLUT maps to CMIP rlut just because the frequencies match).

Furthermore, if you were to query catalogs that include data from multiple simulations using just the variable id and frequency, the ESM datastore returned would be very large and we'd have to rely on information from the field tables to isolate the targeted files during preprocessing instead of taking advantage of the intake-ESM APIs. This additional step would potentially reduce performance, especially if the files in the initial datastore had to be read into xarray to get the information.

I understand that generating the CESM catalog has been difficult, but now that we have a better grasp of metadata in the CESM files, we can develop a method for the builders to populate necessary data from other sources.

@aradhakrishnanGFDL
Collaborator

@wrongkindofdoctor Yeah, standard_name (long_name) can't be optional :( But this is also the only metadata that required us to open the netcdf files :/, hence the process is no longer completely light-weight to support MDTF use-case, especially if the variable names do not conform to a particular mapping.

For GFDL, I made use of look up tables (e.g. https://github.com/NOAA-GFDL/MDTF-diagnostics/blob/main/data/gfdl-cmor-tables/gfdl_to_cmip5_vars.csv) (and this means our variable names are not all standardized too), and it is not a cure-all for all the GFDL simulations (hence we look at the header - standard_name or long_name as fall back), especially the new ones. While I agree standard_name is important, I am not convinced (as a user) that the catalog is an efficient and usable spot for the standard_name, though it's logical. Or, at least like @wrongkindofdoctor says, we need to collaboratively develop a standard way to populate this information in the catalogs.

[1] Lookup tables are an option if they are not changing all the time for GFDL and CESM; either MDTF or the builder can use them. We have hooks in catalog builder to use lookup tables already.
[2] Stick to standard name (fall back: long name space->_) in header and ensure MDTF recognizes this for GFDL and CESM simulations.

  • assuming the long_name in the CESM data for a particular POD variable does not evolve.

Not sure how maintainable option 2 is, if it's not user configurable in the POD settings.

The latter does not solve the performance of builders and relying on the header of the netcdf though.
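
As a rough illustration of how options 1 and 2 could be combined, the sketch below consults a lookup table first and only falls back to the name taken from the file header when the variable is not listed. The two-column CSV layout and the function names are assumptions; they do not reflect the actual layout of gfdl_to_cmip5_vars.csv or the catalog builder's hooks.

```python
import csv
import io

# Hypothetical lookup table: native variable name -> CMIP standard_name.
LOOKUP_CSV = """variable_id,standard_name
FLUT,toa_outgoing_longwave_flux
"""


def load_lookup(text):
    """Parse the two-column CSV text into a {variable_id: standard_name} dict."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["variable_id"]: row["standard_name"] for row in reader}


def standard_name_for(variable_id, lookup, header_name=""):
    """Option 1 (lookup table) first, then option 2 (name from the file header)."""
    return lookup.get(variable_id) or header_name


lookup = load_lookup(LOOKUP_CSV)
print(standard_name_for("FLUT", lookup, "Upwelling_longwave_flux_at_top_of_model"))
print(standard_name_for("NOT_LISTED", lookup, "Some_long_name"))
```

Only variables missing from the table would still require opening the netcdf header, which is the performance concern raised above.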

And.. I think the catalog building for CESM (with the catalogbuilder conda package) was not tedious, it could use more generality with contributions but has been functional for the purpose. It was straightforward surprisingly. Feedback @bitterbark?

@bitterbark
Collaborator Author

Thank you for the discussion so far. Please don't feel obligated to respond until you are back at work after the holidays!

I would like to step back for a minute to look at the bigger picture.
These are the ideals I think we should be working toward:

  • have one place where all types of variable name conversion happen, either in the framework or in the catalog builder but not both
  • the framework should be able to find data from a catalog given a minimum set of required data,
    using extra information only if it fails with that

For historical context, the previous versions of MDTF found variables by using the
fieldlist files that I have referred to as lookup or conversion tables, eg: MDTF-diagnostics.main/data/fieldlist_CESM.jsonc
It looks to me like these are still being used in MDTFv3:

src/translation.py:
class VariableTranslator(metaclass=util.Singleton):
... naming conventions... are defined in the data/fieldlist_*.jsonc

But above, Jess said the 'framework doesn't "know" that the CESM FLUT maps to CMIP rlut'. Since this is exactly what the fieldlist_*.jsonc files tell the framework, are these files no longer being used? In the case I'm currently running, the framework is looking for FLUT, so somehow the name is being translated. (The failure occurs because it is also looking for standard_name, which is not being translated to the CESM long_name.) Am I missing some other place where this is happening?

In previous versions, the fieldlist was the only place where variable names were translated, and this was sufficient, along with case/run name and frequency, to find data. Whether or not the fieldlist files are still being used, I understand you are saying that variable name translation is no longer sufficient. Now it sounds like standard_name is preferred. So that would make me think we should just be translating standard_name instead of variable name in any framework files.

However, I also understand from the above that there are at least two complications to having a simple name translation table by convention:

  1. in a catalog with data from more than one convention, a simple translation, whether variable name or standard_name, is not enough
  2. the variable names and/or standard_names might change between different model runs within the same convention

Addressing (1), would having the convention as part of the catalog entry be sufficient for the framework to translate either type of name? I realize this query might have to happen in two stages (first convention, then subsetting by translated name).
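
To make the two-stage idea concrete, here is a minimal sketch. It assumes a `convention` column in the catalog, which does not exist in the catalogs discussed here, and a tiny per-convention translation table standing in for the fieldlist files; all names and rows are hypothetical.

```python
# Stand-in for the per-convention fieldlist translations (POD name -> native name).
TRANSLATIONS = {
    "CESM": {"rlut": "FLUT"},
    "CMIP": {"rlut": "rlut"},
}


def two_stage_query(rows, pod_variable, frequency):
    """Stage 1: find the conventions present. Stage 2: subset by translated name."""
    matches = []
    conventions = sorted({row["convention"] for row in rows})
    for convention in conventions:
        native_name = TRANSLATIONS.get(convention, {}).get(pod_variable)
        if native_name is None:
            continue  # no translation known for this convention
        matches.extend(
            row for row in rows
            if row["convention"] == convention
            and row["variable_id"] == native_name
            and row["frequency"] == frequency
        )
    return matches


# Hypothetical mixed-convention catalog.
catalog = [
    {"convention": "CESM", "variable_id": "FLUT", "frequency": "day"},
    {"convention": "CMIP", "variable_id": "rlut", "frequency": "day"},
    {"convention": "CESM", "variable_id": "FLUT", "frequency": "mon"},
]

results = two_stage_query(catalog, "rlut", "day")
print([(row["convention"], row["variable_id"]) for row in results])
```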

And (2), in CESM, neither the variable name nor the long_name usually changes between model runs, although there are rare changes between major model versions. Would it work to have versioned convention codes? If the variable names change too often to make this feasible, it seems to me that the catalogs would be the right place for the translation, since presumably the person making the catalog would be more familiar with the model run than someone running the framework later.

The deeper I get in this, the more I think it makes sense to have the conversions done in the catalog builder, likely by translating to a truly standard standard_name, and then the framework doesn't have to worry about translation at all. Since users have to build catalogs anyway, and we are going to supply the builder, this would suffice to maintain the original goal of the MDTF running on native GFDL and CESM files. Please feel free to educate me as to why this wouldn't work.

Thanks again for all your work on this!

@wrongkindofdoctor
Collaborator

wrongkindofdoctor commented Dec 27, 2024

@bitterbark I have added a custom cesm parser tested with the QBOi.EXP1.AMIP.001 dataset to the mdtf builder as an option. It combines information in the CESM fieldlists with metadata from the files using the ecgtools APIs to create and validate the intake-catalogs. I added missing data entries and defined standard_name attributes using cmip definitions that the mdtf catalog builder uses to populate the catalog. You can update your main branch and try it out by modifying the example template file, setting the convention to 'cesm'. You may need to add variables to the CESM fieldlist: use cmip standard_names if possible, otherwise copy the long_name information to the standard_name.

Note that this is not a replacement for the GFDL builder--just an alternative tool that Jacob and I can easily modify for limited testing with the MDTF-diagnostics package while the GFDL builder is refined to meet the requirements for multiple workflows.

@wrongkindofdoctor
Collaborator

@bitterbark As for where to do the standard_name conversions, the builder itself may be the appropriate place if we create a protocol for standard_names and other required metadata. @aradhakrishnanGFDL and I had a conversation about how to populate missing standard_name information for GFDL data, and I advised using CMIP equivalent names if available, then long_names if not. However, the long_name substitute isn't ideal for translation. Ideally, we would work with each modeling group to define and document official standard_names (and other metadata) for model output variables that need it, but that is not a quick task, as I anticipate needing science board approval to invest the resources at GFDL.

Alternatively, our workflow team, you, and other interested CESM devs/users involved with the MDTF can define whatever metadata we need for the translations, with an eye toward updating the CVS/field tables and catalogs if/when model developers create the metadata themselves.
