How can I get variables from different tables with 'require_all_on' #425

mickaellalande · 2020-07-23T12:32:15Z

mickaellalande
Jul 23, 2020

I would like to select only a few variables from 2 different tables (Amon and LImon), for example, tas, snc and pr and get all the models that have these 3 variables for these 2 tables. But when I do that I get 0 results, and I need to remove the table_id (see code bellow), but in that case, I get the variables for all different tables that I don't need.

query = dict(
    experiment_id=['historical'],
    variable_id=['tas', 'snc', 'pr'],
    table_id=['Amon', 'LImon'] # <-- problem
)

col_subset = col.search(require_all_on=["source_id"], **query)

col_subset.df.groupby("source_id")[
    ["experiment_id", "variable_id", "table_id", "member_id"]
].nunique()

Is there any way to do this?

Thanks in advance for your answers!

andersy005 · 2020-08-09T00:53:06Z

andersy005
Aug 9, 2020
Maintainer

@mickaellalande, I think the results you are getting are consistent with how require_all_on feature works. With your query in conjunction with require_all_on , intake-esm generates this combination of requirements:

{('historical', 'pr', 'Amon'),
 ('historical', 'pr', 'LImon'),
 ('historical', 'snc', 'Amon'),
 ('historical', 'snc', 'LImon'),
 ('historical', 'tas', 'Amon'),
 ('historical', 'tas', 'LImon')}

So, if a source_id doesn't have all six values in the above combination, it doesn't get included in the returned results.

To reproduce this issue, I trimmed your query to something small:

In [92]: query                                                                                                              
Out[92]: 
{'experiment_id': ['historical'],
 'variable_id': ['tas', 'snc', 'pr'],
 'table_id': ['Amon', 'LImon'],
 'source_id': ['BCC-CSM2-MR', 'SAM0-UNICON', 'NorESM2-LM']}
In [93]: cat = col.search(**query)

This query gives us this

In [112]: cat.df.groupby("source_id")["table_id"].unique()                                                                  
Out[112]: 
source_id
BCC-CSM2-MR    [Amon, LImon]
NorESM2-LM            [Amon]
SAM0-UNICON    [Amon, LImon]
Name: table_id, dtype: object

In [113]: cat.df.groupby("source_id")["variable_id"].unique()                                                               
Out[113]: 
source_id
BCC-CSM2-MR    [pr, tas, snc]
NorESM2-LM          [pr, tas]
SAM0-UNICON    [pr, tas, snc]
Name: variable_id, dtype: object

In [130]: cat.df.groupby("source_id")[ 
     ...:     ["experiment_id", "variable_id", "table_id", "member_id"] 
     ...: ].nunique()                                                                                                       
Out[130]: 
             experiment_id  variable_id  table_id  member_id
source_id                                                   
BCC-CSM2-MR              1            3         2          3
NorESM2-LM               1            2         1          3
SAM0-UNICON              1            3         2          1

Using the results from cell 130, one may be tempted to assume that with require_all_on="source_id, we should get entries from "BCC-CSM2-MR" and "SAM0-UNICON".

However, on close inspection, we can easily confirm that "BCC-CSM2-MR" and "SAM0-UNICON" don't meet all the requirements generated by intake-esm:

{('historical', 'pr', 'Amon'),
 ('historical', 'pr', 'LImon'),
 ('historical', 'snc', 'Amon'),
 ('historical', 'snc', 'LImon'),
 ('historical', 'tas', 'Amon'),
 ('historical', 'tas', 'LImon')}

For instance, "BCC-CSM2-MR" does not have assets with the following attributes ('historical', 'snc', 'Amon')

In [128]: Counter(cat.df[cat.df.source_id == "BCC-CSM2-MR"][["experiment_id", "variable_id", "table_id"]].set_index(["experi
     ...: ment_id", "variable_id", "table_id"]).index)                                                                      
Out[128]: 
Counter({('historical', 'pr', 'Amon'): 3,
         ('historical', 'tas', 'Amon'): 3,
         ('historical', 'snc', 'LImon'): 3})

In [129]: Counter(cat.df[cat.df.source_id ==  "SAM0-UNICON"][["experiment_id", "variable_id", "table_id"]].set_index(["exper
     ...: iment_id", "variable_id", "table_id"]).index)                                                                     
Out[129]: 
Counter({('historical', 'pr', 'Amon'): 1,
         ('historical', 'tas', 'Amon'): 1,
         ('historical', 'snc', 'LImon'): 1})

So, it looks like intake-esm is doing the right thing.

Is there any way to do this?

I am still looking into this, and will get back to you once I have a working solution

0 replies

mickaellalande · 2020-08-10T08:18:09Z

mickaellalande
Aug 10, 2020
Author

Thanks for your answer @andersy005, that's what I more or less understood about the query... Thanks for thinking about a solution for that! Some tables should be excluded for some variables I guess? Knowing that ('historical', 'snc', 'Amon') or ('historical', 'pr', 'LImon') doesn't make sense (snc is not an atmospheric variable and precipitation not terrestrial cryosphere ).

Maybe adding a frequency key would solve the problem (ex: frequency='monthly')? It makes me think that it looks like there is no period key also, that would be nice to just check if a period is available or select only a subperiod.

Just as a reminder, here are all the CMIP6 tables: http://clipc-services.ceda.ac.uk/dreq/index/miptable.html, and when we click on one table we get the variables associated.

0 replies

andersy005 · 2020-08-10T17:23:19Z

andersy005
Aug 10, 2020
Maintainer

Some tables should be excluded for some variables I guess? Knowing that ('historical', 'snc', 'Amon') or ('historical', 'pr', 'LImon') doesn't make sense (snc is not an atmospheric variable and precipitation not terrestrial cryosphere ).

Thanks for this explanation! I now understand why intake-esm was returning empty queries. Because intake-esm doesn't implement any special logic for CMIP6 and it's not aware of the fact that snc is a variable exclusively present in the LImon table and pr, tas are exclusively present in the Amon, the user needs to perform two separate queries and concatenate the returned dataframes into a single dataframe:

In [77]: q1 = dict(experiment_id=['historical'], variable_id=['tas', 'pr'], table_id='Amon')

In [78]: col_subset = col.search(**q1, require_all_on='source_id')

In [79]: col_subset
Out[79]: <pangeo-cmip6 catalog with 53 dataset(s) from 1054 asset(s)>

In [80]: q2 = dict(experiment_id=['historical'], variable_id=['snc'], table_id='LImon')

In [81]: col_subset_2 = col.search(**q2, require_all_on='source_id')

In [82]: col_subset_2
Out[82]: <pangeo-cmip6 catalog with 18 dataset(s) from 148 asset(s)>

In [83]: import pandas as pd

In [84]: col_subset.df = pd.concat([col_subset.df, col_subset_2.df])

In [85]: col_subset
Out[85]: <pangeo-cmip6 catalog with 71 dataset(s) from 1202 asset(s)>

In [87]: col_subset.unique(['variable_id', 'table_id'])
Out[87]:
{'variable_id': {'count': 3, 'values': ['pr', 'snc', 'tas']},
 'table_id': {'count': 2, 'values': ['Amon', 'LImon']}}

Does this accomplish what you want?

0 replies

mickaellalande · 2020-08-11T09:24:24Z

mickaellalande
Aug 11, 2020
Author

Humm I don't really understand the difference between this and my first post without require_all_on... because we still have the model NorESM2-LM where snc is not available. Maybe I missed something?

I actually worked around and find a potential solution:

import intake

url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(url)
col

# Same query as before
query = dict(
    experiment_id=['historical'],
    variable_id=['tas', 'snc', 'pr'],
    table_id=['Amon', 'LImon'],
    source_id=['BCC-CSM2-MR', 'SAM0-UNICON', 'NorESM2-LM']
)

# Remove require_all_on=["source_id"]
cat = col.search(**query)

# Here we still have the 3 models (but missing snc for NorESM2-LM)

# Filter the models that only have the 3 variables 
# https://stackoverflow.com/questions/48390150/pandas-select-rows-after-groupby
cat.df = cat.df.groupby("source_id").filter(
    lambda x : pd.Series(['pr', 'snc', 'tas']).isin(x['variable_id']).all()
)

# We finally get only the models with the 3 variables
cat.df.groupby("source_id")[
    ["experiment_id", "variable_id", "table_id", "member_id"]
].nunique()

dset_dict = cat.to_dataset_dict(
    zarr_kwargs={"consolidated": True, "decode_times": True}
)

ds = dset_dict[cat.keys()[0]]
ds

It's just a bit sad not to have only one dataset with the 3 variables, here there are still separated dset_dict[cat.keys()[0]] has tas and pr and dset_dict[cat.keys()[1]] has snc for the first model, etc.

But that's what I was looking for! I didn't know we can easily manipulate the queries outputs with usual pandas command (and I'm not too familiar with pandas)! It would be nice to have something that would do this automatically under the hood when we require multiple variables from different tables with require_all_on=["source_id"] activated with an option to say that we want all variables. I was actually intuitively trying require_all_on=["variable_id"] but it didn't return anything too.

0 replies

andersy005 · 2020-08-11T17:23:41Z

andersy005
Aug 11, 2020
Maintainer

Thank you for the updates, @mickaellalande!

It would be nice to have something that would do this automatically under the hood when we require multiple variables from different tables with require_all_on=["source_id"] activated with an option to say that we want all variables.

If I understand this correctly, you are suggesting that intake-esm should be able to make connections between source_id and variable_id in this case, right?

In the light of this example you provided, I am going to revisit the logic used when require_all_on is enabled. The existing implementation may have some gaps, and if I find any bugs, I will fix them.

0 replies

wachsylon · 2021-10-20T16:10:13Z

wachsylon
Oct 20, 2021

This would be a nice enhancement for all CMIP-like projects where only the combination of variable_id and table_id is unique.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How can I get variables from different tables with 'require_all_on' #425

{{title}}

Replies: 6 comments

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

How can I get variables from different tables with 'require_all_on' #425

mickaellalande Jul 23, 2020

Replies: 6 comments

andersy005 Aug 9, 2020 Maintainer

mickaellalande Aug 10, 2020 Author

andersy005 Aug 10, 2020 Maintainer

mickaellalande Aug 11, 2020 Author

andersy005 Aug 11, 2020 Maintainer

wachsylon Oct 20, 2021

mickaellalande
Jul 23, 2020

andersy005
Aug 9, 2020
Maintainer

mickaellalande
Aug 10, 2020
Author

andersy005
Aug 10, 2020
Maintainer

mickaellalande
Aug 11, 2020
Author

andersy005
Aug 11, 2020
Maintainer

wachsylon
Oct 20, 2021