How can we make intake-esm more transparent? #531

rabernat · 2019-10-18T14:14:32Z

rabernat
Oct 18, 2019

I'm sitting with @naomi-henderson, and we are discussing how we might make intake-esm more transparent about what it's doing under the hood.

It would be nice if there were a mode where, rather than running all the merge operations, intake-esm returns a nested dictionary similar to the one I showed in my recursive merge demo:

{'X': {'A': {'1': ds1, '2': ds2}, 'B': {'1': ds3, '2': ds4}},
 'Y': {'A': {'1': ds5, '2': ds6}, 'B': {'1': ds7, '2': ds8}}}

This would allow users to manually descend into the individual datasets and examine them one at a time, optionally applying their own merge logic.

This should be relatively easy, since intake-esm probably has an internal data structure like this already.
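
To make the "descend and examine" idea concrete, here is a minimal sketch of walking such a nested dictionary. It is only an illustration: `iter_leaves` is a hypothetical helper, not part of intake-esm, and plain strings stand in for the datasets ds1 ... ds8.

```python
from typing import Any, Iterator, Tuple


def iter_leaves(tree: dict, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Yield (path, leaf) pairs from a nested dictionary, depth first."""
    for key, value in tree.items():
        if isinstance(value, dict):
            # Still an intermediate level: keep descending.
            yield from iter_leaves(value, path + (key,))
        else:
            # Anything that is not a dict is treated as a dataset.
            yield path + (key,), value


# Strings stand in for the xarray datasets ds1 ... ds8.
nested = {
    "X": {"A": {"1": "ds1", "2": "ds2"}, "B": {"1": "ds3", "2": "ds4"}},
    "Y": {"A": {"1": "ds5", "2": "ds6"}, "B": {"1": "ds7", "2": "ds8"}},
}

leaves = dict(iter_leaves(nested))
print(leaves[("Y", "B", "1")])  # ds7
```

Because the check is `isinstance(value, dict)`, a real `xarray.Dataset` (a mapping, but not a `dict` subclass) would be treated as a leaf rather than descended into.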

matt-long · 2019-10-18T14:47:15Z

matt-long
Oct 18, 2019

It should be relatively easy to return the nested dictionary.

A couple of other ideas: an aggregate=False option, which would return each of the individual datasets, and a get_keys() method that would return just the keys that are built by the to_dataset_dict method.

rabernat · 2019-10-18T14:49:31Z

rabernat
Oct 18, 2019
Author

enabling an aggregate=False option

👍

rabernat · 2019-10-18T14:52:05Z

rabernat
Oct 18, 2019
Author

an aggregate=False option

More thoughts: how would this work? What would the keys be? Would it just group by all columns?

matt-long · 2019-10-18T14:55:16Z

matt-long
Oct 18, 2019

It would return a dataset for each row in the database. We could form keys from the groupby applied to all columns, but maybe it would be more accessible if the key was just the index. What do you think?

rabernat · 2019-10-18T16:07:06Z

rabernat
Oct 18, 2019
Author

What would intake-esm currently do if there were no aggregation_control entry in the collection description?

rabernat · 2019-10-18T16:08:56Z

rabernat
Oct 18, 2019
Author

Answer:

It raises KeyError: 'aggregation_control'

That is NOT the right behavior. Aggregation should be totally 100% optional in these catalogs.
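
One way to make aggregation optional is to treat a missing entry as "no aggregation" instead of indexing the key directly. The sketch below is an assumption-laden illustration, not intake-esm's actual code: `grouping_columns` is a hypothetical helper, the collection description is modelled as a plain dict, and the `groupby_attrs` field name is assumed for the example.

```python
def grouping_columns(description, all_columns):
    """Return the columns to group by, tolerating a missing aggregation_control."""
    # .get() returns None instead of raising KeyError when the entry is absent.
    control = description.get("aggregation_control")
    if not control:
        # No aggregation configured: fall back to grouping over every column.
        return list(all_columns)
    # "groupby_attrs" is an assumed field name used only for this sketch.
    return list(control.get("groupby_attrs", all_columns))


columns = ["source_id", "experiment_id", "variable_id", "zstore"]

no_control = grouping_columns({}, columns)
with_control = grouping_columns(
    {"aggregation_control": {"groupby_attrs": ["source_id"]}}, columns
)
print(no_control)
print(with_control)
```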

matt-long · 2019-10-18T16:17:52Z

matt-long
Oct 18, 2019

Agreed, that's a bug, but easy to fix. Without aggregation_control the code forms groups over all columns:

groups = self.df.groupby(self.df.columns.tolist())

and the returned keys will be of the same format. We can trigger the same behavior if aggregate=False.
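
To illustrate what grouping over all columns produces, here is a small sketch on a toy catalog. The column names follow the thread, but the DataFrame and the `gs://bucket/...` paths are made up for the example.

```python
import pandas as pd

# Toy catalog: two rows, one per model. Paths are hypothetical.
df = pd.DataFrame(
    {
        "source_id": ["CanESM5", "IPSL-CM6A-LR"],
        "experiment_id": ["historical", "historical"],
        "variable_id": ["o2", "o2"],
        "zstore": ["gs://bucket/CanESM5/o2/", "gs://bucket/IPSL-CM6A-LR/o2/"],
    }
)

# Grouping by every column gives one group per unique row.
groups = df.groupby(df.columns.tolist())

# Each key is a tuple holding one value per column.
keys = list(groups.groups.keys())
for key in keys:
    print(key)
```

One caveat worth checking before relying on this: by default pandas drops rows whose grouping key contains NaN, so a column that is entirely NaN for some rows (such as dcpp_init_year in many CMIP6 entries) would silently remove those rows from the groups.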

andersy005 · 2019-10-18T19:04:06Z

andersy005
Oct 18, 2019
Maintainer

@naomi-henderson, @rabernat,

With #164 the following works:

import intake
col_file = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"

col = intake.open_esm_datastore(col_file)
query = dict(experiment_id='historical', table_id='Oyr',
             variable_id='o2', grid_label='gn', member_id='r1i1p1f1')
cat = col.search(**query)

# Disable aggregations
dsets_pp = cat.to_dataset_dict(aggregate=False)
print(dsets_pp.keys())

--> The keys in the returned dictionary of datasets are constructed as follows:
	'zstore'

--> There will be 2 group(s)

dict_keys(['gs://cmip6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Oyr/o2/gn/', 'gs://cmip6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Oyr/o2/gn/'])

rabernat · 2019-10-18T19:24:55Z

rabernat
Oct 18, 2019
Author

@andersy005 - nice! However, I would prefer for the keys to be the groups, not the paths, as @matt-long suggested.

Are the keys the datasets themselves?

andersy005 · 2019-10-19T18:51:10Z

andersy005
Oct 19, 2019
Maintainer

Assuming that we have a row with the following attributes:

activity_id                                              AerChemMIP
institution_id                                                  BCC
source_id                                                  BCC-ESM1
experiment_id                                                ssp370
member_id                                                  r1i1p1f1
table_id                                                       Amon
variable_id                                                      pr
grid_label                                                       gn
zstore            gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...
dcpp_init_year                                                  NaN
Name: 0, dtype: object

I would prefer for the keys to be the groups, not the paths, as @matt-long suggested.

Should we have something along these lines?

{ 'AerChemMIP.BCC.BCC-ESM1.ssp370.r1i1p1f1.Amon.pr.gn.NaN' : 
  <xarray.Dataset>
Dimensions:    (bnds: 2, lat: 64, lon: 128, time: 492)
Coordinates:
  * lat        (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(64, 2), meta=np.ndarray>
  * lon        (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(128, 2), meta=np.ndarray>
  * time       (time) object 2015-01-16 12:00:00 ... 2055-12-16 12:00:00
    time_bnds  (time, bnds) object dask.array<chunksize=(492, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
    pr         (time, lat, lon) float32 dask.array<chunksize=(492, 64, 128), meta=np.ndarray>
Attributes:
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            AerChemMIP
    further_info_url:       https://furtherinfo.es-doc.org/CMIP6.BCC.BCC-ESM1...
    grid:                   T42
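
A key of this shape can be built by joining a row's column values with a separator and leaving out the path column. The following is a minimal sketch under those assumptions: `row_key` is a hypothetical helper, the row is modelled as a plain dict, and missing values are rendered as the literal string "NaN" to match the key shown above.

```python
import math


def row_key(row, path_column="zstore", sep="."):
    """Join a row's column values into a single key, skipping the path column."""
    parts = []
    for column, value in row.items():
        if column == path_column:
            continue  # the store path is not part of the key
        if isinstance(value, float) and math.isnan(value):
            parts.append("NaN")  # render missing values explicitly
        else:
            parts.append(str(value))
    return sep.join(parts)


row = {
    "activity_id": "AerChemMIP",
    "institution_id": "BCC",
    "source_id": "BCC-ESM1",
    "experiment_id": "ssp370",
    "member_id": "r1i1p1f1",
    "table_id": "Amon",
    "variable_id": "pr",
    "grid_label": "gn",
    "zstore": "gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...",  # truncated as in the thread
    "dcpp_init_year": float("nan"),
}

key = row_key(row)
print(key)  # AerChemMIP.BCC.BCC-ESM1.ssp370.r1i1p1f1.Amon.pr.gn.NaN
```

A key built this way can only be split back into its columns if none of the values contains the separator.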
