From 81ada937824dcd88f76c8106bd4e7d22f69b6ac7 Mon Sep 17 00:00:00 2001
From: Nathan Collier
Date: Wed, 6 Dec 2023 15:53:32 -0500
Subject: [PATCH] first draft

---
 notebooks/complex-search.ipynb | 544 +++++++++++++++++++++++++++++++++
 1 file changed, 544 insertions(+)
 create mode 100644 notebooks/complex-search.ipynb

diff --git a/notebooks/complex-search.ipynb b/notebooks/complex-search.ipynb
new file mode 100644
index 0000000..fb1d70b
--- /dev/null
+++ b/notebooks/complex-search.ipynb
@@ -0,0 +1,544 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "ESGF logo"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Complex Searching with `intake-esgf`\n",
+ "\n",
+ "## Overview\n",
+ "\n",
+ "In this tutorial we present an interface, currently under design, that facilitates complex searching using [intake-esgf](https://github.com/esgf2-us/intake-esgf). `intake-esgf` is a small package, *inspired* by `intake` and `intake-esm`, under development in ESGF2. Please note that there is a name collision with an existing package on PyPI and conda; you will need to install the package from [source](https://github.com/esgf2-us/intake-esgf).\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "| Concepts | Importance | Notes |\n",
+ "| --- | --- | --- |\n",
+ "| [Install Package](https://github.com/esgf2-us/intake-esgf) | Necessary | |\n",
+ "| [Understanding of NetCDF](https://foundations.projectpythia.org/core/data-formats/netcdf-cf.html) | Helpful | Familiarity with metadata structure |\n",
+ "| Familiar with [intake-esm](https://intake-esm.readthedocs.io/en/stable/) | Helpful | Similar interface |\n",
+ "| [Transient climate response](https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2008JD010405) | Background | |\n",
+ "\n",
+ "- **Time to learn**: 30 minutes\n",
+ "\n",
+ "## Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from intake_esgf import ESGFCatalog"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Initializing the Catalog\n",
+ "\n",
+ "As with `intake-esm`, we first instantiate the catalog. However, since it will be populated with search results, the catalog starts out empty. Internally, `intake-esgf` queries different ESGF index nodes for information about the datasets you wish to include in your analysis. As ESGF2 is actively working on an index redesign, our catalogs by default point to a Globus (ElasticSearch-based) index at the ALCF (Argonne Leadership Computing Facility)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Perform a search() to populate the catalog.\n",
+ "GlobusESGFIndex('anl-dev')\n"
+ ]
+ }
+ ],
+ "source": [
+ "cat = ESGFCatalog()\n",
+ "print(cat)\n",
+ "for ind in cat.indices:  # Which indices are included?\n",
+ "    print(ind)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We also support connecting to the ESGF1 Solr-based indices. You may specify a single server, a list of servers, or simply pass `True` to include all of the federated index nodes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "GlobusESGFIndex('anl-dev')\n",
+ "SolrESGFIndex('esgf.ceda.ac.uk')\n",
+ "SolrESGFIndex('esgf-data.dkrz.de')\n",
+ "SolrESGFIndex('esgf-node.ipsl.upmc.fr')\n",
+ "SolrESGFIndex('esg-dn1.nsc.liu.se')\n",
+ "SolrESGFIndex('esgf-node.llnl.gov')\n",
+ "SolrESGFIndex('esgf.nci.org.au')\n",
+ "SolrESGFIndex('esgf-node.ornl.gov')\n"
+ ]
+ }
+ ],
+ "source": [
+ "cat = ESGFCatalog(esgf1_indices=\"esgf-node.llnl.gov\")  # include LLNL\n",
+ "cat = ESGFCatalog(esgf1_indices=[\"esgf-node.ornl.gov\", \"esgf.ceda.ac.uk\"])  # ORNL & CEDA\n",
+ "cat = ESGFCatalog(esgf1_indices=True)  # all federated indices\n",
+ "for ind in cat.indices:\n",
+ "    print(ind)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Populate the catalog\n",
+ "\n",
+ "An analysis will often require several variables across multiple experiments. For example, to compute the transient climate response to cumulative emissions (TCRE), you would need temperature (`tas`) as well as the carbon fluxes from land (`nbp`) and ocean (`fgco2`) for a 1% CO2 increase experiment (`1pctCO2`) and for the control experiment (`piControl`). If TCRE is outside your particular science, that is fine for this notebook; it is only a motivating example, and the specifics matter less than the search concepts. First, we perform a search using a familiar syntax."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "  Searching indices: 100%|███████████████████████████████|8/8 [ 1.85s/index]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Summary information for 399 results:\n",
+ "mip_era                                                     [CMIP6]\n",
+ "activity_id                                                  [CMIP]\n",
+ "institution_id    [MOHC, MRI, MPI-M, NCAR, NOAA-GFDL, NCC, NIMS-...\n",
+ "source_id         [UKESM1-0-LL, MRI-ESM2-0, MPI-ESM1-2-LR, CESM2...\n",
+ "experiment_id                                  [piControl, 1pctCO2]\n",
+ "member_id         [r1i1p1f2, r1i2p1f1, r1i1p1f1, r2i1p1f1, r3i1p...\n",
+ "table_id                                         [Lmon, Omon, Amon]\n",
+ "variable_id                                       [nbp, fgco2, tas]\n",
+ "grid_label                                            [gn, gr1, gr]\n",
+ "dtype: object\n"
+ ]
+ }
+ ],
+ "source": [
+ "cat.search(\n",
+ "    experiment_id=[\"piControl\", \"1pctCO2\"],\n",
+ "    variable_id=[\"tas\", \"fgco2\", \"nbp\"],\n",
+ "    table_id=[\"Amon\", \"Omon\", \"Lmon\"],\n",
+ ")\n",
+ "print(cat)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Internally, this launches simultaneous searches whose results are combined locally to provide a global view of which datasets are available. While the Solr indices themselves can be searched in a distributed fashion, they will not report when an index fails to return a response. Because index nodes go down from time to time, this can leave you with the false impression that you have found all the datasets of interest. By managing the searches locally, `intake-esgf` can report back to you that an index has failed and that your results may be incomplete.\n",
+ "\n",
+ "If you would like details about what `intake-esgf` is doing, look in the local cache directory (`${HOME}/.esgf/`) for an `esgf.log` file. This is a full history of everything that `intake-esgf` has searched, downloaded, or accessed. You can also look at just this session by calling `session_log()`. In this case, you will see how long each index took to return a response and whether any of them failed."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[36;20m2023-12-06 15:51:39 \u001b[0m\u001b[36;32msearch begin\u001b[0m experiment_id=['piControl', '1pctCO2'], variable_id=['tas', 'fgco2', 'nbp'], table_id=['Amon', 'Omon', 'Lmon']\n",
+ "\u001b[36;20m2023-12-06 15:51:40 \u001b[0m└─GlobusESGFIndex('anl-dev') results=329 response_time=1.33 total_time=1.34\n",
+ "\u001b[36;20m2023-12-06 15:51:41 \u001b[0m└─SolrESGFIndex('esgf-node.ipsl.upmc.fr') response_time=1.39 total_time=1.86\n",
+ "\u001b[36;20m2023-12-06 15:51:41 \u001b[0m└─SolrESGFIndex('esg-dn1.nsc.liu.se') response_time=1.68 total_time=2.28\n",
+ "\u001b[36;20m2023-12-06 15:51:44 \u001b[0m└─SolrESGFIndex('esgf.ceda.ac.uk') response_time=1.30 total_time=5.06\n",
+ "\u001b[36;20m2023-12-06 15:51:45 \u001b[0m└─SolrESGFIndex('esgf.nci.org.au') response_time=3.42 total_time=6.59\n",
+ "\u001b[36;20m2023-12-06 15:51:46 \u001b[0m└─SolrESGFIndex('esgf-node.ornl.gov') response_time=0.78 total_time=6.76\n",
+ "\u001b[36;20m2023-12-06 15:51:46 \u001b[0m└─SolrESGFIndex('esgf-data.dkrz.de') response_time=1.89 total_time=7.29\n",
+ "\u001b[36;20m2023-12-06 15:51:54 \u001b[0m└─SolrESGFIndex('esgf-node.llnl.gov') response_time=1.83 total_time=14.77\n",
+ "\u001b[36;20m2023-12-06 15:51:54 \u001b[0m\u001b[36;32msearch end\u001b[0m total_time=15.26\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(cat.session_log())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "At this stage of the search, you have a catalog of possibly relevant datasets for your analysis, stored in a `pandas` dataframe. You are free to view and manipulate this dataframe to hone the results down; it is available to you as the `df` member of the `ESGFCatalog`, and an example of trimming rows follows the dataframe below. Be careful to only remove rows, as any column may be used internally when downloading the data. Also note that we have removed the user-facing notion of *where* the data is hosted. The `id` column of this dataframe is a list of full `dataset_id`s, which include the location information. When you are ready to download data, the locations that are fastest for you will be chosen automatically."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "     mip_era activity_id institution_id       source_id experiment_id  \\\n",
+ "0      CMIP6        CMIP           MOHC     UKESM1-0-LL     piControl   \n",
+ "1      CMIP6        CMIP            MRI      MRI-ESM2-0       1pctCO2   \n",
+ "2      CMIP6        CMIP          MPI-M   MPI-ESM1-2-LR     piControl   \n",
+ "3      CMIP6        CMIP           NCAR       CESM2-FV2     piControl   \n",
+ "4      CMIP6        CMIP      NOAA-GFDL        GFDL-CM4       1pctCO2   \n",
+ "...      ...         ...            ...             ...           ...   \n",
+ "1304   CMIP6        CMIP      NASA-GISS     GISS-E2-1-G       1pctCO2   \n",
+ "1309   CMIP6        CMIP            MRI      MRI-ESM2-0       1pctCO2   \n",
+ "2048   CMIP6        CMIP          MIROC      MIROC-ES2H     piControl   \n",
+ "2050   CMIP6        CMIP   E3SM-Project  E3SM-2-0-NARRM       1pctCO2   \n",
+ "2051   CMIP6        CMIP   E3SM-Project  E3SM-2-0-NARRM     piControl   \n",
+ "\n",
+ "       member_id table_id variable_id grid_label    version  \\\n",
+ "0       r1i1p1f2     Lmon         nbp         gn  v20200828   \n",
+ "1       r1i2p1f1     Omon       fgco2         gn  v20210311   \n",
+ "2       r1i1p1f1     Omon       fgco2         gn  v20190710   \n",
+ "3       r1i1p1f1     Amon         tas         gn  v20191120   \n",
+ "4       r1i1p1f1     Amon         tas        gr1  v20180701   \n",
+ "...          ...      ...         ...        ...        ...   \n",
+ "1304  r102i1p1f1     Lmon         nbp         gn  v20190815   \n",
+ "1309    r1i2p1f1     Amon         tas         gn  v20191205   \n",
+ "2048    r1i1p4f2     Omon       fgco2        gr1  v20230904   \n",
+ "2050    r1i1p1f1     Amon         tas         gr  v20230427   \n",
+ "2051    r1i1p1f1     Amon         tas         gr  v20230505   \n",
+ "\n",
+ "                                                     id  \n",
+ "0     [CMIP6.CMIP.MOHC.UKESM1-0-LL.piControl.r1i1p1f...  \n",
+ "1     [CMIP6.CMIP.MRI.MRI-ESM2-0.1pctCO2.r1i2p1f1.Om...  \n",
+ "2     [CMIP6.CMIP.MPI-M.MPI-ESM1-2-LR.piControl.r1i1...  \n",
+ "3     [CMIP6.CMIP.NCAR.CESM2-FV2.piControl.r1i1p1f1....  \n",
+ "4     [CMIP6.CMIP.NOAA-GFDL.GFDL-CM4.1pctCO2.r1i1p1f...  \n",
+ "...                                                 ...  \n",
+ "1304  [CMIP6.CMIP.NASA-GISS.GISS-E2-1-G.1pctCO2.r102...  \n",
+ "1309  [CMIP6.CMIP.MRI.MRI-ESM2-0.1pctCO2.r1i2p1f1.Am...  \n",
+ "2048  [CMIP6.CMIP.MIROC.MIROC-ES2H.piControl.r1i1p4f...  \n",
+ "2050  [CMIP6.CMIP.E3SM-Project.E3SM-2-0-NARRM.1pctCO...  \n",
+ "2051  [CMIP6.CMIP.E3SM-Project.E3SM-2-0-NARRM.piCont...  \n",
+ "\n",
+ "[399 rows x 11 columns]"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cat.df"
+ ]
+ },
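+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For example, you could use ordinary `pandas` operations to trim these results. The next cell is only a sketch and is not required for the rest of this notebook: the `source_id` it excludes was chosen arbitrarily, and because the filtered view is not assigned back to `cat.df`, the catalog itself is left unchanged. Assigning the view back would remove those rows from all subsequent steps."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A sketch of manual filtering with pandas; the model excluded here is arbitrary.\n",
+ "# Viewing the subset leaves the catalog untouched. To actually remove the rows,\n",
+ "# you would assign the result back, e.g. `cat.df = subset`.\n",
+ "subset = cat.df[cat.df[\"source_id\"] != \"CESM2-FV2\"]\n",
+ "subset.head()"
+ ]
+ },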
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Model Groups\n",
+ "\n",
+ "Beyond manual filtering, `intake-esgf` also provides some tools to help locate the data relevant to your analysis. When conducting these kinds of analyses, we are searching for unique combinations of `source_id`, `member_id`, and `grid_label` that have all the variables we need. We call these *model groups*. In an ESGF search, it is common to find a model that has, for example, a `tas` for `r1i1p1f1` but not an `fgco2`. Sorting this out by hand is time consuming and labor intensive, so we provide a function that prints out all model groups."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "cat.model_groups().to_frame()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The function `model_groups()` returns a pandas Series (converted to a dataframe here for printing) with all unique combinations of (`source_id`, `member_id`, `grid_label`) along with the dataset count for each. This helps illustrate why it can be so difficult to locate all the data relevant to a given analysis. At the time of this writing, there are 148 model groups, but relatively few of them have all 6 datasets (2 experiments and 3 variables) that we need. Furthermore, you cannot rely on a model group using `r1i1p1f1` for its primary result. The results above show that UKESM does not even use `f1`, further complicating the process of finding results.\n",
+ "\n",
+ "In addition to this notion of *model groups*, `intake-esgf` provides a method, `remove_incomplete()`, for determining which model groups you wish to keep in the current search. Internally, the search results dataframe is grouped by model group, and a function of your design is applied to each grouped portion of the dataframe. For example, for the current analysis, we could simply check that there are 6 datasets in the sub-dataframe."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def shall_i_keep_it(sub_df):\n",
+ "    if len(sub_df) == 6:\n",
+ "        return True\n",
+ "    return False\n",
+ "\n",
+ "\n",
+ "cat.remove_incomplete(shall_i_keep_it)\n",
+ "cat.model_groups().to_frame()"
+ ]
+ },
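+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The criterion can be as strict as your analysis requires. As a sketch (defined below but not applied here), a stricter function could also insist that every required variable is present in both experiments, rather than relying on the dataset count alone."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A sketch of a stricter criterion (not applied here): require that tas, fgco2,\n",
+ "# and nbp are all present in both experiments, not just that 6 datasets exist.\n",
+ "def has_all_variables(sub_df):\n",
+ "    required = {\"tas\", \"fgco2\", \"nbp\"}\n",
+ "    for experiment in [\"piControl\", \"1pctCO2\"]:\n",
+ "        present = set(sub_df[sub_df[\"experiment_id\"] == experiment][\"variable_id\"])\n",
+ "        if not required.issubset(present):\n",
+ "            return False\n",
+ "    return True\n",
+ "\n",
+ "\n",
+ "# This could be passed to `cat.remove_incomplete(has_all_variables)` in place of\n",
+ "# the simple count used above."
+ ]
+ },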
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You could write a much more complex check along these lines; it depends on what is relevant to your analysis. The effect is that the list of possible models with consistent results is now much more manageable. This method has the added benefit of forcing the user to be explicit about which models were included in an analysis.\n",
+ "\n",
+ "## Removing Additional Variants\n",
+ "\n",
+ "You may also wish to include only a single `member_id` per model in your analysis. The above search shows that a few models have multiple variants with all 6 required datasets. To be fair to the models that only have one, you may wish to keep only the *smallest* variant. We also provide this capability as part of the `ESGFCatalog` object.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "cat.remove_ensembles()\n",
+ "cat.model_groups().to_frame()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Summary\n",
+ "\n",
+ "At this point, you would be ready to use `to_dataset_dict()` to download and load all of the datasets into a dictionary for analysis. The point of this notebook, however, is to demonstrate the search capabilities. Our goal is to make tedious and time-consuming tasks easier by providing smart interfaces for common operations. Let us [know](https://github.com/esgf2-us/intake-esgf/issues) what else is painful for you when locating data relevant to your science."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.15"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}