edit argovis section
song-sangmin committed Jun 13, 2024
1 parent 1ff11e7 commit 6ea2013
Showing 1 changed file with 233 additions and 27 deletions.
260 changes: 233 additions & 27 deletions notebooks/argo-access.ipynb
@@ -35,11 +35,11 @@
"\n",
"Building upon the previous notebook, [Introduction to Argo](notebooks/argo-introduction.ipynb), we next explore how to access Argo data using various methods.\n",
"\n",
"These methods are described in more detail on their respective websites, linked below. Our goal here is to provide a brief overview of some of the different tools available. \n",
"\n",
"1. [GO-BGC Toolbox](https://github.com/go-bgc/workshop-python) \n",
"2. [Argopy](https://argopy.readthedocs.io/en/latest/user-guide/fetching-argo-data/index.html), a dedicated Python package\n",
"3. [Argovis](https://argovis.colorado.edu/argo) for API-based queries \n",
"\n",
"<!-- 2. Downloading [monthly snapshots](http://www.argodatamgt.org/Access-to-data/Argo-DOI-Digital-Object-Identifier) using Argo DOI's -->\n",
"<!-- 4. Using the [GO-BGC Toolbox](https://github.com/go-bgc/workshop-python) -->\n",
@@ -82,7 +82,7 @@
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
@@ -95,6 +95,11 @@
"import xarray as xr\n",
"from datetime import datetime, timedelta\n",
"\n",
"import requests\n",
"import time\n",
"import urllib3\n",
"import shutil\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib.colors as mcolors\n",
"import seaborn as sns\n",
@@ -107,44 +112,252 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Downloading with the GO-BGC Toolbox\n",
"\n",
"In the previous notebook, [Introduction to Argo](notebooks/argo-introduction.ipynb), we saw how Argo synthetic profile ('[Sprof](https://archimer.ifremer.fr/doc/00445/55637/)') data is stored in NetCDF format.\n",
"\n",
"The GDAC functions below let you subset and download Sprof files for multiple floats. \n",
"We recommend this tool for users who only need a few profiles in a specific area of interest. \n",
"Considerations: \n",
"- Easy to use and understand\n",
"- Downloads float data as individual .nc files to your local machine (takes up storage space)\n",
"- Must download all variables available (cannot subset only variables of interest)\n",
"\n",
"The two major functions below are courtesy of the [GO-BGC Toolbox](https://github.com/go-bgc/workshop-python) (Ethan Campbell). A full tutorial is available in the Toolbox.\n"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"# # Base filepath. Needed for the Argo GDAC function.\n",
"# root = '/Users/sangminsong/Library/CloudStorage/OneDrive-UW/Code/2024_Pythia/'\n",
"# profile_dir = root + 'SOCCOM_GO-BGC_LoResQC_LIAR_28Aug2023_netcdf/'\n",
"\n",
"# # Base filepath. Needed for the Argo GDAC function.\n",
"root = '../data/'\n",
"profile_dir = root + 'bgc-argo/'"
]
},
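One practical note: `download_file` (defined in 1.0 below) writes with a plain `open(save_to + filename, 'wb')`, so the target directory must already exist. A minimal guard, sketched here with a temporary directory standing in for the notebook's `profile_dir`:

```python
import os
import tempfile

# Stand-in for the notebook's root / profile_dir paths
root = tempfile.mkdtemp()
profile_dir = os.path.join(root, "bgc-argo") + os.sep

# Create the folder (no error if it already exists)
os.makedirs(profile_dir, exist_ok=True)
print(os.path.isdir(profile_dir))
```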
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.0 GO-BGC Toolbox Functions"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"# Function to download a single file (From GO-BGC Toolbox)\n",
"def download_file(url_path,filename,save_to=None,overwrite=False,verbose=True):\n",
" \"\"\" Downloads and saves a file from a given URL using HTTP protocol.\n",
"\n",
" Note: if a '404 file not found' error is returned, the function will return without downloading anything.\n",
" \n",
" Arguments:\n",
" url_path: root URL to download from including trailing slash ('/')\n",
" filename: filename to download including suffix\n",
" save_to: None (to download to the root directory defined in this notebook)\n",
" or directory path\n",
" overwrite: False to leave existing files in place\n",
" or True to overwrite existing files\n",
" verbose: True to announce progress\n",
" or False to stay silent\n",
" \n",
" \"\"\"\n",
" urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n",
"\n",
" if save_to is None:\n",
" save_to = root\n",
"\n",
" try:\n",
" if filename in os.listdir(save_to):\n",
" if not overwrite:\n",
" if verbose: print('>>> File ' + filename + ' already exists. Leaving current version.')\n",
" return\n",
" else:\n",
" if verbose: print('>>> File ' + filename + ' already exists. Overwriting with new version.')\n",
"\n",
" def get_func(url,stream=True):\n",
" try:\n",
" return requests.get(url,stream=stream,auth=None,verify=False)\n",
" except requests.exceptions.ConnectionError as error_tag:\n",
" print('Error connecting:',error_tag)\n",
" time.sleep(1)\n",
" return get_func(url,stream=stream)\n",
"\n",
" response = get_func(url_path + filename,stream=True)\n",
"\n",
" if response.status_code == 404:\n",
" if verbose: print('>>> File ' + filename + ' returned 404 error during download.')\n",
" return\n",
" with open(save_to + filename,'wb') as out_file:\n",
" shutil.copyfileobj(response.raw,out_file)\n",
" del response\n",
" if verbose: print('>>> Successfully downloaded ' + filename + '.')\n",
"\n",
" except Exception:\n",
" if verbose: print('>>> An error occurred while trying to download ' + filename + '.')"
]
},
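One detail worth noting: the nested `get_func` above retries a failed connection by calling itself recursively, so a long outage could in principle hit Python's recursion limit. The same retry behavior can be written as a bounded loop; the helper below is a hypothetical sketch, not part of the Toolbox:

```python
import time

def get_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying a bounded number of times on connection errors."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except ConnectionError as err:
            print('Error connecting:', err)
            time.sleep(delay)
    raise ConnectionError(f'giving up on {url} after {retries} attempts')

# Simulated flaky endpoint: fails twice, then succeeds
calls = []
def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('simulated outage')
    return 'payload'

result = get_with_retry(flaky, 'https://example.org/file.nc', retries=5, delay=0.0)
print(result, len(calls))
```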
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"# Function to download and parse GDAC synthetic profile index file (GO-BGC Toolbox)\n",
"def argo_gdac(lat_range=None,lon_range=None,start_date=None,end_date=None,sensors=None,floats=None,\n",
" overwrite_index=False,overwrite_profiles=False,skip_download=False,\n",
" download_individual_profs=False,save_to=None,verbose=True):\n",
" \"\"\" Downloads GDAC Sprof index file, then selects float profiles based on criteria.\n",
" Either returns information on profiles and floats (if skip_download=True) or downloads them (if False).\n",
"\n",
" Arguments:\n",
" lat_range: None, to select all latitudes\n",
" or [lower, upper] within -90 to 90 (selection is inclusive)\n",
" lon_range: None, to select all longitudes\n",
" or [lower, upper] within either -180 to 180 or 0 to 360 (selection is inclusive)\n",
" NOTE: longitude range is allowed to cross -180/180 or 0/360\n",
" start_date: None or datetime object\n",
" end_date: None or datetime object\n",
" sensors: None, to select profiles with any combination of sensors\n",
" or string or list of strings to specify required sensors\n",
" > note that common options include PRES, TEMP, PSAL, DOXY, CHLA, BBP700,\n",
" PH_IN_SITU_TOTAL, and NITRATE\n",
" floats: None, to select any floats matching other criteria\n",
" or int or list of ints specifying floats' WMOID numbers\n",
" overwrite_index: False to keep existing downloaded GDAC index file, or True to download new index\n",
" overwrite_profiles: False to keep existing downloaded profile files, or True to download new files\n",
" skip_download: True to skip download and return (wmoids, gdac_index_subset)\n",
" or False to download those profiles\n",
" download_individual_profs: False to download single Sprof file containing all profiles for each float\n",
" or True to download individual profile files for each float\n",
" save_to: None to download to the root directory defined in this notebook\n",
" or string to specify directory path for profile downloads\n",
" verbose: True to announce progress, or False to stay silent\n",
"\n",
" \"\"\"\n",
" # Paths\n",
" url_root = 'https://www.usgodae.org/ftp/outgoing/argo/'\n",
" dac_url_root = url_root + 'dac/'\n",
" index_filename = 'argo_synthetic-profile_index.txt'\n",
" if save_to is None: save_to = root\n",
"\n",
" # Download GDAC synthetic profile index file\n",
" download_file(url_root,index_filename,overwrite=overwrite_index)\n",
"\n",
" # Load index file into Pandas DataFrame\n",
" gdac_index = pd.read_csv(root + index_filename,delimiter=',',header=8,parse_dates=['date','date_update'],\n",
" date_parser=lambda x: pd.to_datetime(x,format='%Y%m%d%H%M%S'))\n",
"\n",
" # Establish time and space criteria\n",
" if lat_range is None: lat_range = [-90.0,90.0]\n",
" if lon_range is None: lon_range = [-180.0,180.0]\n",
" elif lon_range[0] > 180 or lon_range[1] > 180:\n",
" if lon_range[0] > 180: lon_range[0] -= 360\n",
" if lon_range[1] > 180: lon_range[1] -= 360\n",
" if start_date is None: start_date = datetime(1900,1,1)\n",
" if end_date is None: end_date = datetime(2200,1,1)\n",
"\n",
" float_wmoid_regexp = r'[a-z]*/[0-9]*/profiles/[A-Z]*([0-9]*)_[0-9]*[A-Z]*.nc'\n",
" gdac_index['wmoid'] = gdac_index['file'].str.extract(float_wmoid_regexp).astype(int)\n",
" filepath_main_regexp = '([a-z]*/[0-9]*/)profiles/[A-Z]*[0-9]*_[0-9]*[A-Z]*.nc'\n",
" gdac_index['filepath_main'] = gdac_index['file'].str.extract(filepath_main_regexp)\n",
" filepath_regexp = '([a-z]*/[0-9]*/profiles/)[A-Z]*[0-9]*_[0-9]*[A-Z]*.nc'\n",
" gdac_index['filepath'] = gdac_index['file'].str.extract(filepath_regexp)\n",
" filename_regexp = '[a-z]*/[0-9]*/profiles/([A-Z]*[0-9]*_[0-9]*[A-Z]*.nc)'\n",
" gdac_index['filename'] = gdac_index['file'].str.extract(filename_regexp)\n",
"\n",
" # Subset profiles based on time and space criteria\n",
" gdac_index_subset = gdac_index.loc[np.logical_and.reduce([gdac_index['latitude'] >= lat_range[0],\n",
" gdac_index['latitude'] <= lat_range[1],\n",
" gdac_index['date'] >= start_date,\n",
" gdac_index['date'] <= end_date]),:]\n",
" if lon_range[1] >= lon_range[0]: # range does not cross -180/180 or 0/360\n",
" gdac_index_subset = gdac_index_subset.loc[np.logical_and(gdac_index_subset['longitude'] >= lon_range[0],\n",
" gdac_index_subset['longitude'] <= lon_range[1])]\n",
" elif lon_range[1] < lon_range[0]: # range crosses -180/180 or 0/360\n",
" gdac_index_subset = gdac_index_subset.loc[np.logical_or(gdac_index_subset['longitude'] >= lon_range[0],\n",
" gdac_index_subset['longitude'] <= lon_range[1])]\n",
"\n",
" # If requested, subset profiles using float WMOID criteria\n",
" if floats is not None:\n",
" if type(floats) is not list: floats = [floats]\n",
" gdac_index_subset = gdac_index_subset.loc[gdac_index_subset['wmoid'].isin(floats),:]\n",
"\n",
" # If requested, subset profiles using sensor criteria\n",
" if sensors is not None:\n",
" if type(sensors) is not list: sensors = [sensors]\n",
" for sensor in sensors:\n",
" gdac_index_subset = gdac_index_subset.loc[gdac_index_subset['parameters'].str.contains(sensor),:]\n",
"\n",
" # Examine subsetted profiles\n",
" wmoids = gdac_index_subset['wmoid'].unique()\n",
" wmoid_filepaths = gdac_index_subset['filepath_main'].unique()\n",
"\n",
" # Just return list of floats and DataFrame with subset of index file, or download each profile\n",
" if not skip_download:\n",
" downloaded_filenames = []\n",
" if download_individual_profs:\n",
" for p_idx in gdac_index_subset.index:\n",
" download_file(dac_url_root + gdac_index_subset.loc[p_idx]['filepath'],\n",
" gdac_index_subset.loc[p_idx]['filename'],\n",
" save_to=save_to,overwrite=overwrite_profiles,verbose=verbose)\n",
" downloaded_filenames.append(gdac_index_subset.loc[p_idx]['filename'])\n",
" else:\n",
" for f_idx, wmoid_filepath in enumerate(wmoid_filepaths):\n",
" download_file(dac_url_root + wmoid_filepath,str(wmoids[f_idx]) + '_Sprof.nc',\n",
" save_to=save_to,overwrite=overwrite_profiles,verbose=verbose)\n",
" downloaded_filenames.append(str(wmoids[f_idx]) + '_Sprof.nc')\n",
" return wmoids, gdac_index_subset, downloaded_filenames\n",
" else:\n",
" return wmoids, gdac_index_subset"
]
},
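The longitude subsetting in `argo_gdac` accepts ranges that cross the -180/180 (or 0/360) seam by switching from an AND to an OR condition. That core logic, isolated into a small hypothetical helper:

```python
import numpy as np

def lon_in_range(lon, lon_range):
    """Boolean mask for longitudes in [lower, upper], allowing the
    range to wrap across the -180/180 discontinuity."""
    lon = np.asarray(lon)
    lower, upper = lon_range
    if upper >= lower:                      # ordinary range
        return (lon >= lower) & (lon <= upper)
    return (lon >= lower) | (lon <= upper)  # range wraps around the seam

# A range from 160E across the date line to 160W (-160)
mask = lon_in_range([170.0, -170.0, 0.0], [160.0, -160.0])
print(mask)
```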
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Using GDAC function to access Argo subsets"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# don't download, just get WMOIDs\n",
"# wmoids, gdac_index = argo_gdac(lat_range=lat_bounds,lon_range=lon_bounds,\n",
"# start_date=start_yd,end_date=end_yd,\n",
"# sensors=None,floats=None,\n",
"# overwrite_index=True,overwrite_profiles=False,\n",
"# skip_download=True,download_individual_profs=False,\n",
"# save_to=profile_dir,verbose=True)\n",
"\n",
"# download a specific float, WMOID 5906030\n",
"wmoids, gdac_index, downloaded_filenames \\\n",
" = argo_gdac(lat_range=None,lon_range=None,\n",
" start_date=None,end_date=None,\n",
" sensors=None,floats=5906030,\n",
" overwrite_index=True,overwrite_profiles=False,\n",
" skip_download=False,download_individual_profs=False,\n",
" save_to=profile_dir,verbose=True)"
]
},
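The WMOID extraction in `argo_gdac` relies on a regular expression applied to the index file's `file` column. A toy demonstration of that same pattern, using invented file paths in the style of `argo_synthetic-profile_index.txt`:

```python
import pandas as pd

# Two made-up rows in the style of the GDAC index file's 'file' column
files = pd.Series([
    "aoml/5906030/profiles/SD5906030_001.nc",
    "coriolis/6902746/profiles/SR6902746_045D.nc",
])

# Same regexp as in argo_gdac: capture the float's WMOID from the path
wmoid = files.str.extract(r'[a-z]*/[0-9]*/profiles/[A-Z]*([0-9]*)_[0-9]*[A-Z]*.nc').astype(int)
print(wmoid[0].tolist())
```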
{
@@ -162,18 +375,11 @@
"# # DSdict['5906030']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Using the Argopy Python Package"
]
},
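The cells for this section are collapsed in the diff above, but a minimal argopy query typically follows the pattern sketched below. The box values are arbitrary, and the guard keeps the sketch from failing where argopy (or network access to the Argo index) is unavailable:

```python
# argopy's region accessor takes a box of
# [lon_min, lon_max, lat_min, lat_max, pres_min, pres_max, date_start, date_end]
box = [-75, -45, 20, 30, 0, 100, "2021-01", "2021-06"]

try:
    from argopy import DataFetcher       # requires `pip install argopy`
    fetcher = DataFetcher().region(box)  # lazy: nothing downloaded yet
    # ds = fetcher.to_xarray()           # uncomment to trigger the actual fetch
except Exception:                        # argopy missing or index unreachable
    fetcher = None

print(len(box))
```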
{
@@ -187,7 +393,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Querying Data with Argovis"
]
},
{
