Update quickstart.ipynb

lincc-frameworks · Jan 7, 2025 · 325709f
1 parent b60f6eb
commit 325709f
Showing 1 changed file with 131 additions and 9 deletions.
diff --git a/docs/gettingstarted/quickstart.ipynb b/docs/gettingstarted/quickstart.ipynb
@@ -4,19 +4,23 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Quickstart"
+    "# Quickstart\n",
+    "\n",
+    "This notebook provides a brief introduction to nested-pandas, including the motivation and basics for working with the data structure. For more in-depth descriptions, see the other tutorial notebooks."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Installation\n",
+    "\n",
     "With a valid Python environment, nested-pandas and it's dependencies are easy to install using the `pip` package manager. The following command can be used to install it:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -27,7 +31,43 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Nested-Pandas is tailored towards efficient analysis of nested datasets. Let's load a toy dataset to show how it works."
+    "## Overview\n",
+    "\n",
+    "Nested-Pandas is tailored towards efficient analysis of nested data sets. This includes data that would normally be represented in a Pandas DataFrame with multiple rows needed to represent a single \"thing\", and therefore multiple columns whose values are identical across those rows.\n",
+    "\n",
+    "As a concrete example, consider an astronomical data set storing information about observations of physical objects, such as stars and galaxies. One way to represent this in Pandas is to create one row per observation with an ID column indicating to which physical object the observation corresponds. However, this approach repeats a lot of data across every observation of the same object, such as its location on the sky (RA, dec), its classification, etc. Further, any operation that processes the data as a time series requires the user to first perform a (potentially expensive) group-by operation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Represent nested time series information as a classic pandas dataframe.\n",
+    "my_data_frame = pd.DataFrame(\n",
+    "    {\n",
+    "        \"id\": [0, 0, 0, 1, 1],\n",
+    "        \"ra\": [10.0, 10.0, 10.0, 15.0, 15.0],\n",
+    "        \"dec\": [0.0, 0.0, 0.0, -1.0, -1.0],\n",
+    "        \"time\": [60676.0, 60677.0, 60678.0, 60675.0, 60676.5],\n",
+    "        \"brightness\": [100.0, 101.0, 99.8, 5.0, 5.01],\n",
+    "    }\n",
+    ")\n",
+    "my_data_frame"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Beyond astronomical data, we might be interested in tracking a patient's blood pressure over time, measuring intensities at different wavelengths, or storing a list of the types of rock found at different depths of core samples. In each case, it is possible to represent this data with multiple rows (for example, one row for each patient + measurement pair) and associate them by ID.\n",
+    "\n",
+    "In contrast, nested-pandas allows columns to represent nested data. We can have columns with the (single) value for each object’s unvarying characteristics (location on the sky, patient birth date, location of the core sample) and nested columns for the values of each observation.\n",
+    "\n",
+    "Let's see an example:"
    ]
   },
   {
@@ -47,7 +87,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The above dataframe is a `NestedFrame`, which extends the capabilities of the Pandas `DataFrame` to support columns with nested information. In this example, we have the top level dataframe with 10 rows and 2 typical columns, \"a\" and \"b\". The \"nested\" column contains a dataframe in each row. We can inspect the contents of the \"nested\" column using pandas API tooling like `loc`."
+    "The above dataframe is a `NestedFrame`, which extends the capabilities of the Pandas `DataFrame` to support columns with nested information. \n",
+    "\n",
+    "In this example, we have the top level dataframe with 10 rows and 2 typical columns, \"a\" and \"b\". These represent object-level attributes. The \"nested\" column contains a dataframe in each row that represents a time series (or observation-level values). As we will see below, this allows easy access to all of the observations for a given object.\n",
+    "\n",
+    "## Accessing Nested Data\n",
+    "\n",
+    "We can inspect the contents of the \"nested\" column using pandas API tooling like `loc`."
    ]
   },
   {
@@ -63,7 +109,79 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Here we see that within the \"nested\" column there are `NestedFrame` objects with their own data. In this case we have 3 columns (\"t\", \"flux\", and \"band\"). Alternatively, we could inspect the available columns using some custom properties of the `NestedFrame`."
+    "Here we see that within the \"nested\" column there are `NestedFrame` objects with their own data. In this case we have 3 columns (\"t\", \"flux\", and \"band\") that represent a time series of observations. \n",
+    "\n",
+    "Note that `loc` itself accesses the row, so the combination of `nf.loc[0][\"nested\"]` means we are looking at the value in the \"nested\" column for a single row (row 0). If we just use `nf.loc[0]`, we retrieve the entire row, including the nested column and all other columns. Similarly, if we use `nf[\"nested\"]`, we retrieve the nested column for all rows. What makes the nesting useful is that once we access the nested entry for a specific row, we can treat the value as a table in its own right.\n",
+    "\n",
+    "As in Pandas, we can still access individual entries from a column based on the row index. Thus we can also access the entry (table) in row 0 of the nested column as `nf[\"nested\"][0]`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nf[\"nested\"][0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can also use dot notation to access all the values in a nested sub-column:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nf[\"nested.t\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that \"nested.t\" contains the t values for all rows, but preserves the nesting information. The id column of the returned data indicates the top-level row (in `nf`) where each value resides.\n",
+    "\n",
+    "Similarly, we can access the values for a given top-level row by index. To get all the `t` values for row 0 we could specify:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nf[\"nested.t\"][0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here the `[0]` is telling our nested frame to access the values of the series `nf[\"nested.t\"]` where the id = 0. If we try `nf[\"nested.t\"][0][0]` we again match id = 0 and return the same frame. So to access a single element within the series, we can use its location:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nf[\"nested.t\"][0].iloc[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Inspecting Nested Frames\n",
+    "\n",
+    "We can inspect the available columns using some custom properties of the `NestedFrame`."
    ]
   },
   {
@@ -90,7 +208,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "nested-pandas extends the Pandas API, meaning any operation you could do in Pandas is available within nested-pandas. However, nested-pandas has additional functionality and tooling to better support working with nested datasets. For example, let's look at `query`:"
+    "## Pandas Operations\n",
+    "\n",
+    "Nested-pandas extends the Pandas API, meaning any operation you could do in Pandas is available within nested-pandas. However, nested-pandas has additional functionality and tooling to better support working with nested datasets. For example, let's look at `query`:"
    ]
   },
   {
@@ -149,6 +269,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Reduce Function\n",
+    "\n",
     "Finally, we'll end with the flexible `reduce` function. `reduce` functions similarly to Pandas' `apply` but flattens (reduces) the inputs from nested layers into array inputs to the given apply function. For example, let's find the mean flux for each dataframe in \"nested\":"
    ]
   },
@@ -174,7 +296,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 15,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -211,7 +333,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "tdastro",
    "language": "python",
    "name": "python3"
   },
@@ -225,7 +347,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.11"
+   "version": "3.10.4"
   }
  },
  "nbformat": 4,