From 325709fc9e4f52264cf40345ea26e9edfebcbdfd Mon Sep 17 00:00:00 2001 From: Jeremy Kubica <104161096+jeremykubica@users.noreply.github.com> Date: Tue, 7 Jan 2025 12:48:52 -0500 Subject: [PATCH 1/4] Update quickstart.ipynb --- docs/gettingstarted/quickstart.ipynb | 140 +++++++++++++++++++++++++-- 1 file changed, 131 insertions(+), 9 deletions(-) diff --git a/docs/gettingstarted/quickstart.ipynb b/docs/gettingstarted/quickstart.ipynb index 6d7355e..4a72f1b 100644 --- a/docs/gettingstarted/quickstart.ipynb +++ b/docs/gettingstarted/quickstart.ipynb @@ -4,19 +4,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Quickstart" + "# Quickstart\n", + "\n", + "This notebook provides a brief introduction to nested-pandas, including the motivation and basics for working with the data structure. For more in-depth descriptions, see the other tutorial notebooks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "## Installation\n", + "\n", "With a valid Python environment, nested-pandas and it's dependencies are easy to install using the `pip` package manager. The following command can be used to install it:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -27,7 +31,43 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Nested-Pandas is tailored towards efficient analysis of nested datasets. Let's load a toy dataset to show how it works." + "## Overview\n", + "\n", + "Nested-Pandas is tailored towards efficient analysis of nested data sets. This includes data that would normally be represented in a Pandas DataFrames with multiple rows needed to represent a single \"thing\" and therefor multiple columns whose values will be identical for that item.\n", + "\n", + "As a concrete example, consider an astronomical data set storing information about observations of physical objects, such as stars and galaxies. 
One way to represent this in Pandas is to create one row per observation with an ID column indicating to which physical object the observation corresponds. However this approach ends up repeating a lot of data over each observation of the same object such as its location on the sky (RA, dec), its classification, etc. Further any operations processing the data as time series requires the user to first perform a (potentially expensive) group-by operation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "# Represent nested time series information as a classic pandas dataframe.\n", + "my_data_frame = pd.DataFrame(\n", + " {\n", + " \"id\": [0, 0, 0, 1, 1],\n", + " \"ra\": [10.0, 10.0, 10.0, 15.0, 15.0],\n", + " \"dec\": [0.0, 0.0, 0.0, -1.0, -1.0],\n", + " \"time\": [60676.0, 60677.0, 60678.0, 60675.0, 60676.5],\n", + " \"brightness\": [100.0, 101.0, 99.8, 5.0, 5.01],\n", + " }\n", + ")\n", + "my_data_frame" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Beyond astronomical data we might be interested in tracking a patients blood pressure over time, a measure of intensities at different wavelengths, or storing a list of the type of rock found at different depths of core samples. In each case it is possible to represent this data with multiple rows (one row for each patient + measurement pair) and associate them together by ids.\n", + "\n", + "In contrast, nested-pandas allows columns to represent nested data. 
We can have columns with the (single) value for the objects’ unvarying characteristics (location on the sky, patentient birth date, location of the core sample) and nested columns for the values of each observation.\n", + "\n", + "Let's see an example:" ] }, { @@ -47,7 +87,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The above dataframe is a `NestedFrame`, which extends the capabilities of the Pandas `DataFrame` to support columns with nested information. In this example, we have the top level dataframe with 10 rows and 2 typical columns, \"a\" and \"b\". The \"nested\" column contains a dataframe in each row. We can inspect the contents of the \"nested\" column using pandas API tooling like `loc`." + "The above dataframe is a `NestedFrame`, which extends the capabilities of the Pandas `DataFrame` to support columns with nested information. \n", + "\n", + "In this example, we have the top level dataframe with 10 rows and 2 typical columns, \"a\" and \"b\". These represent object level attributes. The \"nested\" column contains a dataframe in each row that represents a time series (or observation level values). As we will see below, this allows easy access to the all of the observations for a given object.\n", + "\n", + "## Accessing Nested Data\n", + "\n", + "We can inspect the contents of the \"nested\" column using pandas API tooling like `loc`." ] }, { @@ -63,7 +109,79 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Here we see that within the \"nested\" column there are `NestedFrame` objects with their own data. In this case we have 3 columns (\"t\", \"flux\", and \"band\"). Alternatively, we could inspect the available columns using some custom properties of the `NestedFrame`." + "Here we see that within the \"nested\" column there are `NestedFrame` objects with their own data. In this case we have 3 columns (\"t\", \"flux\", and \"band\") that represent a time series of observations. 
\n", + "\n", + "Note that `loc` itself accesses the row, so the combination of `nf.loc[0][\"nested\"]` means we are looking at value in the \"nested\" column for a single row (row 0). If we just use `nf.loc[0]` we would retrieve the entire row, including the nested colum and all other columns. Similarly if we use `nf[\"nested”]` we retrieve the nested column for all rows. What makes the nesting useful is that once we access the nested entry for a specific row, we can treat the value as a table in its own right.\n", + "\n", + "As in Pandas, we can still access individual entries from a column ased on the row index. Thus we can access the entry (table) in row 0 of the nested column as `nf[\"nested\"][0]` as well." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nf[\"nested\"][0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also use dot notation to access all the values in a nested sub column:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nf[\"nested.t\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that \"nested.t\" contains the t values for all rows, but preserves the nesting information. The id column of the returned data maps the top-level row (in `nf`) whether this value resides.\n", + "\n", + "Similarly we can access the values for a given top-level row by index. To get all the `t` values for row 0 we could specify:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nf[\"nested.t\"][0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here the `[0]` is telling our nested frame to access the values of the series `nf[\"nested.t\"]` where the id = 0. If we try `nf[\"nested.t\"][0][0]` we again match id = 0 and return the same frame. 
So to access a single element within the series, we can use its location:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nf[\"nested.t\"][0].iloc[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Inspecting Nested Frames\n", + "\n", + "We can inspect the available columns using some custom properties of the `NestedFrame`." ] }, { @@ -90,7 +208,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "nested-pandas extends the Pandas API, meaning any operation you could do in Pandas is available within nested-pandas. However, nested-pandas has additional functionality and tooling to better support working with nested datasets. For example, let's look at `query`:" + "## Pandas Operations\n", + "\n", + "Nested-pandas extends the Pandas API, meaning any operation you could do in Pandas is available within nested-pandas. However, nested-pandas has additional functionality and tooling to better support working with nested datasets. For example, let's look at `query`:" ] }, { @@ -149,6 +269,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "## Reduce Function\n", + "\n", "Finally, we'll end with the flexible `reduce` function. `reduce` functions similarly to Pandas' `apply` but flattens (reduces) the inputs from nested layers into array inputs to the given apply function. 
For example, let's find the mean flux for each dataframe in \"nested\":" ] }, @@ -174,7 +296,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ @@ -211,7 +333,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "tdastro", "language": "python", "name": "python3" }, @@ -225,7 +347,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.10.4" } }, "nbformat": 4, From 8e7d1414f6961aaa0e3b6a7df957357184ee9216 Mon Sep 17 00:00:00 2001 From: Jeremy Kubica <104161096+jeremykubica@users.noreply.github.com> Date: Tue, 7 Jan 2025 13:57:08 -0500 Subject: [PATCH 2/4] Clear notebook --- docs/gettingstarted/quickstart.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/gettingstarted/quickstart.ipynb b/docs/gettingstarted/quickstart.ipynb index 4a72f1b..966ffb5 100644 --- a/docs/gettingstarted/quickstart.ipynb +++ b/docs/gettingstarted/quickstart.ipynb @@ -20,7 +20,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -296,7 +296,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ From 9ded7770fe60883149d653fe7c6f75d501c04146 Mon Sep 17 00:00:00 2001 From: Jeremy Kubica <104161096+jeremykubica@users.noreply.github.com> Date: Wed, 8 Jan 2025 09:38:53 -0500 Subject: [PATCH 3/4] Make examples consistent --- docs/gettingstarted/quickstart.ipynb | 80 ++++++++++++++++------------ 1 file changed, 46 insertions(+), 34 deletions(-) diff --git a/docs/gettingstarted/quickstart.ipynb b/docs/gettingstarted/quickstart.ipynb index 966ffb5..0afc510 100644 --- a/docs/gettingstarted/quickstart.ipynb +++ b/docs/gettingstarted/quickstart.ipynb @@ -33,9 +33,11 @@ "source": [ "## Overview\n", "\n", - "Nested-Pandas is tailored towards efficient analysis 
of nested data sets. This includes data that would normally be represented in a Pandas DataFrames with multiple rows needed to represent a single \"thing\" and therefor multiple columns whose values will be identical for that item.\n", + "Nested-Pandas is tailored towards efficient analysis of nested data sets. This includes data that would normally be represented in a Pandas DataFrame with multiple rows needed to represent a single \"thing\" and therefore columns whose values will be identical for that item.\n", "\n", - "As a concrete example, consider an astronomical data set storing information about observations of physical objects, such as stars and galaxies. One way to represent this in Pandas is to create one row per observation with an ID column indicating to which physical object the observation corresponds. However this approach ends up repeating a lot of data over each observation of the same object such as its location on the sky (RA, dec), its classification, etc. Further any operations processing the data as time series requires the user to first perform a (potentially expensive) group-by operation." + "As a concrete example, consider an astronomical data set storing information about observations of physical objects, such as stars and galaxies. One way to represent this in Pandas is to create one row per observation with an ID column indicating to which physical object the observation corresponds. However, this approach ends up repeating a lot of data over each observation of the same object, such as its location on the sky (RA, dec), its classification, etc. Further, any operation processing the data as time series requires the user to first perform a (potentially expensive) group-by operation to aggregate all of the data for each object.\n", + "\n", + "Let's create a flat pandas dataframe with three objects: object 0 has three observations, object 1 has three observations, and object 2 has four observations."
] }, { @@ -49,11 +51,11 @@ "# Represent nested time series information as a classic pandas dataframe.\n", "my_data_frame = pd.DataFrame(\n", " {\n", - " \"id\": [0, 0, 0, 1, 1],\n", - " \"ra\": [10.0, 10.0, 10.0, 15.0, 15.0],\n", - " \"dec\": [0.0, 0.0, 0.0, -1.0, -1.0],\n", - " \"time\": [60676.0, 60677.0, 60678.0, 60675.0, 60676.5],\n", - " \"brightness\": [100.0, 101.0, 99.8, 5.0, 5.01],\n", + " \"id\": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],\n", + " \"ra\": [10.0, 10.0, 10.0, 15.0, 15.0, 15.0, 12.1, 12.1, 12.1, 12.1],\n", + " \"dec\": [0.0, 0.0, 0.0, -1.0, -1.0, -1.0, 0.5, 0.5, 0.5, 0.5],\n", + " \"time\": [60676.0, 60677.0, 60678.0, 60675.0, 60676.5, 60677.0, 60676.6, 60676.7, 60676.8, 60676.9],\n", + " \"brightness\": [100.0, 101.0, 99.8, 5.0, 5.01, 4.98, 20.1, 20.5, 20.3, 20.2],\n", " }\n", ")\n", "my_data_frame" @@ -63,9 +65,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Beyond astronomical data we might be interested in tracking a patients blood pressure over time, a measure of intensities at different wavelengths, or storing a list of the type of rock found at different depths of core samples. In each case it is possible to represent this data with multiple rows (one row for each patient + measurement pair) and associate them together by ids.\n", + "Note that we cannot cleanly compress this by adding more columns (such as t0, t1, and so forth), because the number of observations is not bounded and may vary from object to object.\n", + "\n", + "Beyond astronomical data, we might be interested in tracking patients' blood pressure over time, measuring the intensities of emitted light at different wavelengths, or storing a list of the types of rock found at different depths of core samples. In each case it is possible to represent this data with multiple rows (such as one row for each patient + measurement pair) and associate them together by ids.\n", "\n", - "In contrast, nested-pandas allows columns to represent nested data. 
We can have columns with the (single) value for the objects’ unvarying characteristics (location on the sky, patentient birth date, location of the core sample) and nested columns for the values of each observation.\n", + "Nested-pandas is designed for exactly this type of data by allowing columns to contain nested data. We can have regular columns with the (single) value for the objects’ unvarying characteristics (location on the sky, patient birth date, location of the core sample) and nested columns for the values of each observation.\n", "\n", "Let's see an example:" ] }, { @@ -76,10 +80,16 @@ "metadata": {}, "outputs": [], "source": [ - "from nested_pandas.datasets import generate_data\n", + "from nested_pandas.nestedframe import NestedFrame\n", "\n", - "# generate_data creates some toy data\n", - "nf = generate_data(10, 100) # 10 rows, 100 nested rows per row\n", + "# Create a nested data set\n", + "nf = NestedFrame.from_flat(\n", + " my_data_frame,\n", + " base_columns=[\"ra\", \"dec\"], # the columns not to nest\n", + " nested_columns=[\"time\", \"brightness\"], # the columns to nest\n", + " on=\"id\", # column used to associate rows\n", + " name=\"lightcurve\", # name of the nested column\n", + ")\n", "nf" ] }, @@ -89,11 +99,11 @@ "source": [ "The above dataframe is a `NestedFrame`, which extends the capabilities of the Pandas `DataFrame` to support columns with nested information. \n", "\n", - "In this example, we have the top level dataframe with 10 rows and 2 typical columns, \"a\" and \"b\". These represent object level attributes. The \"nested\" column contains a dataframe in each row that represents a time series (or observation level values). As we will see below, this allows easy access to the all of the observations for a given object.\n", + "We now have the top level dataframe with 3 rows, each of which corresponds to a single object. The table has three columns beyond \"id\". 
Two columns, \"ra\" and \"dec\", have a single value for the object (in this case the position on the sky). The last column \"lightcurve\" contains a nested table with a series of observation times and observation brightnesses for the object. As we will see below, this nested table allows the user to easily access all of the observations for a given object.\n", "\n", "## Accessing Nested Data\n", "\n", - "We can inspect the contents of the \"nested\" column using pandas API tooling like `loc`." + "We can inspect the contents of the \"lightcurve\" column using pandas API tooling like `loc`." ] }, { @@ -102,18 +112,18 @@ "metadata": {}, "outputs": [], "source": [ - "nf.loc[0][\"nested\"]" + "nf.loc[0][\"lightcurve\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Here we see that within the \"nested\" column there are `NestedFrame` objects with their own data. In this case we have 3 columns (\"t\", \"flux\", and \"band\") that represent a time series of observations. \n", + "Here we see that within the \"lightcurve\" column there are tables with their own data. In this case we have 2 columns (\"time\" and \"brightness\") that represent a time series of observations. \n", "\n", - "Note that `loc` itself accesses the row, so the combination of `nf.loc[0][\"nested\"]` means we are looking at value in the \"nested\" column for a single row (row 0). If we just use `nf.loc[0]` we would retrieve the entire row, including the nested colum and all other columns. Similarly if we use `nf[\"nested”]` we retrieve the nested column for all rows. What makes the nesting useful is that once we access the nested entry for a specific row, we can treat the value as a table in its own right.\n", + "Note that `loc` itself accesses the row, so the combination of `nf.loc[0][\"lightcurve\"]` means we are looking at the value in the \"lightcurve\" column for a single row (row 0). 
If we just use `nf.loc[0]` we would retrieve the entire row, including the nested \"lightcurve\" column and all other columns. Similarly, if we use `nf[\"lightcurve\"]` we retrieve the nested column for all rows. What makes the nesting useful is that once we access the nested entry for a specific row, we can treat the value as a table in its own right.\n", "\n", - "As in Pandas, we can still access individual entries from a column ased on the row index. Thus we can access the entry (table) in row 0 of the nested column as `nf[\"nested\"][0]` as well." + "As in Pandas, we can still access individual entries from a column based on the row index. Thus we can access the values (in a table) in row 0 of the nested column as `nf[\"lightcurve\"][0]` as well." ] }, { @@ -122,7 +132,7 @@ "metadata": {}, "outputs": [], "source": [ - "nf[\"nested\"][0]" + "nf[\"lightcurve\"][0]" ] }, { @@ -138,16 +148,16 @@ "metadata": {}, "outputs": [], "source": [ - "nf[\"nested.t\"]" + "nf[\"lightcurve.time\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Note that \"nested.t\" contains the t values for all rows, but preserves the nesting information. The id column of the returned data maps the top-level row (in `nf`) whether this value resides.\n", + "Note that \"lightcurve.time\" contains the time values for all rows, but also preserves the nesting information. The id column of the returned data maps each value to the top-level row (in `nf`) where it resides.\n", "\n", - "Similarly we can access the values for a given top-level row by index. To get all the `t` values for row 0 we could specify:" + "Similarly, we can access the values for a given top-level row by index. 
To get all the `time` values for row 0 we could specify:" ] }, { @@ -156,14 +166,16 @@ "metadata": {}, "outputs": [], "source": [ - "nf[\"nested.t\"][0]" + "nf[\"lightcurve.time\"][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Here the `[0]` is telling our nested frame to access the values of the series `nf[\"nested.t\"]` where the id = 0. If we try `nf[\"nested.t\"][0][0]` we again match id = 0 and return the same frame. So to access a single element within the series, we can use its location:" + "Here the `[0]` is telling our nested frame to access the values of the series `nf[\"lightcurve.time\"]` where the id = 0. If we try `nf[\"lightcurve.time\"][0][0]` we again match id = 0 and return the same series. \n", + "\n", + "To access a single element within the series, we need to use its location:" ] }, { @@ -172,7 +184,7 @@ "metadata": {}, "outputs": [], "source": [ - "nf[\"nested.t\"][0].iloc[0]" + "nf[\"lightcurve.time\"][0].iloc[0]" ] }, { @@ -220,7 +232,7 @@ "outputs": [], "source": [ "# Normal queries work as expected, rejecting rows from the dataframe that don't meet the criteria\n", - "nf.query(\"a > 0.2\")" + "nf.query(\"ra > 11.2\")" ] }, { @@ -236,8 +248,8 @@ "metadata": {}, "outputs": [], "source": [ - "# Applies the query to \"nested\", filtering based on \"t >17\"\n", - "nf_g = nf.query(\"nested.t > 17.0\")\n", + "# Applies the query to \"lightcurve\", filtering based on \"time > 60676.0\"\n", + "nf_g = nf.query(\"lightcurve.time > 60676.0\")\n", "nf_g" ] }, @@ -254,8 +266,8 @@ "metadata": {}, "outputs": [], "source": [ - "# All t <= 17.0 have been removed\n", - "nf_g.loc[0][\"nested\"]" + "# All time values <= 60676.0 have been removed\n", + "nf_g.loc[0][\"lightcurve\"]" ] }, { @@ -284,7 +296,7 @@ "\n", "# use hierarchical column names to access the flux column\n", "# passed as an array to np.mean\n", - "nf.reduce(np.mean, \"nested.flux\")" + "nf.reduce(np.mean, \"lightcurve.brightness\")" ] }, { @@ -308,7 +320,7 @@ "cell_type": "markdown", 
"metadata": {}, "source": [ - "Applying some inputs via reduce, we see how it sends inputs to a given function." + "Applying some inputs via reduce, we see how it sends inputs to a given function. The output frame `nf_inputs` consists of two columns containing the output of the “ra” column and the “lightcurve.time” column." ] }, { @@ -317,7 +329,7 @@ "metadata": {}, "outputs": [], "source": [ - "nf_inputs = nf.reduce(show_inputs, \"a\", \"nested.band\")\n", + "nf_inputs = nf.reduce(show_inputs, \"ra\", \"lightcurve.time\")\n", "nf_inputs" ] }, @@ -333,7 +345,7 @@ ], "metadata": { "kernelspec": { - "display_name": "tdastro", + "display_name": "nested", "language": "python", "name": "python3" }, From d76bebc6b983c9a8a06655c23a884f085eefec8b Mon Sep 17 00:00:00 2001 From: Jeremy Kubica <104161096+jeremykubica@users.noreply.github.com> Date: Wed, 8 Jan 2025 11:28:18 -0500 Subject: [PATCH 4/4] Extend the Data Loading Notebook --- docs/tutorials/data_loading_notebook.ipynb | 166 +++++++++++++++++++-- 1 file changed, 151 insertions(+), 15 deletions(-) diff --git a/docs/tutorials/data_loading_notebook.ipynb b/docs/tutorials/data_loading_notebook.ipynb index 8aa5e62..2e6a446 100644 --- a/docs/tutorials/data_loading_notebook.ipynb +++ b/docs/tutorials/data_loading_notebook.ipynb @@ -4,13 +4,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Loading Data into Nested-Pandas" + "# Loading Data into Nested-Pandas\n", + "\n", + "This notebook provides a brief introduction to loading data into nested-pandas or converting data into a nested structure. For an introduction to nested-pandas, see the quick start tutorial or the [readthedocs page](https://nested-pandas.readthedocs.io/en/latest/)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "## Installation and Imports\n", + "\n", "With a valid Python environment, nested-pandas and its dependencies are easy to install using the `pip` package manager. 
The following command can be used to install it:" ] }, @@ -29,25 +33,28 @@ "metadata": {}, "outputs": [], "source": [ - "from nested_pandas.datasets import generate_parquet_file\n", - "from nested_pandas import NestedFrame\n", - "from nested_pandas import read_parquet\n", - "\n", "import os\n", + "import tempfile\n", + "\n", "import pandas as pd\n", - "import tempfile" + "\n", + "from nested_pandas import NestedFrame, read_parquet\n", + "from nested_pandas.datasets import generate_parquet_file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Loading Data from Dictionaries\n", - "Nested-Pandas is tailored towards efficient analysis of nested datasets, and supports loading data from multiple sources.\n", + "# Overview\n", + "\n", + "Nested-pandas provides multiple mechanisms for loading data or converting data to the nested format. Below we walk through some of the common approaches.\n", "\n", - "We can use the `NestedFrame` constructor to create our base frame from a dictionary of our columns.\n", + "# Converting Flat Data\n", "\n", - "We can then create an addtional pandas dataframes and pack them into our `NestedFrame` with `NestedFrame.add_nested`." + "Commonly existing data sets will be provided in “flat” data structures such as dictionaries or Pandas DataFrames. In these cases the data consists of a rectangular table where each row represents an instance or observation. Multiple instances of the same top-level item are linked together through an ID. All rows with the same ID correspond to the same object/item.\n", + "\n", + "We define one such flat dataframe consisting of 10 rows for 3 distinct items." 
] }, { @@ -56,17 +63,146 @@ "metadata": {}, "outputs": [], "source": [ - "nf = NestedFrame(data={\"a\": [1, 2, 3], \"b\": [2, 4, 6]}, index=[0, 1, 2])\n", + "flat_df = pd.DataFrame(\n", + " data={\n", + " \"a\": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],\n", + " \"b\": [2, 2, 2, 4, 4, 4, 6, 6, 6, 6],\n", + " \"c\": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],\n", + " \"d\": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],\n", + " },\n", + " index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],\n", + ")\n", + "flat_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The index (the first column shown) provides the object ID. As we can see, there are three rows with ID=0, three rows with ID=1, and four rows with ID=2. Some of the values are constant for each item. For example, both columns “a” and “b” take a single value for each object. We are wasting space by repeating them in every row. Other values are different per row (columns “c” and “d”).\n", + "\n", + "As a concrete example, consider patient records. Each patient is assigned a unique id and has static data such as a date of birth. They also have measurements that are new with every trip to the doctor, such as blood pressure or temperature." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Converting from Flat Pandas\n", "\n", + "The easiest approach to converting the flat table above into a nested structure is to use `NestedFrame.from_flat()`. This function takes\n", + " * a list of columns that are not nested (base_columns)\n", + " * a list of columns to nest (nested_columns)\n", + " * the name of the nested column (name)\n", + "Rows are associated using the index by default, but a column name on which to join can also be provided."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nf = NestedFrame.from_flat(\n", + " flat_df,\n", + " base_columns=[\"a\", \"b\"], # the columns not to nest\n", + " nested_columns=[\"c\", \"d\"], # the columns to nest\n", + " name=\"nested\", # name of the nested column\n", + ")\n", + "nf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Inserting Nested Rows\n", + "\n", + "Alternatively, we can use the `NestedFrame` constructor to create our base frame from a dictionary of our columns (as we would do with a normal pandas DataFrame). This defines the top-level objects and the values that are constant across rows (\"a\" and \"b\")." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nf = NestedFrame(\n", + " data={\n", + " \"a\": [1, 2, 3],\n", + " \"b\": [2, 4, 6],\n", + " },\n", + " index=[0, 1, 2],\n", + ")\n", + "nf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can then create an additional pandas dataframe for the nested columns and pack it into our `NestedFrame` with the `NestedFrame.add_nested()` function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "nested = pd.DataFrame(\n", - " data={\"c\": [0, 2, 4, 1, 4, 3, 1, 4, 1], \"d\": [5, 4, 7, 5, 3, 1, 9, 3, 4]},\n", - " index=[0, 0, 0, 1, 1, 1, 2, 2, 2],\n", + " data={\n", + " \"c\": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],\n", + " \"d\": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],\n", + " },\n", + " index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],\n", ")\n", "\n", "nf = nf.add_nested(nested, \"nested\")\n", "nf" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The \"index\" parameter is used to perform the association. All of the values for index=0 are bundled together into a sub-table and stored in row 0's \"nested\" column."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nf.loc[0][\"nested\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We could add other nested columns by creating new sub-tables and adding them with `add_nested()`. Note that while the tables added with each `add_nested()` must be rectangular, they do not need to have the same dimensions between calls. We could add another nested column with a different number of observations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "nested = pd.DataFrame(\n", + " data={\n", + " \"c\": [0, 1, 0, 1, 2, 0],\n", + " \"d\": [5, 4, 5, 4, 3, 5],\n", + " },\n", + " index=[0, 0, 1, 1, 1, 2],\n", + ")\n", + "\n", + "nf = nf.add_nested(nested, \"nested2\")\n", + "nf" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -267,7 +403,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "nested", "language": "python", "name": "python3" }, @@ -281,7 +417,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.9" + "version": "3.10.4" } }, "nbformat": 4,