From 9c8aea1f64385eeb768b2b2428dbb4c225a7128a Mon Sep 17 00:00:00 2001
From: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
Date: Sun, 15 Sep 2024 07:14:22 -0400
Subject: [PATCH] chore(deps): remove the `pandas` extra (#10132)

---
 README.md | 2 +-
 .../index/execute-results/html.json | 9 +++++----
 docs/concepts/internals.qmd | 6 +++---
 docs/images/backends.png | Bin 204805 -> 159946 bytes
 .../ffill-and-bfill-using-ibis/index.qmd | 2 +-
 ibis/backends/tests/test_temporal.py | 1 -
 ibis/formats/pandas.py | 2 +-
 poetry.lock | 3 +--
 pyproject.toml | 17 +----------------
 9 files changed, 13 insertions(+), 29 deletions(-)

diff --git a/README.md b/README.md
index 6c5059631be7..fa536947a76d 100644
--- a/README.md
+++ b/README.md
@@ -166,7 +166,7 @@ Ibis broadly supports two types of backend:
 1. SQL-generating backends
 2. DataFrame-generating backends
 
-![Ibis backend types](https://raw.githubusercontent.com/ibis-project/ibis/main/docs/images/backends.png)
+![Ibis backend types](./docs/images/backends.png)
 
 ## Portability
 
diff --git a/docs/_freeze/posts/ffill-and-bfill-using-ibis/index/execute-results/html.json b/docs/_freeze/posts/ffill-and-bfill-using-ibis/index/execute-results/html.json
index e8a7b927c455..91a95d4dfa71 100644
--- a/docs/_freeze/posts/ffill-and-bfill-using-ibis/index/execute-results/html.json
+++ b/docs/_freeze/posts/ffill-and-bfill-using-ibis/index/execute-results/html.json
@@ -1,14 +1,15 @@
 {
-  "hash": "3f1e224f86b8b1f15c31f1c1ad1c99aa",
+  "hash": "36b4c01081d4c1bed8c73fcb8a9fa67c",
   "result": {
-    "markdown": "---\ntitle: \"`ffill` and `bfill` using Ibis\"\nauthor: Patrick Clarke\ndate: 2022-09-09\ncategories:\n - blog\n - window functions\n - time series\n---\n\nSuppose you have a table of data mapping events and dates to values, and that this data contains gaps in values.\n\nSuppose you want to forward fill these gaps such that, one-by-one,\nif a value is null, it is replaced by the non-null value preceding.\n\nFor example, you might be measuring 
the total value of an account over time.\nSaving the same value until that value changes is an inefficient use of space,\nso you might only measure the value during certain events,\nlike a change in ownership or value.\n\nIn that case, to view the value of the account by day, you might want to interpolate dates\nand then ffill or bfill value to show the account value over time by date.\n\nDate interpolation will be covered in a different guide,\nbut if you already have the dates then you can fill in some values.\n\nThis was heavily inspired by Gil Forsyth's writeup on ffill and bfill on the\n[Ibis GitHub Wiki](https://github.com/ibis-project/ibis/wiki/ffill-and-bfill-using-window-functions).\n\n### Setup\n\nFirst, we want to make some mock data.\nTo demonstrate this technique in a non-pandas backend, we will use the DuckDB backend.\n\nOur data will have measurements by date, and these measurements will be grouped by an event id.\nWe will then save this data to `data.parquet` so we can register that parquet file as a table in our DuckDB connector.\n\n::: {#5331b1af .cell execution_count=1}\n``` {.python .cell-code}\nfrom datetime import date\n\nimport numpy as np\nimport pandas as pd\n\nimport ibis\n\n\ndf = pd.DataFrame(\n {\n \"event_id\": [0] * 2 + [1] * 3 + [2] * 5 + [3] * 2,\n \"measured_on\": map(\n date,\n [2021] * 12, [6] * 4 + [5] * 6 + [7] * 2,\n range(1, 13),\n ),\n \"measurement\": np.nan,\n }\n)\n\ndf.head()\n```\n\n::: {.cell-output .cell-output-display execution_count=1}\n```{=html}\n
\n | event_id | \nmeasured_on | \nmeasurement | \n
---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n
1 | \n0 | \n2021-06-02 | \nNaN | \n
2 | \n1 | \n2021-06-03 | \nNaN | \n
3 | \n1 | \n2021-06-04 | \nNaN | \n
4 | \n1 | \n2021-05-05 | \nNaN | \n
\n | event_id | \nmeasured_on | \nmeasurement | \n
---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n
1 | \n0 | \n2021-06-02 | \n5.0 | \n
2 | \n1 | \n2021-06-03 | \nNaN | \n
3 | \n1 | \n2021-06-04 | \nNaN | \n
4 | \n1 | \n2021-05-05 | \n42.0 | \n
5 | \n2 | \n2021-05-06 | \n42.0 | \n
6 | \n2 | \n2021-05-07 | \nNaN | \n
7 | \n2 | \n2021-05-08 | \n11.0 | \n
8 | \n2 | \n2021-05-09 | \nNaN | \n
9 | \n2 | \n2021-05-10 | \nNaN | \n
10 | \n3 | \n2021-07-11 | \nNaN | \n
11 | \n3 | \n2021-07-12 | \nNaN | \n
DatabaseTable: data\n event_id int64\n measured_on date\n measurement float64\n\n```\n:::\n:::\n\n\n### `ffill` Strategy\n\nTo better understand how we can forward-fill our gaps, let's take a minute to explain the strategy and then look at\nthe manual result.\n\nWe will partition our data by event groups and then sort those groups by date.\n\nOur logic for forward fill is then: let `j` be an event group sorted by date and let `i` be a date within `j`.\nIf `i` is the first date in `j`, then continue.\nIf `i` is not the first date in `j`, then if `measurement` in `i` is null then replace it with `measurement` for `i-1`.\nOtherwise, do nothing.\n\nLet's take a look at what this means for the first few rows of our data:\n\n```\n event_id measured_on measurement\n0 0 2021-06-01 NaN # Since this is the first row of the event group (group 0), do nothing\n1 0 2021-06-02 5.0 # Since this is not the first row of the group and is not null: do nothing\n4 1 2021-05-05 42.0 # This is the first row of the event group (group 1): do nothing\n2 1 2021-06-03 NaN # This is not the first row and is null: replace it (NaN → 42.0)\n3 1 2021-06-04 NaN # This is not the first row and is null: replace it (NaN → 42.0)\n5 2 2021-05-06 42.0 # This is the first row of the event group (group 2): do nothing\n6 2 2021-05-07 NaN # This is not the first row and is null: replace it (NaN → 42.0)\n7 2 2021-05-08 11.0 # This is not the first row and is not null: do nothing\n8 2 2021-05-09 NaN # This is not the first row and is null: replace it (NaN → 11.0)\n9 2 2021-05-10 NaN # This is not the first row and is null: replace it (NaN → 11.0)\n10 3 2021-07-11 NaN # This is the first row of the event group (group 3): do nothing\n11 3 2021-07-12 NaN # This is not the first row and is null: replace it (NaN → NaN)\n```\n\nOur result for forward fill should look like this:\n\n::: {#d25676e9 .cell execution_count=5}\n\n::: {.cell-output .cell-output-display execution_count=5}\n```{=html}\n
\n | event_id | \nmeasured_on | \nmeasurement | \n
---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n
1 | \n0 | \n2021-06-02 | \n5.0 | \n
2 | \n1 | \n2021-06-03 | \n5.0 | \n
3 | \n1 | \n2021-06-04 | \n5.0 | \n
4 | \n1 | \n2021-05-05 | \n42.0 | \n
5 | \n2 | \n2021-05-06 | \n42.0 | \n
6 | \n2 | \n2021-05-07 | \n42.0 | \n
7 | \n2 | \n2021-05-08 | \n11.0 | \n
8 | \n2 | \n2021-05-09 | \n11.0 | \n
9 | \n2 | \n2021-05-10 | \n11.0 | \n
10 | \n3 | \n2021-07-11 | \n11.0 | \n
11 | \n3 | \n2021-07-12 | \n11.0 | \n
\n | event_id | \nmeasured_on | \nmeasurement | \ngrouper | \n
---|---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n0 | \n
1 | \n0 | \n2021-06-02 | \n5.0 | \n1 | \n
7 | \n1 | \n2021-05-05 | \n42.0 | \n1 | \n
8 | \n1 | \n2021-06-03 | \nNaN | \n1 | \n
9 | \n1 | \n2021-06-04 | \nNaN | \n1 | \n
2 | \n2 | \n2021-05-06 | \n42.0 | \n1 | \n
3 | \n2 | \n2021-05-07 | \nNaN | \n1 | \n
4 | \n2 | \n2021-05-08 | \n11.0 | \n2 | \n
5 | \n2 | \n2021-05-09 | \nNaN | \n2 | \n
6 | \n2 | \n2021-05-10 | \nNaN | \n2 | \n
10 | \n3 | \n2021-07-11 | \nNaN | \n0 | \n
11 | \n3 | \n2021-07-12 | \nNaN | \n0 | \n
\n | event_id | \nmeasured_on | \nmeasurement | \ngrouper | \nffill | \n
---|---|---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n0 | \nNaN | \n
1 | \n0 | \n2021-06-02 | \n5.0 | \n1 | \n5.0 | \n
2 | \n1 | \n2021-05-05 | \n42.0 | \n1 | \n42.0 | \n
3 | \n1 | \n2021-06-03 | \nNaN | \n1 | \n42.0 | \n
4 | \n1 | \n2021-06-04 | \nNaN | \n1 | \n42.0 | \n
5 | \n2 | \n2021-05-06 | \n42.0 | \n1 | \n42.0 | \n
6 | \n2 | \n2021-05-07 | \nNaN | \n1 | \n42.0 | \n
7 | \n2 | \n2021-05-08 | \n11.0 | \n2 | \n11.0 | \n
8 | \n2 | \n2021-05-09 | \nNaN | \n2 | \n11.0 | \n
9 | \n2 | \n2021-05-10 | \nNaN | \n2 | \n11.0 | \n
10 | \n3 | \n2021-07-11 | \nNaN | \n0 | \nNaN | \n
11 | \n3 | \n2021-07-12 | \nNaN | \n0 | \nNaN | \n
\n | event_id | \nmeasured_on | \nmeasurement | \ngrouper | \n
---|---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n1 | \n
1 | \n0 | \n2021-06-02 | \n5.0 | \n1 | \n
2 | \n1 | \n2021-05-05 | \n42.0 | \n1 | \n
3 | \n1 | \n2021-06-03 | \nNaN | \n0 | \n
4 | \n1 | \n2021-06-04 | \nNaN | \n0 | \n
5 | \n2 | \n2021-05-06 | \n42.0 | \n2 | \n
6 | \n2 | \n2021-05-07 | \nNaN | \n1 | \n
7 | \n2 | \n2021-05-08 | \n11.0 | \n1 | \n
8 | \n2 | \n2021-05-09 | \nNaN | \n0 | \n
9 | \n2 | \n2021-05-10 | \nNaN | \n0 | \n
10 | \n3 | \n2021-07-11 | \nNaN | \n0 | \n
11 | \n3 | \n2021-07-12 | \nNaN | \n0 | \n
\n | event_id | \nmeasured_on | \nmeasurement | \ngrouper | \nbfill | \n
---|---|---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n1 | \n5.0 | \n
1 | \n0 | \n2021-06-02 | \n5.0 | \n1 | \n5.0 | \n
2 | \n1 | \n2021-05-05 | \n42.0 | \n1 | \n42.0 | \n
3 | \n1 | \n2021-06-03 | \nNaN | \n0 | \nNaN | \n
4 | \n1 | \n2021-06-04 | \nNaN | \n0 | \nNaN | \n
5 | \n2 | \n2021-05-06 | \n42.0 | \n2 | \n42.0 | \n
6 | \n2 | \n2021-05-07 | \nNaN | \n1 | \n11.0 | \n
7 | \n2 | \n2021-05-08 | \n11.0 | \n1 | \n11.0 | \n
8 | \n2 | \n2021-05-09 | \nNaN | \n0 | \nNaN | \n
9 | \n2 | \n2021-05-10 | \nNaN | \n0 | \nNaN | \n
10 | \n3 | \n2021-07-11 | \nNaN | \n0 | \nNaN | \n
11 | \n3 | \n2021-07-12 | \nNaN | \n0 | \nNaN | \n
\n | event_id | \nmeasured_on | \nmeasurement | \ngrouper | \nbfill | \n
---|---|---|---|---|---|
10 | \n1 | \n2021-05-05 | \n42.0 | \n4 | \n42.0 | \n
11 | \n2 | \n2021-05-06 | \n42.0 | \n3 | \n42.0 | \n
5 | \n2 | \n2021-05-07 | \nNaN | \n2 | \n11.0 | \n
4 | \n2 | \n2021-05-08 | \n11.0 | \n2 | \n11.0 | \n
9 | \n2 | \n2021-05-09 | \nNaN | \n1 | \n5.0 | \n
8 | \n2 | \n2021-05-10 | \nNaN | \n1 | \n5.0 | \n
7 | \n0 | \n2021-06-01 | \nNaN | \n1 | \n5.0 | \n
6 | \n0 | \n2021-06-02 | \n5.0 | \n1 | \n5.0 | \n
3 | \n1 | \n2021-06-03 | \nNaN | \n0 | \nNaN | \n
2 | \n1 | \n2021-06-04 | \nNaN | \n0 | \nNaN | \n
1 | \n3 | \n2021-07-11 | \nNaN | \n0 | \nNaN | \n
0 | \n3 | \n2021-07-12 | \nNaN | \n0 | \nNaN | \n
\n | event_id | \nmeasured_on | \nmeasurement | \n
---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n
1 | \n0 | \n2021-06-02 | \nNaN | \n
2 | \n1 | \n2021-06-03 | \nNaN | \n
3 | \n1 | \n2021-06-04 | \nNaN | \n
4 | \n1 | \n2021-05-05 | \nNaN | \n
\n | event_id | \nmeasured_on | \nmeasurement | \n
---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n
1 | \n0 | \n2021-06-02 | \n5.0 | \n
2 | \n1 | \n2021-06-03 | \nNaN | \n
3 | \n1 | \n2021-06-04 | \nNaN | \n
4 | \n1 | \n2021-05-05 | \n42.0 | \n
5 | \n2 | \n2021-05-06 | \n42.0 | \n
6 | \n2 | \n2021-05-07 | \nNaN | \n
7 | \n2 | \n2021-05-08 | \n11.0 | \n
8 | \n2 | \n2021-05-09 | \nNaN | \n
9 | \n2 | \n2021-05-10 | \nNaN | \n
10 | \n3 | \n2021-07-11 | \nNaN | \n
11 | \n3 | \n2021-07-12 | \nNaN | \n
DatabaseTable: data\n event_id int64\n measured_on date\n measurement float64\n\n```\n:::\n:::\n\n\n### `ffill` Strategy\n\nTo better understand how we can forward-fill our gaps, let's take a minute to explain the strategy and then look at\nthe manual result.\n\nWe will partition our data by event groups and then sort those groups by date.\n\nOur logic for forward fill is then: let `j` be an event group sorted by date and let `i` be a date within `j`.\nIf `i` is the first date in `j`, then continue.\nIf `i` is not the first date in `j`, then if `measurement` in `i` is null then replace it with `measurement` for `i-1`.\nOtherwise, do nothing.\n\nLet's take a look at what this means for the first few rows of our data:\n\n```\n event_id measured_on measurement\n0 0 2021-06-01 NaN # Since this is the first row of the event group (group 0), do nothing\n1 0 2021-06-02 5.0 # Since this is not the first row of the group and is not null: do nothing\n4 1 2021-05-05 42.0 # This is the first row of the event group (group 1): do nothing\n2 1 2021-06-03 NaN # This is not the first row and is null: replace it (NaN → 42.0)\n3 1 2021-06-04 NaN # This is not the first row and is null: replace it (NaN → 42.0)\n5 2 2021-05-06 42.0 # This is the first row of the event group (group 2): do nothing\n6 2 2021-05-07 NaN # This is not the first row and is null: replace it (NaN → 42.0)\n7 2 2021-05-08 11.0 # This is not the first row and is not null: do nothing\n8 2 2021-05-09 NaN # This is not the first row and is null: replace it (NaN → 11.0)\n9 2 2021-05-10 NaN # This is not the first row and is null: replace it (NaN → 11.0)\n10 3 2021-07-11 NaN # This is the first row of the event group (group 3): do nothing\n11 3 2021-07-12 NaN # This is not the first row and is null: replace it (NaN → NaN)\n```\n\nOur result for forward fill should look like this:\n\n::: {#96c70bb9 .cell execution_count=5}\n\n::: {.cell-output .cell-output-display execution_count=5}\n```{=html}\n
\n | event_id | \nmeasured_on | \nmeasurement | \n
---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n
1 | \n0 | \n2021-06-02 | \n5.0 | \n
2 | \n1 | \n2021-06-03 | \n5.0 | \n
3 | \n1 | \n2021-06-04 | \n5.0 | \n
4 | \n1 | \n2021-05-05 | \n42.0 | \n
5 | \n2 | \n2021-05-06 | \n42.0 | \n
6 | \n2 | \n2021-05-07 | \n42.0 | \n
7 | \n2 | \n2021-05-08 | \n11.0 | \n
8 | \n2 | \n2021-05-09 | \n11.0 | \n
9 | \n2 | \n2021-05-10 | \n11.0 | \n
10 | \n3 | \n2021-07-11 | \n11.0 | \n
11 | \n3 | \n2021-07-12 | \n11.0 | \n
\n | event_id | \nmeasured_on | \nmeasurement | \ngrouper | \n
---|---|---|---|---|
10 | \n0 | \n2021-06-01 | \nNaN | \n0 | \n
11 | \n0 | \n2021-06-02 | \n5.0 | \n1 | \n
7 | \n1 | \n2021-05-05 | \n42.0 | \n1 | \n
8 | \n1 | \n2021-06-03 | \nNaN | \n1 | \n
9 | \n1 | \n2021-06-04 | \nNaN | \n1 | \n
0 | \n2 | \n2021-05-06 | \n42.0 | \n1 | \n
1 | \n2 | \n2021-05-07 | \nNaN | \n1 | \n
2 | \n2 | \n2021-05-08 | \n11.0 | \n2 | \n
3 | \n2 | \n2021-05-09 | \nNaN | \n2 | \n
4 | \n2 | \n2021-05-10 | \nNaN | \n2 | \n
5 | \n3 | \n2021-07-11 | \nNaN | \n0 | \n
6 | \n3 | \n2021-07-12 | \nNaN | \n0 | \n
\n | event_id | \nmeasured_on | \nmeasurement | \ngrouper | \nffill | \n
---|---|---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n0 | \nNaN | \n
1 | \n0 | \n2021-06-02 | \n5.0 | \n1 | \n5.0 | \n
2 | \n1 | \n2021-05-05 | \n42.0 | \n1 | \n42.0 | \n
3 | \n1 | \n2021-06-03 | \nNaN | \n1 | \n42.0 | \n
4 | \n1 | \n2021-06-04 | \nNaN | \n1 | \n42.0 | \n
5 | \n2 | \n2021-05-06 | \n42.0 | \n1 | \n42.0 | \n
6 | \n2 | \n2021-05-07 | \nNaN | \n1 | \n42.0 | \n
7 | \n2 | \n2021-05-08 | \n11.0 | \n2 | \n11.0 | \n
8 | \n2 | \n2021-05-09 | \nNaN | \n2 | \n11.0 | \n
9 | \n2 | \n2021-05-10 | \nNaN | \n2 | \n11.0 | \n
10 | \n3 | \n2021-07-11 | \nNaN | \n0 | \nNaN | \n
11 | \n3 | \n2021-07-12 | \nNaN | \n0 | \nNaN | \n
\n | event_id | \nmeasured_on | \nmeasurement | \ngrouper | \n
---|---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n1 | \n
1 | \n0 | \n2021-06-02 | \n5.0 | \n1 | \n
2 | \n1 | \n2021-05-05 | \n42.0 | \n1 | \n
3 | \n1 | \n2021-06-03 | \nNaN | \n0 | \n
4 | \n1 | \n2021-06-04 | \nNaN | \n0 | \n
5 | \n2 | \n2021-05-06 | \n42.0 | \n2 | \n
6 | \n2 | \n2021-05-07 | \nNaN | \n1 | \n
7 | \n2 | \n2021-05-08 | \n11.0 | \n1 | \n
8 | \n2 | \n2021-05-09 | \nNaN | \n0 | \n
9 | \n2 | \n2021-05-10 | \nNaN | \n0 | \n
10 | \n3 | \n2021-07-11 | \nNaN | \n0 | \n
11 | \n3 | \n2021-07-12 | \nNaN | \n0 | \n
\n | event_id | \nmeasured_on | \nmeasurement | \ngrouper | \nbfill | \n
---|---|---|---|---|---|
0 | \n0 | \n2021-06-01 | \nNaN | \n1 | \n5.0 | \n
1 | \n0 | \n2021-06-02 | \n5.0 | \n1 | \n5.0 | \n
2 | \n1 | \n2021-05-05 | \n42.0 | \n1 | \n42.0 | \n
3 | \n1 | \n2021-06-03 | \nNaN | \n0 | \nNaN | \n
4 | \n1 | \n2021-06-04 | \nNaN | \n0 | \nNaN | \n
5 | \n2 | \n2021-05-06 | \n42.0 | \n2 | \n42.0 | \n
6 | \n2 | \n2021-05-07 | \nNaN | \n1 | \n11.0 | \n
7 | \n2 | \n2021-05-08 | \n11.0 | \n1 | \n11.0 | \n
8 | \n2 | \n2021-05-09 | \nNaN | \n0 | \nNaN | \n
9 | \n2 | \n2021-05-10 | \nNaN | \n0 | \nNaN | \n
10 | \n3 | \n2021-07-11 | \nNaN | \n0 | \nNaN | \n
11 | \n3 | \n2021-07-12 | \nNaN | \n0 | \nNaN | \n
\n | event_id | \nmeasured_on | \nmeasurement | \ngrouper | \nbfill | \n
---|---|---|---|---|---|
1 | \n1 | \n2021-05-05 | \n42.0 | \n4 | \n42.0 | \n
0 | \n2 | \n2021-05-06 | \n42.0 | \n3 | \n42.0 | \n
3 | \n2 | \n2021-05-07 | \nNaN | \n2 | \n11.0 | \n
2 | \n2 | \n2021-05-08 | \n11.0 | \n2 | \n11.0 | \n
7 | \n2 | \n2021-05-09 | \nNaN | \n1 | \n5.0 | \n
6 | \n2 | \n2021-05-10 | \nNaN | \n1 | \n5.0 | \n
5 | \n0 | \n2021-06-01 | \nNaN | \n1 | \n5.0 | \n
4 | \n0 | \n2021-06-02 | \n5.0 | \n1 | \n5.0 | \n
11 | \n1 | \n2021-06-03 | \nNaN | \n0 | \nNaN | \n
10 | \n1 | \n2021-06-04 | \nNaN | \n0 | \nNaN | \n
9 | \n3 | \n2021-07-11 | \nNaN | \n0 | \nNaN | \n
8 | \n3 | \n2021-07-12 | \nNaN | \n0 | \nNaN | \n