refactor(pandas): remove the pandas backend #10112

Merged 4 commits on Sep 13, 2024
4 changes: 0 additions & 4 deletions .github/workflows/ibis-backends.yml
@@ -117,10 +117,6 @@ jobs:
             extras:
               - clickhouse
               - examples
-          - name: pandas
-            title: Pandas
-            extras:
-              - pandas
           - name: sqlite
             title: SQLite
             extras:
5 changes: 1 addition & 4 deletions docs/backends/_utils.py
@@ -17,10 +17,7 @@ def get_renderer(level: int) -> MdRenderer:

 @cache
 def get_backend(backend: str):
-    if backend == "pandas":
-        return get_object(f"ibis.backends.{backend}", "BasePandasBackend")
-    else:
-        return get_object(f"ibis.backends.{backend}", "Backend")
+    return get_object(f"ibis.backends.{backend}", "Backend")


 def get_callable(obj, name):
212 changes: 3 additions & 209 deletions docs/backends/pandas.qmd
@@ -1,213 +1,7 @@
# pandas

[https://pandas.pydata.org/](https://pandas.pydata.org/)

![](https://img.shields.io/badge/memtables-native-green?style=flat-square) ![](https://img.shields.io/badge/inputs-CSV | Parquet-blue?style=flat-square) ![](https://img.shields.io/badge/outputs-CSV | pandas | Parquet | PyArrow-orange?style=flat-square)

::: {.callout-warning}
## The Pandas backend is slated for removal in Ibis 10.0
We recommend using one of our other backends.

Many workloads work well on the DuckDB and Polars backends, for example.
:::


## Install

Install Ibis and dependencies for the pandas backend:

::: {.panel-tabset}

## `pip`

Install with the `pandas` extra:

```{.bash}
pip install 'ibis-framework[pandas]'
```

And connect:

```{.python}
import ibis

con = ibis.pandas.connect() # <1>
```

1. Adjust connection parameters as needed.

## `conda`

Install for pandas:

```{.bash}
conda install -c conda-forge ibis-pandas
```

And connect:

```{.python}
import ibis

con = ibis.pandas.connect() # <1>
```

1. Adjust connection parameters as needed.

## `mamba`

Install for pandas:

```{.bash}
mamba install -c conda-forge ibis-pandas
```

And connect:

```{.python}
import ibis

con = ibis.pandas.connect() # <1>
```

1. Adjust connection parameters as needed.
::: {.callout-note}
## The pandas backend was removed in Ibis version 10.0

See [our blog post](../posts/farewell-pandas/index.qmd) on the topic for more information.
:::



## User Defined functions (UDF)

Ibis supports defining three kinds of user-defined functions for operations on
expressions targeting the pandas backend: **element-wise**, **reduction**, and
**analytic**.

### Elementwise Functions

An **element-wise** function is a function that takes N rows as input and
produces N rows of output. `log`, `exp`, and `floor` are examples of
element-wise functions.

Here's how to define an element-wise function:

```python
import ibis.expr.datatypes as dt
from ibis.backends.pandas.udf import udf

@udf.elementwise(input_type=[dt.int64], output_type=dt.double)
def add_one(x):
    return x + 1.0
```

### Reduction Functions

A **reduction** is a function that takes N rows as input and produces 1 row
as output. `sum`, `mean` and `count` are examples of reductions. In
the context of a `GROUP BY`, reductions produce 1 row of output _per
group_.

Here's how to define a reduction function:

```python
import ibis.expr.datatypes as dt
from ibis.backends.pandas.udf import udf

@udf.reduction(input_type=[dt.double], output_type=dt.double)
def double_mean(series):
    return 2 * series.mean()
```

### Analytic Functions

An **analytic** function is like an **element-wise** function in that it takes
N rows as input and produces N rows of output. The key difference is that
analytic functions can be applied _per group_ using window functions. Z-score
is an example of an analytic function.

Here's how to define an analytic function:

```python
import ibis.expr.datatypes as dt
from ibis.backends.pandas.udf import udf

@udf.analytic(input_type=[dt.double], output_type=dt.double)
def zscore(series):
    return (series - series.mean()) / series.std()
```
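
To make the three shapes concrete without depending on Ibis or pandas, here is a minimal plain-Python sketch. The helper `apply_per_group` and the list-based inputs are hypothetical stand-ins for illustration only; they are not part of the Ibis API, and the pandas backend passed `pandas.Series` objects rather than lists.

```python
from collections import defaultdict
from statistics import mean, stdev


def add_one(values):
    """Element-wise: N inputs -> N outputs, each computed independently."""
    return [v + 1.0 for v in values]


def double_mean(values):
    """Reduction: N inputs -> 1 output."""
    return 2 * mean(values)


def zscore(values):
    """Analytic: N inputs -> N outputs, each depending on the whole input."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, like Series.std()
    return [(v - mu) / sigma for v in values]


def apply_per_group(func, keys, values):
    """Hypothetical helper: apply an analytic function within each group.

    Results are written back to the original row positions, so the
    output has the same length and order as the input.
    """
    positions = defaultdict(list)
    for i, key in enumerate(keys):
        positions[key].append(i)

    out = [None] * len(values)
    for idx in positions.values():
        results = func([values[i] for i in idx])
        for i, result in zip(idx, results):
            out[i] = result
    return out


keys = ["a", "b", "a", "b", "a"]
values = [1.0, 10.0, 2.0, 30.0, 3.0]
grouped = apply_per_group(zscore, keys, values)
```

Here `grouped` has one entry per input row, computed from that row's group only, which is the "per group" behaviour that distinguishes an analytic function from a plain element-wise one.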

### Details of pandas UDFs

- Element-wise functions support applying your UDF to any combination of
  scalar values and columns.
- Reductions support whole-column aggregations, grouped aggregations, and
  application of your function over a window.
- Analytic functions work in both grouped and non-grouped settings.
- The objects you receive as input arguments are either `pandas.Series` or
  Python/NumPy scalars.

::: {.callout-warning}
## Keyword arguments must be given a default

Any keyword arguments must be given a default value or the function **will
not work**.
:::

A common Python convention is to set the default value to `None` and
handle setting it to something not `None` in the body of the function.

Using `add_one` from above as an example, the following call will receive a
`pandas.Series` for the `x` argument:

```python
import ibis
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3]})
con = ibis.pandas.connect({'df': df})
t = con.table('df')
expr = add_one(t.a)
expr
```

And this will receive the `int` 1:

```python
expr = add_one(1)
expr
```

Since the pandas backend passes around `**kwargs` you can accept `**kwargs`
in your function:

```python
import ibis.expr.datatypes as dt
from ibis.backends.pandas.udf import udf

@udf.elementwise([dt.int64], dt.double)
def add_two(x, **kwargs):  # do stuff with kwargs
    return x + 2.0
```

Or you can leave them out as we did in the example above. You can also
optionally accept specific keyword arguments.

For example:

```python
import ibis.expr.datatypes as dt
from ibis.backends.pandas.udf import udf

@udf.elementwise([dt.int64], dt.double)
def add_two_with_none(x, y=None):
    if y is None:
        y = 2.0
    return x + y
```

```{python}
#| echo: false
BACKEND = "Pandas"
```

{{< include ./_templates/api.qmd >}}
2 changes: 1 addition & 1 deletion docs/backends_sankey.py
@@ -42,7 +42,7 @@ def to_greyish(hex_code, grey_value=128):
         "SQLite",
         "Trino",
     ],
-    list(category_colors.keys())[2]: ["Dask", "pandas", "Polars"],
+    list(category_colors.keys())[2]: ["Polars"],
 }

 nodes, links = [], []
22 changes: 2 additions & 20 deletions ibis/backends/conftest.py
@@ -4,15 +4,13 @@
 import importlib
 import importlib.metadata
 import itertools
-import operator
 from functools import cache
 from pathlib import Path
 from typing import TYPE_CHECKING, Any

 import _pytest
 import pytest
 from packaging.requirements import Requirement
-from packaging.version import parse as vparse

 import ibis
 from ibis import util
@@ -30,22 +28,6 @@
     from ibis.backends.tests.base import BackendTest


-def compare_versions(module_name, given_version, op):
-    try:
-        current_version = importlib.metadata.version(module_name)
-        return op(vparse(current_version), vparse(given_version))
-    except importlib.metadata.PackageNotFoundError:
-        return False
-
-
-def is_newer_than(module_name, given_version):
-    return compare_versions(module_name, given_version, operator.gt)
-
-
-def is_older_than(module_name, given_version):
-    return compare_versions(module_name, given_version, operator.lt)
-
-
 TEST_TABLES = {
     "functional_alltypes": ibis.schema(
         {
@@ -486,7 +468,7 @@ def _setup_backend(request, data_dir, tmp_path_factory, worker_id):


 @pytest.fixture(
-    params=_get_backends_to_test(discard=("pandas",)),
+    params=_get_backends_to_test(),
     scope="session",
 )
 def ddl_backend(request, data_dir, tmp_path_factory, worker_id):
@@ -501,7 +483,7 @@ def ddl_con(ddl_backend):


 @pytest.fixture(
-    params=_get_backends_to_test(keep=("pandas", "pyspark")),
+    params=_get_backends_to_test(keep=("pyspark",)),
     scope="session",
 )
 def udf_backend(request, data_dir, tmp_path_factory, worker_id):