Small getting started guide on writes
Fokko committed Jan 27, 2024
1 parent cd7fb50 commit 6c7d6e3
Showing 3 changed files with 138 additions and 28 deletions.
2 changes: 1 addition & 1 deletion mkdocs/docs/SUMMARY.md
@@ -17,7 +17,7 @@

<!-- prettier-ignore-start -->

- [Getting started](index.md)
- [Configuration](configuration.md)
- [CLI](cli.md)
- [API](api.md)
16 changes: 16 additions & 0 deletions mkdocs/docs/contributing.md
@@ -58,6 +58,22 @@ For IDEA ≤2021 you need to install the [Poetry integration as a plugin](https:

Now you're all set with Poetry: the tests will run through Poetry, and you'll have syntax highlighting in `pyproject.toml` to indicate stale dependencies.
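
As a quick check that the setup works — a minimal sketch, assuming the project's pytest-based suite lives under `tests/`:

```sh
# Install the dependencies into Poetry's virtualenv,
# then run the test suite inside it.
poetry install
poetry run pytest tests/
```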

## Installation from source

Clone the repository for local development:

```sh
git clone https://github.com/apache/iceberg-python.git
cd iceberg-python
pip3 install -e ".[s3fs,hive]"
```

You can also install it directly from GitHub (not recommended, but sometimes handy):

```sh
pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"
```
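
If you need a particular branch or tag rather than the tip of the default branch, pip's `@ref` syntax works here too (`main` below is just an illustrative ref):

```sh
pip install "git+https://github.com/apache/iceberg-python.git@main#egg=pyiceberg[s3fs]"
```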

## Linting

`pre-commit` is used for autoformatting and linting:
148 changes: 121 additions & 27 deletions mkdocs/docs/index.md
@@ -20,11 +20,11 @@
- limitations under the License.
-->

# Getting started with PyIceberg

PyIceberg is a Python implementation for accessing Iceberg tables, without the need for a JVM.

## Installation

Before installing PyIceberg, make sure that you're on an up-to-date version of `pip`:
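
```shell
pip install --upgrade pip
```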

@@ -38,36 +38,130 @@ You can install the latest release version from PyPI:
pip install "pyiceberg[s3fs,hive]"
```

You can mix and match optional dependencies depending on your needs:

| Key          | Description                                                           |
|--------------|-----------------------------------------------------------------------|
| hive | Support for the Hive metastore |
| glue | Support for AWS Glue |
| dynamodb | Support for AWS DynamoDB |
| sql-postgres | Support for SQL Catalog backed by Postgresql |
| sql-sqlite | Support for SQL Catalog backed by SQLite |
| pyarrow | PyArrow as a FileIO implementation to interact with the object store |
| pandas | Installs both PyArrow and Pandas |
| duckdb | Installs both PyArrow and DuckDB |
| ray | Installs PyArrow, Pandas, and Ray |
| s3fs | S3FS as a FileIO implementation to interact with the object store |
| adlfs | ADLFS as a FileIO implementation to interact with the object store |
| snappy | Support for snappy Avro compression |
| gcs | GCS as the FileIO implementation to interact with the object store |

You need to install at least one of `s3fs`, `adlfs`, `gcs`, or `pyarrow` to be able to fetch files from an object store.
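
For example, to pair the AWS Glue catalog with S3 access via `s3fs` (any combination from the table works the same way):

```shell
pip install "pyiceberg[glue,s3fs]"
```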

## Connecting to a catalog

Iceberg leverages the [catalog to have one centralized place to organize the tables](https://iceberg.apache.org/catalog/). This can be a traditional Hive catalog to store your Iceberg tables next to the rest, a vendor solution like the AWS Glue catalog, or an implementation of Iceberg's own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Check out the [configuration](configuration.md) page to find all the configuration details.
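
As a sketch of what connecting looks like in code — the catalog type and URI below are placeholders rather than values this guide prescribes; the configuration page covers the real properties for each catalog type:

```python
from pyiceberg.catalog import load_catalog

# Placeholder connection details for a REST catalog;
# swap in the properties your catalog actually needs.
catalog = load_catalog(
    "default",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",
    },
)
```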

## Write a PyArrow dataframe

Let's take the Taxi dataset and write it to an Iceberg table.

First download one month of data:

```shell
curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet
```

Load it into your PyArrow dataframe:

```python
import pyarrow.parquet as pq

df = pq.read_table('/tmp/yellow_tripdata_2023-01.parquet')
```
pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"

Create a new Iceberg table:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog('default')

table = catalog.create_table(
    "default.taxi_dataset",
    schema=df.schema,  # Blocked by https://github.com/apache/iceberg-python/pull/305
)
```

Append the dataframe to the table:

```python
table.append(df)
len(table.scan().to_arrow())
```

3066766 rows have been written to the table.

Now generate a tip-per-mile feature to train a model on:

```python
import pyarrow.compute as pc

df = df.append_column(
    "tip_per_mile",
    pc.divide(df['tip_amount'], df['trip_distance'])
)
```

Evolve the schema of the table with the new column:

```python
from pyiceberg.catalog import Catalog

with table.update_schema() as update_schema:
    # Blocked by https://github.com/apache/iceberg-python/pull/305
    update_schema.union_by_name(Catalog._convert_schema_if_needed(df.schema))
```

And now we can write the new dataframe to the Iceberg table:

```python
table.overwrite(df)
print(table.scan().to_arrow())
```

And the new column is there:

```
taxi_dataset(
1: VendorID: optional long,
2: tpep_pickup_datetime: optional timestamp,
3: tpep_dropoff_datetime: optional timestamp,
4: passenger_count: optional double,
5: trip_distance: optional double,
6: RatecodeID: optional double,
7: store_and_fwd_flag: optional string,
8: PULocationID: optional long,
9: DOLocationID: optional long,
10: payment_type: optional long,
11: fare_amount: optional double,
12: extra: optional double,
13: mta_tax: optional double,
14: tip_amount: optional double,
15: tolls_amount: optional double,
16: improvement_surcharge: optional double,
17: total_amount: optional double,
18: congestion_surcharge: optional double,
19: airport_fee: optional double,
20: tip_per_mile: optional double
),
```

And we can see that 2371784 rows have a tip-per-mile:

```python
df = table.scan(row_filter='tip_per_mile > 0').to_arrow()
len(df)
```
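
If you also installed the `pandas` extra, the Arrow table from the scan converts straight into a pandas DataFrame for a quick look (just a convenience, reusing the names from the example above):

```python
# pyarrow.Table.to_pandas() requires pandas to be installed.
df.to_pandas()["tip_per_mile"].describe()
```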

## More details

For the details, please check the [CLI](cli.md) or [Python API](api.md) page.
