clean up sink documentation
lfunderburk committed Nov 13, 2024
1 parent 896d864 commit c329934
Showing 2 changed files with 39 additions and 63 deletions.
61 changes: 37 additions & 24 deletions src/bytewax/duckdb/__init__.py
@@ -17,6 +17,43 @@
- The `DuckDBSinkPartition` class manages the writing of data in batches
and executes custom SQL statements to create tables if specified.
When working with this sink in Bytewax, you can batch data and write it
to a target database or file in a structured way.
However, there’s one essential assumption you need to know: the sink expects
data in a specific tuple format, structured as:
```python
("key", List[Dict])
```
Where:

* `"key"`: The first element is a string identifier for the batch.
  Think of this as a “batch ID” that keeps track of which entries
  belong together. Every batch you send to the sink has a unique key.
* `List[Dict]`: The second element is a list of dictionaries. Each
  dictionary represents an individual data record, with each key-value
  pair representing a field and its corresponding value.
Together, the tuple tells the sink: “Here is a batch of data,
labeled with a specific key, and this batch contains multiple data entries.”
This format is designed to let the sink write data efficiently in batches,
rather than handling each entry one-by-one. By grouping data entries
together with an identifier, the sink can:
* Optimize Writing: Batching data reduces the frequency of writes to
the database or file, which can dramatically improve performance,
especially when processing high volumes of data.
* Ensure Atomicity: By writing a batch as a single unit,
we minimize the risk of partial writes, ensuring either the whole
batch is written or none at all. This is especially important for
maintaining data integrity.
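As a library-free illustration (not part of the sink's API), records can be grouped into the expected `("key", List[Dict])` tuples with a small helper; the `make_batches` name and the `"batch-N"` key scheme below are assumptions for the sketch, since any unique string key works:

```python
from typing import Dict, List, Tuple


def make_batches(
    records: List[Dict], batch_size: int
) -> List[Tuple[str, List[Dict]]]:
    """Group records into ("key", List[Dict]) tuples the sink expects.

    The "batch-N" key scheme is illustrative; any unique string works.
    """
    batches = []
    for i in range(0, len(records), batch_size):
        key = f"batch-{i // batch_size}"
        batches.append((key, records[i : i + batch_size]))
    return batches


records = [{"id": n, "value": n * 10} for n in range(5)]
batches = make_batches(records, batch_size=2)
# Yields two full batches of 2 records and a final batch of 1
```

Each resulting tuple is exactly the shape described above: a batch key paired with the list of record dictionaries it labels.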
Warning:
This module requires a commercial license for non-prototype use with
business data. Set the environment variable `BYTEWAX_LICENSE=1` to suppress
@@ -26,31 +63,7 @@
Python's logging library is used to log essential events, such as connection
status, batch operations, and any error messages.
**Sample usage**:
```python
from bytewax.duckdb import DuckDBSink
# Define the SQL statement to create the table if it doesn't exist
create_table_sql = (
    "CREATE TABLE IF NOT EXISTS my_table ("
    "id STRING, "
    "content STRING, "
    "timestamp TIMESTAMP)"
)
# Initialize the DuckDBSink with your database path and table details
duckdb_sink = DuckDBSink(
db_path="path/to/your/database.duckdb",
table_name="my_table",
create_table_sql=create_table_sql
)
# Alternatively, you can use a MotherDuck connection string with a token
duckdb_sink = DuckDBSink(
    db_path="md:your_database?motherduck_token=<your_token>",
table_name="my_table",
create_table_sql=create_table_sql
)
```
Note: For further examples and usage patterns, refer to the
[Bytewax DuckDB documentation](https://github.com/bytewax/bytewax-duckdb).
"""
41 changes: 2 additions & 39 deletions src/bytewax/duckdb/operators.py
@@ -3,43 +3,6 @@
This module provides operators for writing data to DuckDB or MotherDuck
using the Bytewax DuckDB sink.
When working with this sink in Bytewax, you can batch data and write it
to a target database or file in a structured way.
However, there’s one essential assumption you need to know: the sink expects
data in a specific tuple format, structured as:
```python
("key", List[Dict])
```
Where:

* `"key"`: The first element is a string identifier for the batch.
  Think of this as a “batch ID” that keeps track of which entries
  belong together. Every batch you send to the sink has a unique key.
* `List[Dict]`: The second element is a list of dictionaries. Each
  dictionary represents an individual data record, with each key-value
  pair representing a field and its corresponding value.
Together, the tuple tells the sink: “Here is a batch of data,
labeled with a specific key, and this batch contains multiple data entries.”
This format is designed to let the sink write data efficiently in batches,
rather than handling each entry one-by-one. By grouping data entries
together with an identifier, the sink can:
* Optimize Writing: Batching data reduces the frequency of writes to
  the database or file, which can dramatically improve performance,
  especially when processing high volumes of data.
* Ensure Atomicity: By writing a batch as a single unit,
  we minimize the risk of partial writes, ensuring either the whole
  batch is written or none at all. This is especially important for
  maintaining data integrity.
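Before records reach the sink, each one is typically paired with a string key so a downstream step can group records into the `("key", List[Dict])` batches described above. A minimal, library-free sketch of one possible keying scheme (the `key_record` helper and the modulo scheme are assumptions for illustration, not part of the sink's API):

```python
from typing import Dict, Tuple


def key_record(record: Dict, num_partitions: int) -> Tuple[str, Dict]:
    """Pair a record with a string key derived from its id.

    Records sharing a key are grouped into the same batch; taking
    `id % num_partitions` spreads records evenly across a fixed set
    of keys. The scheme is illustrative, not prescribed by the sink.
    """
    return (str(record["id"] % num_partitions), record)


keyed = [
    key_record({"id": n, "name": f"user-{n}"}, num_partitions=4)
    for n in range(6)
]
# Keys cycle through "0", "1", "2", "3", then wrap around
```

Any deterministic mapping from record to key works, as long as records that belong in the same batch receive the same key.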
Usage:
```
@@ -62,8 +25,8 @@ def create_dict(value: int) -> Tuple[str, Dict[str, Union[int, str]]]:
     name = random.choice(names)
     age = random.randint(20, 60)  # Random age between 20 and 60
     location = random.choice(locations)
-    # The following marks batch '1' to be written to the database
-    return ("1", {"id": value, "name": name, "age": age, "location": location})
+    # This tuple represents the key and the dictionary to be written
+    return (str(value), {"id": value, "name": name, "age": age, "location": location})

 # Generate input data
 inp = op.input("inp", flow, TestingSource(range(50)))
```
