clean up sink documentation
lfunderburk committed Nov 13, 2024
1 parent 896d864 commit c329934
Showing 2 changed files with 39 additions and 63 deletions.
61 changes: 37 additions & 24 deletions src/bytewax/duckdb/__init__.py
@@ -17,6 +17,43 @@
- The `DuckDBSinkPartition` class manages the writing of data in batches
and executes custom SQL statements to create tables if specified.
When working with this sink in Bytewax, you can batch data and write it
to a target database or file in a structured way.
However, there’s one essential assumption you need to know: the sink expects
data in a specific tuple format, structured as:
```python
("key", List[Dict])
```
Where:

* `"key"`: The first element is a string identifier for the batch.
  Think of this as a “batch ID” that keeps track of which entries
  belong together. Every batch you send to the sink has a unique key.
* `List[Dict]`: The second element is a list of dictionaries. Each
  dictionary represents an individual data record, with each key-value
  pair representing a field and its corresponding value.
Together, the tuple tells the sink: “Here is a batch of data,
labeled with a specific key, and this batch contains multiple data entries.”
This format is designed to let the sink write data efficiently in batches,
rather than handling each entry one-by-one. By grouping data entries
together with an identifier, the sink can:
* Optimize Writing: Batching data reduces the frequency of writes to
the database or file, which can dramatically improve performance,
especially when processing high volumes of data.
* Ensure Atomicity: By writing a batch as a single unit,
we minimize the risk of partial writes, ensuring either the whole
batch is written or none at all. This is especially important for
maintaining data integrity.
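As a library-free illustration (not part of the sink's API), records can be grouped into the expected `("key", List[Dict])` tuples with a small helper; the `make_batches` name and the `"batch-N"` key scheme below are assumptions for the sketch, since any unique string key works:

```python
from typing import Dict, List, Tuple


def make_batches(
    records: List[Dict], batch_size: int
) -> List[Tuple[str, List[Dict]]]:
    """Group records into ("key", List[Dict]) tuples the sink expects.

    The "batch-N" key scheme is illustrative; any unique string works.
    """
    batches = []
    for i in range(0, len(records), batch_size):
        key = f"batch-{i // batch_size}"
        batches.append((key, records[i : i + batch_size]))
    return batches


records = [{"id": n, "value": n * 10} for n in range(5)]
batches = make_batches(records, batch_size=2)
# Yields two full batches of 2 records and a final batch of 1
```

Each resulting tuple is exactly the shape described above: a batch key paired with the list of record dictionaries it labels.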
Warning:
This module requires a commercial license for non-prototype use with
business data. Set the environment variable `BYTEWAX_LICENSE=1` to suppress
@@ -26,31 +63,7 @@
Python's logging library is used to log essential events, such as connection
status, batch operations, and any error messages.
**Sample usage**:
```python
from bytewax.duckdb import DuckDBSink
# Define the SQL statement to create the table if it doesn't exist
create_table_sql = (
    "CREATE TABLE IF NOT EXISTS my_table ("
    "id STRING, "
    "content STRING, "
    "timestamp TIMESTAMP)"
)
# Initialize the DuckDBSink with your database path and table details
duckdb_sink = DuckDBSink(
db_path="path/to/your/database.duckdb",
table_name="my_table",
create_table_sql=create_table_sql
)
# Alternatively, you can use a MotherDuck connection string with a token
duckdb_sink = DuckDBSink(
    db_path="md:your_database?motherduck_token=<your_token>",
table_name="my_table",
create_table_sql=create_table_sql
)
```
Note: For further examples and usage patterns, refer to the
[Bytewax DuckDB documentation](https://github.com/bytewax/bytewax-duckdb).
"""
41 changes: 2 additions & 39 deletions src/bytewax/duckdb/operators.py
@@ -3,43 +3,6 @@
This module provides operators for writing data to DuckDB or MotherDuck
using the Bytewax DuckDB sink.
When working with this sink in Bytewax, you can batch data and write it
to a target database or file in a structured way.
However, there’s one essential assumption you need to know: the sink expects
data in a specific tuple format, structured as:
```python
("key", List[Dict])
```
Where:

* `"key"`: The first element is a string identifier for the batch.
  Think of this as a “batch ID” that keeps track of which entries
  belong together. Every batch you send to the sink has a unique key.
* `List[Dict]`: The second element is a list of dictionaries. Each
  dictionary represents an individual data record, with each key-value
  pair representing a field and its corresponding value.
Together, the tuple tells the sink: “Here is a batch of data,
labeled with a specific key, and this batch contains multiple data entries.”
This format is designed to let the sink write data efficiently in batches,
rather than handling each entry one-by-one. By grouping data entries
together with an identifier, the sink can:
* Optimize Writing: Batching data reduces the frequency of writes to
  the database or file, which can dramatically improve performance,
  especially when processing high volumes of data.
* Ensure Atomicity: By writing a batch as a single unit,
  we minimize the risk of partial writes, ensuring either the whole
  batch is written or none at all. This is especially important for
  maintaining data integrity.
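Before records reach the sink, each one is typically paired with a string key so a downstream step can group records into the `("key", List[Dict])` batches described above. A minimal, library-free sketch of one possible keying scheme (the `key_record` helper and the modulo scheme are assumptions for illustration, not part of the sink's API):

```python
from typing import Dict, Tuple


def key_record(record: Dict, num_partitions: int) -> Tuple[str, Dict]:
    """Pair a record with a string key derived from its id.

    Records sharing a key are grouped into the same batch; taking
    `id % num_partitions` spreads records evenly across a fixed set
    of keys. The scheme is illustrative, not prescribed by the sink.
    """
    return (str(record["id"] % num_partitions), record)


keyed = [
    key_record({"id": n, "name": f"user-{n}"}, num_partitions=4)
    for n in range(6)
]
# Keys cycle through "0", "1", "2", "3", then wrap around
```

Any deterministic mapping from record to key works, as long as records that belong in the same batch receive the same key.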
Usage:
```
@@ -62,8 +25,8 @@ def create_dict(value: int) -> Tuple[str, Dict[str, Union[int, str]]]:
     name = random.choice(names)
     age = random.randint(20, 60)  # Random age between 20 and 60
     location = random.choice(locations)
-    # The following marks batch '1' to be written to the database
-    return ("1", {"id": value, "name": name, "age": age, "location": location})
+    # This tuple represents the key and the dictionary to be written
+    return (str(value), {"id": value, "name": name, "age": age, "location": location})

 # Generate input data
 inp = op.input("inp", flow, TestingSource(range(50)))
```
