address pr comments
PedramNavid committed Oct 2, 2023
1 parent 1fd4ba3 commit 0204dd4
Showing 1 changed file with 54 additions and 45 deletions.
99 changes: 54 additions & 45 deletions docs/content/integrations/embedded-elt.mdx
@@ -1,16 +1,18 @@
---
title: "Dagster Embedded ELT"
description: Lightweight ELT with Dagster
description: Lightweight ELT framework for building ELT pipelines with Dagster, through helpful pre-built assets and resources
---

# Dagster Embedded ELT

This package provides a framework for building ELT pipelines with Dagster through helpful pre-built assets and resources. It is currently in experimental development, and we'd love to hear your feedback.

This package currently includes a single implementation using <a href="https://slingdata.io">Sling</a> provides a simple way to sync data between databases and file systems.
This package currently includes a single implementation using <a href="https://slingdata.io">Sling</a>, which provides a simple way to sync data between databases and file systems.

We plan on adding additional embedded ELT tool integrations in the future.

---

## Relevant APIs

| Name | Description |
@@ -20,78 +22,52 @@ We plan on adding additional embedded ELT tool integrations in the future.

---

## Getting Started

To get started with `dagster-embedded-elt` and Sling, first familiarize yourself with <a href="https://docs.slingdata.io/sling-cli/running-tasks">Sling's documentation</a>

First, setup a <PyObject module="dagster-embedded-elt.sling" object="SlingResource" /> which defines the source and destination credentials. The <PyObject module="dagster-embedded-elt.sling" object="SlingResource" /> requires both a <PyObject module="dagster-embedded-elt.sling" object="SlingSourceConnection" /> and a <PyObject module="dagster-embedded-elt.sling" object="SlingTargetConnection" />.
## Getting started

You can provide either a connection string or a dictionary of connection parameters to each of these classes.
To get started with `dagster-embedded-elt` and Sling, familiarize yourself with <a href="https://docs.slingdata.io/sling-cli/running-tasks">Sling</a> by reading its documentation, which describes how sources and targets are configured.

For details on what connection parameters Sling accepts for each integration, see the <a href="https://docs.slingdata.io/connections/database-connections">Sling Connections</a> page.
The typical pattern for building an ELT pipeline with Sling has three steps:

To create an Asset that syncs between two connections, you can use the <PyObject module="dagster-embedded-elt.sling" object="build_sling_asset" /> factory.
1. First, create a <PyObject module="dagster-embedded-elt.sling" object="SlingResource" />, which is a container for the source and the target.
2. In the <PyObject module="dagster-embedded-elt.sling" object="SlingResource" />, define both a <PyObject module="dagster-embedded-elt.sling" object="SlingSourceConnection" /> and a <PyObject module="dagster-embedded-elt.sling" object="SlingTargetConnection" />, which hold the source and target credentials that Sling will use to sync data.
3. Finally, create an asset that syncs between two connections. You can use the <PyObject module="dagster-embedded-elt.sling" object="build_sling_asset" /> factory for most use cases.

---

## Setting up a Sling Resource
## Step 1: Setting up a Sling resource

A Sling Resource is a Dagster Resource that contains references to both a Source Connection and a Target Connection. Sling is versatile in what a source or destination can represent. You can provide arbitrary keywords to the <PyObject module="dagster-embedded-elt.sling" object="SlingSourceConnection" /> and <PyObject module="dagster-embedded-elt.sling" object="SlingTargetConnection" /> classes.
A Sling resource is a Dagster resource that contains references to both a source connection and a target connection. Sling is versatile in what a source or destination can represent. You can provide arbitrary keywords to the <PyObject module="dagster-embedded-elt.sling" object="SlingSourceConnection" /> and <PyObject module="dagster-embedded-elt.sling" object="SlingTargetConnection" /> classes.

The types and parameters for each connection are defined by <a href="https://docs.slingdata.io/connections/database-connections">Sling's Connections</a>.
The types and parameters for each connection are defined by [Sling's connections](https://docs.slingdata.io/connections/database-connections).

The simplest connection is a file connection, which can be defined as:

```python
from dagster_embedded_elt.sling import SlingResource, SlingSourceConnection
source = SlingSourceConnection(type="file")
sling = SlingResource(source_connection=source, ...)
```

Note that no path is required, as that is provided by the asset itself.
Note that no path is required in the source connection, as that is provided by the asset itself.

```python
from dagster import AssetSpec
from dagster_embedded_elt.sling import build_sling_asset

asset_def = build_sling_asset(
    asset_spec=AssetSpec("my_file"),
    source_stream=f"file://{path_to_file}",
    ...
)

```
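
Because the path is supplied through the asset's `source_stream`, it can help to build that value from a real filesystem path. The following is a minimal sketch, not part of the package: `to_file_stream` is a hypothetical helper, and it assumes that an absolute local path prefixed with `file://` (the same pattern as the f-string above) is what you want to pass.

```python
from pathlib import Path


def to_file_stream(path: str) -> str:
    """Hypothetical helper: turn a local path into a file:// stream string."""
    # resolve() makes the path absolute, so the resulting stream does not
    # depend on the current working directory of the Dagster process.
    absolute = Path(path).resolve()
    return f"file://{absolute}"


# Example value only; "data/my_file.csv" is a placeholder path.
stream = to_file_stream("data/my_file.csv")
```

The resulting string could then be used as the `source_stream` argument in place of the hand-written f-string.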

For database connections, you can provide a connection string or a dictionary of keyword arguments. For example, to connect to a SQLite database, you can provide an path to the database using the `instance` keyword, which is specified on the <a href="https://docs.slingdata.io/connections/database-connections/sqlite">Sqlite Connection</a> page.
For database connections, you can provide a connection string or a dictionary of keyword arguments. For example, to connect to a SQLite database, you can provide a path to the database using the `instance` keyword, which is specified in [Sling's SQLite connection](https://docs.slingdata.io/connections/database-connections/sqlite) documentation.

```python
source = SlingSourceConnection(type="sqlite", instance="path/to/sqlite.db")
```
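
To try the SQLite connection locally you need a database file to point `instance` at. The sketch below uses only Python's standard library to create one. The file location, the `users` table, and its rows are assumptions made up for illustration and are not part of Sling or Dagster.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical location for a throwaway database file.
db_path = Path(tempfile.mkdtemp()) / "sqlite.db"

conn = sqlite3.connect(db_path)
with conn:  # commits the transaction if the block succeeds
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (?)",
        [("ada",), ("grace",)],
    )
conn.close()

# This string is the kind of value you would pass as instance=...
instance = str(db_path)
```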

Here are some additional examples of database connections:

```python
source = SlingSourceConnection(
    type="postgres", host="localhost", port=5432, database="my_database",
    user="my_user", password=EnvVar("PG_PASS")
)

source = SlingSourceConnection(
    type="snowflake", host="hostname.snowflake", user="username",
    database="database", password=EnvVar("SF_PASSWORD"), role="role"
)
```

Similarly, you can define file/storage connections:

```python
source = SlingSourceConnection(
    type="s3", bucket="sling-bucket",
    access_key_id=EnvVar("AWS_ACCESS_KEY_ID"),
    secret_access_key=EnvVar("AWS_SECRET_ACCESS_KEY")
)
```
---

## Creating a Sling Sync
## Step 2: Creating a Sling sync

To create a Sling Sync, once you have defined your Resource, you can use the <PyObject module="dagster_embedded_elt.sling" object="build_sling_asset" /> factory to create an Asset.
Once you have defined your resource, you can create a Sling sync by using the <PyObject module="dagster_embedded_elt.sling" object="build_sling_asset" /> factory to create an asset.

```python

@@ -124,3 +100,36 @@ sling_job = build_assets_job(
)

```

---

## Examples

This is an example of how to set up a Sling sync between Postgres and Snowflake:

```python
import os
from dagster_embedded_elt.sling import SlingResource, SlingSourceConnection, SlingTargetConnection

source = SlingSourceConnection(
    type="postgres", host="localhost", port=5432, database="my_database",
    user="my_user", password=os.getenv("PG_PASS")
)

target = SlingTargetConnection(
    type="snowflake", host="hostname.snowflake", user="username",
    database="database", password=os.getenv("SF_PASSWORD"), role="role"
)

sling = SlingResource(source_connection=source, target_connection=target)
```

Similarly, you can define file/storage connections:

```python
source = SlingSourceConnection(
    type="s3", bucket="sling-bucket",
    access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY")
)
```
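
Note that `os.getenv` returns `None` when a variable is not set, so a missing credential in the examples above would only surface later, when the connection is used. One possible way to fail earlier is sketched below. `require_env` is a hypothetical helper written for this illustration and is not provided by Dagster or Sling.

```python
import os


def require_env(name: str) -> str:
    """Hypothetical helper: return an environment variable or fail loudly."""
    value = os.getenv(name)
    # Treat both an unset variable (None) and an empty string as missing.
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With a helper like this, `password=os.getenv("PG_PASS")` could be written as `password=require_env("PG_PASS")`, which raises as soon as the connection object is defined if the variable is missing.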
