[docs] [external-assets] - Round 2 #17177

Merged
merged 5 commits into from
Oct 13, 2023
136 changes: 91 additions & 45 deletions docs/content/concepts/assets/external-assets.mdx
@@ -5,47 +5,42 @@ description: External assets model assets in Dagster that are not scheduled or m

# External Assets (Experimental)

An **external asset** is an asset that is not materialized by Dagster, but is tracked in the asset graph and asset catalog. This allows you to model assets in Dagster, attach metadata and events to those assets, but without scheduling their materialization with Dagster.
An **external asset** is an asset that is visible in Dagster but executed by an external process. For example, you have a process that loads data from Kafka into Amazon S3 every day. You want the S3 asset to be visible alongside your other data assets, but not triggered by Dagster.

**External assets are a good fit when data is**:
In this case, you could use an external asset to leverage Dagster's event log and tooling without using the orchestrator. This allows you to maintain data lineage, observability, and data quality without unnecessary migrations.

- Landed by an external source (e.g. an external file landing daily; Kafka landing data into Amazon S3)
- Created and processed using manual processes
- Materialized by existing pipelines with their own scheduling and infrastructure that you do not want to or need to migrate en masse
### What about Source Assets?

**With an external asset, you can:**
[Source Assets](/concepts/assets/software-defined-assets#defining-external-asset-dependencies) can be used to model data that's produced by a process Dagster doesn't control, such as a daily file drop into Amazon S3.

- Attach metadata to its definition for documentation, tracking ownership, and so on
- Track its data quality and version in Dagster
External assets can accomplish this, and more. As a result, Source Assets will be replaced with external assets in the near future.

---

## Uses and limitations

Using external assets, you can:

- Attach metadata to asset definitions for documentation, tracking ownership, and so on
- Track the assets' [data quality](/concepts/assets/asset-checks) and [version](/guides/dagster/asset-versioning-and-caching) in Dagster
- Use [asset sensors](/concepts/partitions-schedules-sensors/asset-sensors) or auto-materialize policies to update downstream assets based on updates to external assets

**You cannot, however:**

- Schedule an external asset's materialization
- Backfill an external asset using Dagster
- Use the [Dagster UI](/concepts/webserver/ui) or [GraphQL API](/concepts/webserver/graphql) to instigate ad hoc materializations

<Note>
<strong>What about Source Assets?</strong> A common use case for external
assets is modeling data produced by a process not under Dagster's control. For
example, a daily file drop from a third party into Amazon S3. In most systems,
these are described as <strong>sources</strong>. This includes Dagster, which
includes <PyObject object="SourceAsset" displayText="SourceAsset" />. As
external assets are a superset of Source Asset functionality,{" "}
<strong>
source assets will be supplanted by external assets in the near future
</strong>
.
</Note>
### Limitations

The following aren't currently supported when using external assets:

- Scheduling the execution of an external asset
- Backfilling an external asset using Dagster
- Using the [Dagster UI](/concepts/webserver/ui) or [GraphQL API](/concepts/webserver/graphql) to instigate ad hoc executions

---

## Relevant APIs

| Name | Description |
| ------------------------------------------------ | ------------------------------------------------------------------------------------------- |
| <PyObject object="external_assets_from_specs" /> | Create list of <PyObject object="AssetsDefinition"/> objects that represent external assets |
| <PyObject object="AssetSpec" /> | An object that represents the metadata of a particular asset |
| Name | Description |
| ------------------------------- | ---------------------------------------------------------------------------------------------- |
| `external_assets_from_specs` | Creates a list of <PyObject object="AssetsDefinition"/> objects that represent external assets |
| <PyObject object="AssetSpec" /> | An object that represents the metadata of a particular asset |

---

@@ -71,6 +66,8 @@ defs = Definitions(assets=[external_asset_from_spec(AssetSpec("file_in_s3"))])

Click the **Asset definition** tab to view how this asset is defined.

Note that the **Materialize** button is disabled, as external assets can't be executed by Dagster.

<Image
alt="The files_in_s3 external asset in the Asset Graph of the Dagster UI"
src="/images/concepts/assets/external-asset.png"
@@ -112,6 +109,8 @@ defs = Definitions(assets=external_assets_from_specs([raw_logs, processed_logs])

Click the **Asset definitions** tab to view how these assets are defined.

Note that the **Materialize** button is disabled, as external assets can't be executed by Dagster.

<Image
alt="External assets with dependencies in the Dagster UI"
src="/images/concepts/assets/external-assets-show-detail.png"
@@ -124,7 +123,7 @@ height={1654}
</TabItem>
</TabGroup>

### Fully-managed assets with external asset dependencies
### Dagster-native assets with external asset dependencies
Contributor Author:

Thoughts on this terminology?

Contributor:

i can't think of anything better atm. "Dagster-managed"? "Dagster-materialized"? Listing out the actual decorators "@asset, @multi_asset, and @graph_asset with external ..."?


Fully-managed assets can depend on external assets. In this example, the `aggregated_logs` asset depends on `processed_logs`, which is an external asset:

@@ -176,21 +175,44 @@ To keep your external assets updated, you can use any of the following approache

- [A REST API](#using-the-rest-api)
- [Sensors](#using-sensors)
- [Using the Python API](#using-the-python-api)
- [Logging events in ops](#logging-events-in-unrelated-ops)
- [A Python API](#using-the-python-api)
- [Logging events using ops](#logging-events-using-ops)

### Using the REST API

Dagster OSS exposes a REST endpoint for reporting asset materializations. Refer to the following tabs for examples using a `curl` command, and for invoking the API in Python.
Whether you're using Dagster OSS or Dagster Cloud, you can use a REST endpoint for reporting asset materializations. The API also has endpoints for reporting [asset observations](/concepts/assets/asset-observations) and [asset check evaluations](/concepts/assets/asset-checks).

Refer to the following tabs for examples using `curl` and Python to communicate with the API.

#### Using curl

<TabGroup>
<TabItem name="Using curl">
<TabItem name="Dagster Cloud">

##### Dagster Cloud

```bash
curl --request POST \
--url https://{organization}.dagster.cloud/{deployment}/report_asset_materialization/{asset_key} \
--header 'Content-Type: application/json' \
--header 'Dagster-Cloud-Api-Token: {token}' \
--data '{
"metadata" : {
"source": "From curl command"
}
}'
```

---

The following demonstrates how to use a `curl` command in a shell script to communicate with the API:
</TabItem>
<TabItem name="Dagster OSS">

##### Dagster OSS

```bash
curl --request POST \
--url https://path/to/instance/report_asset_materialization/{asset_key}\
--url https://{dagster_webserver_host}/report_asset_materialization/{asset_key} \
--header 'Content-Type: application/json' \
--data '{
"metadata" : {
@@ -199,26 +221,50 @@ curl --request POST \
}'
```

---

</TabItem>
</TabGroup>

#### Using Python

<TabGroup>
<TabItem name="Dagster Cloud">

##### Dagster Cloud

```python
import requests

url = f"https://{organization}.dagster.cloud/{deployment}/report_asset_materialization/{asset_key}"
payload = { "metadata": { "source": "From python script" } }
headers = { "Content-Type": "application/json", "Dagster-Cloud-Api-Token": "{token}" }

response = requests.request("POST", url, json=payload, headers=headers)
```

---

</TabItem>
<TabItem name="Using Python">
<TabItem name="Dagster OSS">

The following demonstrates how to invoke the API in Python using the `requests` library:
##### Dagster OSS

```python
import requests

url = f"https://path/to/instance/report_asset_materialization/{asset_key}"
url = f"https://{dagster_webserver_host}/report_asset_materialization/{asset_key}"
payload = { "metadata": { "source": "From python script" } }
headers = { "Content-Type": "application/json" }

response = requests.request("POST", url, json=payload, headers=headers)
```

---

</TabItem>
</TabGroup>
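
The Cloud and OSS examples above differ only in the base URL and the `Dagster-Cloud-Api-Token` header, so the two cases can share one helper. The following is a minimal sketch using only the Python standard library, not an official client. The function names `build_materialization_request` and `report_materialization` are hypothetical, and the sketch assumes the asset key is a single path component. The endpoint path and header name are taken from the examples above.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_materialization_request(
    base_url: str,
    asset_key: str,
    metadata: dict,
    cloud_api_token: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request that reports a materialization.

    base_url is "https://{dagster_webserver_host}" for Dagster OSS, or
    "https://{organization}.dagster.cloud/{deployment}" for Dagster Cloud.
    """
    # Assumption: asset_key is a single path component, so escape it fully.
    quoted_key = urllib.parse.quote(asset_key, safe="")
    url = f"{base_url.rstrip('/')}/report_asset_materialization/{quoted_key}"

    headers = {"Content-Type": "application/json"}
    if cloud_api_token is not None:
        # Only Dagster Cloud needs the token header (see the Cloud tab above).
        headers["Dagster-Cloud-Api-Token"] = cloud_api_token

    body = json.dumps({"metadata": metadata}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def report_materialization(
    request: urllib.request.Request, timeout: float = 10.0
) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so a failed
    report surfaces as an exception instead of being silently ignored.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Building the request separately from sending it keeps the URL and header logic easy to check without a running Dagster instance. To report an event, pass the built request to `report_materialization`.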

The API also has endpoints for reporting [asset observations](/concepts/assets/asset-observations) and [asset check evaluations](/concepts/assets/asset-checks).
erinkcochran87 marked this conversation as resolved.

### Using sensors

By using the `asset_events` parameter of <PyObject object="SensorResult" />, you can generate events to attach to external assets and then provide them directly to sensors. For example:
@@ -266,7 +312,7 @@ defs = Definitions(

You can insert events to attach to external assets directly from Dagster's Python API. Specifically, the API is `report_runless_asset_event` on <PyObject object="DagsterInstance" />.

For example, this would be useful when writing a hand-rolled Python script to backfill metadata:
For example, this would be useful when writing a Python script to backfill metadata:

```python file=/concepts/assets/external_assets/external_asset_events_using_python_api.py startafter=start_python_api_marker endbefore=end_python_api_marker dedent=4
from dagster import AssetMaterialization
@@ -279,9 +325,9 @@ instance.report_runless_asset_event(
)
```

### Logging events in unrelated ops
### Logging events using ops

You can log an <PyObject object="AssetMaterialization"/> from a bare op. In this case, use the `log_event` method of <PyObject object="OpExecutionContext"/> to report an asset materialization of an external asset. For example:
You can log an <PyObject object="AssetMaterialization"/> from an [op](/concepts/ops-jobs-graphs/ops). In this case, use the `log_event` method of <PyObject object="OpExecutionContext"/> to report an asset materialization of an external asset. For example:

```python file=/concepts/assets/external_assets/update_external_asset_via_op.py
from dagster import (