[docs] Add table metadata docs (dagster-io#22291)
## Summary

Adds docs on how to add new tabular metadata to your assets.

## Test Plan

vercel + local docs
benpankow authored and danielgafni committed Jun 18, 2024
1 parent 96bfb4f commit a0a96aa
Showing 10 changed files with 215 additions and 3 deletions.
Binary file modified docs/content/api/modules.json.gz
Binary file not shown.
Binary file modified docs/content/api/searchindex.json.gz
Binary file not shown.
Binary file modified docs/content/api/sections.json.gz
Binary file not shown.
6 changes: 3 additions & 3 deletions docs/content/concepts/metadata-tags/asset-metadata.mdx
@@ -28,7 +28,7 @@ Dagster supports attaching a few different types of definition metadata:

- [**Arbitrary metadata**](#arbitrary-metadata-using-the-metadata-parameter), such as the storage location of the table produced by the asset
- [**Asset owners**](#asset-owners), which are the people and/or teams who own the asset
- [**Column-level lineage**](#column-level-lineage), which is information about how a column is created and used
- [**Table and column metadata**](#table-and-column-metadata), which provides additional context about a tabular asset, such as its schema or row count

### Arbitrary metadata using the metadata parameter

@@ -129,9 +129,9 @@ def topstories(context: AssetExecutionContext) -> MaterializeResult:
)
```

### Column-level lineage
### Table and column metadata

For assets that produce database tables, column-level lineage can be a powerful tool for improving collaboration and debugging issues. Column lineage enables data and analytics engineers alike to understand how a column is created and used in your data platform. Refer to the [Column-level lineage documentation](/concepts/metadata-tags/asset-metadata/column-level-lineage) for more information.
For assets that produce database tables, you can attach table metadata to provide additional context about the asset. Table metadata can include information such as the schema, row count, or column lineage. Refer to the [Table metadata documentation](/concepts/metadata-tags/asset-metadata/table-metadata) for more information, or to the [Column-level lineage documentation](/concepts/metadata-tags/asset-metadata/column-level-lineage) for details specific to column-level lineage.

---

160 changes: 160 additions & 0 deletions docs/content/concepts/metadata-tags/asset-metadata/table-metadata.mdx
@@ -0,0 +1,160 @@
---
title: "Table metadata | Dagster Docs"
description: "Table metadata can be used to provide additional context about a tabular asset, such as its schema, row count, and more."
---

# Table metadata

Table metadata provides additional context about a tabular asset, such as its schema, row count, and more. This metadata can be used to improve collaboration, debugging, and data quality in your data platform.

Dagster supports attaching different types of table metadata to assets, including:

- [**Column schema**](#attaching-column-schema), which describes the structure of the table, including column names and types
- [**Row count**](#attaching-row-count), which describes the number of rows in a materialized table
- [**Column-level lineage**](#attaching-column-level-lineage), which describes how a column is created and used by other assets

---

## Attaching column schema

### For assets defined in Dagster

Column schema metadata can be attached to Dagster assets either as [definition metadata](/concepts/metadata-tags/asset-metadata#attaching-definition-metadata) or [materialization metadata](/concepts/metadata-tags/asset-metadata#attaching-materialization-metadata), which will then be visible in the Dagster UI. For example:

<Image
  alt="Column schema for an asset in the Dagster UI"
  src="/images/concepts/metadata-tags/metadata-table-schema.png"
  width={1793}
  height={652}
/>

If the schema of your asset is pre-defined, you can attach it as definition metadata. If the schema is only known when an asset is materialized, you can attach it as metadata to the materialization.

To attach schema metadata to an asset, you will need to:

1. Construct a <PyObject object="TableSchema"/> object with <PyObject object="TableColumn" /> entries describing each column in the table
2. Attach the `TableSchema` object to the asset as part of the `metadata` parameter under the `dagster/column_schema` key. This can be attached to your asset definition, or to the <PyObject object="MaterializeResult" /> object returned by the asset function.

Below are two examples of how to attach column schema metadata to an asset, one as definition metadata and one as materialization metadata:

```python file=/concepts/metadata-tags/asset_column_schema.py
from dagster import AssetKey, MaterializeResult, TableColumn, TableSchema, asset


# Definition metadata
# Here, we know the schema of the asset, so we can attach it to the asset decorator
@asset(
    deps=[AssetKey("source_bar"), AssetKey("source_baz")],
    metadata={
        "dagster/column_schema": TableSchema(
            columns=[
                TableColumn(
                    "name",
                    "string",
                    description="The name of the person",
                ),
                TableColumn(
                    "age",
                    "int",
                    description="The age of the person",
                ),
            ]
        )
    },
)
def my_asset(): ...


# Materialization metadata
# Here, the schema isn't known until runtime
@asset(deps=[AssetKey("source_bar"), AssetKey("source_baz")])
def my_other_asset():
    column_names = ...
    column_types = ...

    columns = [
        TableColumn(name, column_type)
        for name, column_type in zip(column_names, column_types)
    ]

    yield MaterializeResult(
        metadata={"dagster/column_schema": TableSchema(columns=columns)}
    )
```

The schema for `my_asset` will be visible in the Dagster UI.
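
The materialization example above leaves `column_names` and `column_types` as placeholders. As a minimal sketch of one way to fill them in, the snippet below infers `(name, type)` pairs from in-memory rows using only the standard library. The `infer_columns` helper, the `TYPE_NAMES` mapping, and the sample rows are all hypothetical and are not part of Dagster; the resulting pairs could then be passed to `TableColumn(name, column_type)` as shown above.

```python
from __future__ import annotations

from datetime import date, datetime
from typing import Any

# Hypothetical mapping from Python types to type-name strings.
# The names are illustrative, not a set that Dagster requires.
TYPE_NAMES = {
    bool: "bool",
    int: "int",
    float: "float",
    str: "string",
    datetime: "timestamp",
    date: "date",
}


def infer_columns(rows: list[dict[str, Any]]) -> list[tuple[str, str]]:
    """Return (column_name, column_type) pairs in first-seen column order."""
    inferred: dict[str, str] = {}
    for row in rows:
        for name, value in row.items():
            if value is None:
                # A null carries no type information; only record the column.
                inferred.setdefault(name, "unknown")
                continue
            # Keep the first concrete type seen for each column.
            if inferred.get(name, "unknown") == "unknown":
                inferred[name] = TYPE_NAMES.get(type(value), "unknown")
    return list(inferred.items())


# Hypothetical sample data standing in for the table produced at runtime.
rows = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": None},
]
columns = infer_columns(rows)
column_names = [name for name, _ in columns]
column_types = [column_type for _, column_type in columns]
```

This sketch keeps the first concrete type it sees for each column, so a column with mixed value types would need a stricter rule.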

### For assets loaded from integrations

Dagster's dbt integration enables automatically attaching column schema metadata to assets loaded from dbt models. Refer to the [dbt documentation](/integrations/dbt/reference#customizing-metadata) for more information.

---

## Attaching row count

Row count metadata can be attached to Dagster assets as [materialization metadata](/concepts/metadata-tags/asset-metadata#attaching-materialization-metadata) to provide additional context about the number of rows in a materialized table. This will be highlighted in the Dagster UI. For example:

<Image
  alt="Row count for an asset in the Dagster UI"
  src="/images/concepts/metadata-tags/metadata-row-count.png"
  width={1921}
  height={559}
/>

In addition to showing the latest row count, Dagster will let you track changes in the row count over time, and you can use this information to monitor data quality.

To attach row count metadata to an asset, you will need to attach a numerical value to the `dagster/row_count` key in the metadata parameter of the <PyObject object="MaterializeResult" /> object returned by the asset function. For example:

```python file=/concepts/metadata-tags/asset_row_count.py
import pandas as pd

from dagster import AssetKey, MaterializeResult, asset


@asset(deps=[AssetKey("source_bar"), AssetKey("source_baz")])
def my_asset():
    my_df: pd.DataFrame = ...

    yield MaterializeResult(metadata={"dagster/row_count": 374})
```
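
The example above hard-codes the row count. In practice the value would usually be computed from the data being written; with a pandas DataFrame, `len(my_df)` gives the number of rows. As a minimal standard-library sketch, the snippet below counts the data rows in CSV content and builds the metadata dictionary. The `row_count_metadata` helper and the sample data are hypothetical; the returned dictionary is the kind of value that would be passed as `metadata` to `MaterializeResult`.

```python
from __future__ import annotations

import csv
import io
from typing import TextIO


def row_count_metadata(csv_file: TextIO) -> dict[str, int]:
    """Count data rows (excluding the header) under the dagster/row_count key."""
    reader = csv.reader(csv_file)
    next(reader, None)  # Skip the header row, if there is one.
    # Ignore blank lines, which csv.reader yields as empty lists.
    return {"dagster/row_count": sum(1 for row in reader if row)}


# Hypothetical sample standing in for the table the asset writes.
sample = io.StringIO("name,age\nAda,36\nGrace,45\n")
metadata = row_count_metadata(sample)
```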

---

## Attaching column-level lineage

Column lineage enables data and analytics engineers alike to understand how a column is created and used in your data platform. Refer to the [Column-level lineage documentation](/concepts/metadata-tags/asset-metadata/column-level-lineage) for more information.

---

## APIs in this guide

| Name | Description |
| -------------------------------------------- | ---------------------------------------------------------------- |
| <PyObject object="asset" decorator /> | A decorator used to define assets. |
| <PyObject object="MaterializeResult" /> | An object representing a successful materialization of an asset. |
| <PyObject object="TableSchema" /> | An object representing the schema of a tabular asset. |
| <PyObject object="TableColumn" /> | Class that defines column information for a tabular asset. |
| <PyObject object="TableColumnConstraints" /> | Class that defines constraints for a column in a tabular asset. |

---

## Related

<ArticleList>
  <ArticleListItem
    title="Asset metadata"
    href="/concepts/metadata-tags/asset-metadata"
  ></ArticleListItem>
  <ArticleListItem
    title="Column-level lineage"
    href="/concepts/metadata-tags/asset-metadata/column-level-lineage"
  ></ArticleListItem>
  <ArticleListItem
    title="Metadata & tags"
    href="/concepts/metadata-tags"
  ></ArticleListItem>
  <ArticleListItem
    title="Asset definitions"
    href="/concepts/assets/software-defined-assets"
  ></ArticleListItem>
</ArticleList>
Binary file modified docs/next/public/objects.inv
Binary file not shown.
@@ -0,0 +1,42 @@
from dagster import AssetKey, MaterializeResult, TableColumn, TableSchema, asset


# Definition metadata
# Here, we know the schema of the asset, so we can attach it to the asset decorator
@asset(
    deps=[AssetKey("source_bar"), AssetKey("source_baz")],
    metadata={
        "dagster/column_schema": TableSchema(
            columns=[
                TableColumn(
                    "name",
                    "string",
                    description="The name of the person",
                ),
                TableColumn(
                    "age",
                    "int",
                    description="The age of the person",
                ),
            ]
        )
    },
)
def my_asset(): ...


# Materialization metadata
# Here, the schema isn't known until runtime
@asset(deps=[AssetKey("source_bar"), AssetKey("source_baz")])
def my_other_asset():
    column_names = ...
    column_types = ...

    columns = [
        TableColumn(name, column_type)
        for name, column_type in zip(column_names, column_types)
    ]

    yield MaterializeResult(
        metadata={"dagster/column_schema": TableSchema(columns=columns)}
    )
@@ -0,0 +1,10 @@
import pandas as pd

from dagster import AssetKey, MaterializeResult, asset


@asset(deps=[AssetKey("source_bar"), AssetKey("source_baz")])
def my_asset():
    my_df: pd.DataFrame = ...

    yield MaterializeResult(metadata={"dagster/row_count": 374})
