The data_profiler package is based on dbt_profiler from dbt Hub. Profiling a table with more than 150 fields can be done more efficiently with the data_profiler package.

data_profiler uses dbt macros to profile database relations and table schemas (`schema.yml`).
For each column in a relation, the calculated profile includes the following measures:

- `database`: Database name
- `schema`: Schema name
- `table_name`: Table name
- `column_name`: Name of the column
- `data_type`: Data type of the column
- `row_count`: Row count of the column's table
- `not_null_count`: Count of column values that are not `NULL`
- `not_null_percentage`: Percentage of column values that are not `NULL` (e.g., `0.62` means that 62% of the values are populated while 38% are `NULL`)
- `null_count`: Count of column values that are `NULL`
- `null_percentage`: Percentage of column values that are `NULL` (e.g., `0.55` means that 55% of the values are `NULL` while 45% are populated)
- `distinct_percentage`: Percentage of unique column values (e.g., `1` means that 100% of the values are unique)
- `distinct_count`: Count of unique column values
- `is_unique`: True if all column values are unique
- `min`: Minimum column value
- `max`: Maximum column value
- `avg`: Average column value
- `profiled_at`: Date and time (UTC time zone) of the profiling
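To make the percentage and uniqueness measures concrete, here is a minimal sketch (not part of the package) of the kind of per-column aggregates such a profiler computes, using Python's built-in sqlite3. The table `orders` and column `o_clerk` are hypothetical sample data.

```python
import sqlite3

# Hypothetical sample data: a column with a NULL and a duplicate value.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (o_clerk TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?)",
    [("clerk_1",), ("clerk_2",), ("clerk_1",), (None,)],
)

# Per-column aggregates analogous to the measures listed above.
row = con.execute(
    """
    SELECT
        COUNT(*)                                   AS row_count,
        COUNT(o_clerk)                             AS not_null_count,
        COUNT(*) - COUNT(o_clerk)                  AS null_count,
        ROUND(COUNT(o_clerk) * 1.0 / COUNT(*), 2)  AS not_null_percentage,
        COUNT(DISTINCT o_clerk)                    AS distinct_count,
        COUNT(DISTINCT o_clerk) = COUNT(*)         AS is_unique
    FROM orders
    """
).fetchone()

print(row)  # (4, 3, 1, 0.75, 2, 0)
```

Note that `COUNT(column)` skips `NULL`s while `COUNT(*)` does not, which is what makes the null/not-null split fall out of a single aggregate query.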
✅ Snowflake
✅ Postgres
data_profiler macro (source)
This macro returns a relation profile as a SQL query that can be used in a dbt model. This is handy for previewing relation profiles in dbt Cloud.
- `target_database` (required): Target output database name.
- `target_schema` (required): Target output schema name.
- `target_table` (required): Target output table name.
- `source_database` (required): Source database name.
- `source_schema` (optional): Source schema name(s).
- `exclude_tables` (optional): List of tables to exclude from the profile (default: `[]`). Only one of `include_tables` and `exclude_tables` can be specified at a time.
- `include_tables` (optional): List of tables to include in the profile (default: `[]`, i.e., all). Only one of `include_tables` and `exclude_tables` can be specified at a time.
Use a dbt run-operation or a dbt model to create the profiling table. `dbt run-operation` is recommended when only a single database needs to be profiled; otherwise, the dbt model method can be used to profile multiple databases.

For single-database handling, use the following run-operation command:
```shell
dbt run-operation data_profiler.data_profiling --args "{target_database: <target database>, target_schema: <target schema>, target_table: <target table>, source_database: <source database>, source_schema: [<source schema 1>, <source schema 2>...], exclude_tables: [<exclude table 1>, <exclude table 2>...], include_tables: [<include table 1>, <include table 2>...]}" --target <target name>
```
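For example, a concrete invocation might look like the following. All database, schema, and table names here are hypothetical placeholders; `include_tables` is omitted because it cannot be combined with `exclude_tables`.

```shell
dbt run-operation data_profiler.data_profiling --args "{target_database: analytics, target_schema: audit, target_table: profile_results, source_database: raw, source_schema: [sales, finance], exclude_tables: [tmp_orders]}" --target dev
```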
For multiple-database handling, use the dbt model method:
```sql
{{ data_profiler.data_profiling(target_database = '<target database>'
                              , target_schema = '<target schema>'
                              , target_table = '<target table>'
                              , source_database = '<source database>'
                              , source_schema = ['<source schema 1>', '<source schema 2>'...]
                              , exclude_tables = ['<exclude table 1>', '<exclude table 2>'...]
                              , include_tables = ['<include table 1>', '<include table 2>'...]) }}
```
The above model generates two tables: a temporary table with no data and the target table with the profiled data. If the temporary table is not required, it can be avoided using materialization or a hook:
```sql
{{
    config(
        materialized='ephemeral'
    )
}}
```

OR

```yaml
post-hook:
  - "DROP TABLE IF EXISTS <model name / temporary table name>"
```
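Putting the pieces together, a model file such as `models/profile_results.sql` (a hypothetical name, as are all the database, schema, and table values below) could combine the ephemeral config with the macro call:

```sql
{{
    config(
        materialized='ephemeral'
    )
}}

{{ data_profiler.data_profiling(target_database = 'analytics'
                              , target_schema = 'audit'
                              , target_table = 'profile_results'
                              , source_database = 'raw'
                              , source_schema = ['sales', 'finance']
                              , exclude_tables = ['tmp_orders']) }}
```

With `materialized='ephemeral'`, dbt does not persist the model itself, so only the target table written by the macro remains.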
| DATABASE              | SCHEMA   | TABLE_NAME | COLUMN_NAME     | DATA_TYPE | ROW_COUNT | NOT_NULL_COUNT | NULL_COUNT | NOT_NULL_PERCENTAGE | NULL_PERCENTAGE | DISTINCT_COUNT | DISTINCT_PERCENTAGE | IS_UNIQUE | MIN        | MAX        | AVG        | PROFILED_AT              |
| --------------------- | -------- | ---------- | --------------- | --------- | --------- | -------------- | ---------- | ------------------- | --------------- | -------------- | ------------------- | --------- | ---------- | ---------- | ---------- | ------------------------ |
| SNOWFLAKE_SAMPLE_DATA | TPCH_SF1 | ORDERS     | O_ORDERKEY      | NUMBER    | 1500000   | 1500000        | 0          | 100.00              | 0.00            | 1500000        | 100.00              | TRUE      | 1          | 6000000    | 2999991.50 | 2022-12-06T09:05:18.183Z |
| SNOWFLAKE_SAMPLE_DATA | TPCH_SF1 | ORDERS     | O_CUSTKEY       | NUMBER    | 1500000   | 1500000        | 0          | 100.00              | 0.00            | 99996          | 6.67                | FALSE     | 1          | 149999     | 75006.04   | 2022-12-06T09:05:18.183Z |
| SNOWFLAKE_SAMPLE_DATA | TPCH_SF1 | ORDERS     | O_ORDERSTATUS   | VARCHAR   | 1500000   | 1500000        | 0          | 100.00              | 0.00            | 3              | 0.00                | FALSE     | null       | null       | null       | 2022-12-06T09:05:18.183Z |
| SNOWFLAKE_SAMPLE_DATA | TPCH_SF2 | CUSTOMERS  | O_TOTALPRICE    | NUMBER    | 1500000   | 1500000        | 0          | 100.00              | 0.00            | 1464556        | 97.64               | FALSE     | 857.71     | 555285.16  | 151219.54  | 2022-12-06T09:05:18.183Z |
| SNOWFLAKE_SAMPLE_DATA | TPCH_SF2 | CUSTOMERS  | O_ORDERDATE     | DATE      | 1500000   | 1500000        | 0          | 100.00              | 0.00            | 2406           | 0.16                | FALSE     | 1992-01-01 | 1998-08-02 | null       | 2022-12-06T09:05:18.183Z |
| SNOWFLAKE_SAMPLE_DATA | TPCH_SF2 | CUSTOMERS  | O_ORDERPRIORITY | VARCHAR   | 1500000   | 1500000        | 0          | 100.00              | 0.00            | 5              | 0.00                | FALSE     | null       | null       | null       | 2022-12-06T09:05:18.183Z |
| SNOWFLAKE_SAMPLE_DATA | TPCH_SF3 | DELIVERY   | O_CLERK         | VARCHAR   | 1500000   | 1500000        | 0          | 100.00              | 0.00            | 1000           | 0.07                | FALSE     | null       | null       | null       | 2022-12-06T09:05:18.183Z |
| SNOWFLAKE_SAMPLE_DATA | TPCH_SF3 | DELIVERY   | O_SHIPPRIORITY  | NUMBER    | 1500000   | 1500000        | 0          | 100.00              | 0.00            | 1              | 0.00                | FALSE     | 0          | 0          | 0.00       | 2022-12-06T09:05:18.183Z |
| SNOWFLAKE_SAMPLE_DATA | TPCH_SF3 | DELIVERY   | O_COMMENT       | VARCHAR   | 1500000   | 1500000        | 0          | 100.00              | 0.00            | 1482071        | 98.80               | FALSE     | null       | null       | null       | 2022-12-06T09:05:18.183Z |