# Data Pipelines of NY Taxi Data - Using Dagster as Orchestrator

In this project, I created and orchestrated data pipelines using Dagster as the orchestrator.

My raw data source is the TLC Trip Record Data.

- I used DuckDB as the database.
- I used the local file system to store some transformed data, saved as CSV and Parquet files.
- I used pandas and Matplotlib's pyplot to analyse the data, generate charts, and save them as PNGs in the local file system.
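As a minimal sketch of the analysis step (the column names, values, and output file name here are illustrative assumptions, not taken from the actual project):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render charts without a display
import matplotlib.pyplot as plt

# Hypothetical stand-in for a small slice of TLC trip data.
trips = pd.DataFrame({
    "pickup_hour": [0, 1, 1, 2, 2, 2],
    "fare_amount": [10.0, 12.5, 8.0, 15.0, 9.5, 11.0],
})

# Aggregate with pandas: average fare per pickup hour.
avg_fare = trips.groupby("pickup_hour")["fare_amount"].mean()

# Plot with pyplot and save the chart as a PNG on the local file system.
fig, ax = plt.subplots()
avg_fare.plot(kind="bar", ax=ax, title="Average fare by pickup hour")
ax.set_xlabel("pickup hour")
ax.set_ylabel("avg fare (USD)")
fig.savefig("avg_fare_by_hour.png")
```

In the project itself, a step like this would be wrapped in a Dagster asset so the saved PNG is tracked as a materialization.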

The completed pipeline run, as seen in the dagster-webserver UI (screenshot: "Completed Pipeline from dagster-webserver").

## Project description: how Dagster was utilized

- Overall, I had two data sources and three data sinks. With Dagster, I defined all sources, sinks, and external connections as Dagster resources.
- I defined each task as an asset, so tasks can be materialized independently and combined into jobs.
- I built different jobs from different combinations of assets in the pipeline, so they can be scheduled at different intervals.
- I used sensors for event-driven orchestration: a particular job kicks off when a user uploads or modifies a file in the ./data/requests/ directory.
- I configured the sensors to run with different parameters depending on the contents of the uploaded file.
- I made the orchestration stateful by storing and retrieving tracking data via the cursor property provided by the sensor context.
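Setting the Dagster API itself aside, the stateful cursor idea behind the last two bullets can be sketched in plain Python (the function name and cursor format are illustrative assumptions):

```python
import os

def poll_requests_dir(directory, cursor=None):
    """Return files added or modified since the cursor, plus the new cursor.

    The cursor is the latest modification time seen so far, stored as a
    string -- mirroring how a Dagster sensor persists state between
    evaluations via the cursor on its context.
    """
    last_mtime = float(cursor) if cursor else 0.0
    changed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        mtime = os.path.getmtime(path)
        if mtime > last_mtime:
            changed.append(path)  # in Dagster, each of these would become a run request
    if changed:
        last_mtime = max(os.path.getmtime(p) for p in changed)
    return changed, str(last_mtime)
```

Each evaluation receives the cursor stored by the previous one, so a file triggers a run only once unless it is modified again; the per-file run parameters mentioned above would come from parsing the file's contents before emitting the run request.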