# Data Pipelines of NY Taxi Data - Using Dagster as Orchestrator

In this project, I created and orchestrated data pipelines using Dagster as the orchestrator.

My raw data source is the TLC Trip Record Data.

- I used DuckDB as the database.
- I used the local file system to store some transformed data, saved as CSV and Parquet files.
- I used pandas and Matplotlib's pyplot to analyse the data, generate charts, and save them as PNGs in the local file system.
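As a minimal sketch of the analysis step (the column names, values, and output file name here are illustrative assumptions, not taken from the actual project):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render charts without a display
import matplotlib.pyplot as plt

# Hypothetical stand-in for a small slice of TLC trip data.
trips = pd.DataFrame({
    "pickup_hour": [0, 1, 1, 2, 2, 2],
    "fare_amount": [10.0, 12.5, 8.0, 15.0, 9.5, 11.0],
})

# Aggregate with pandas: average fare per pickup hour.
avg_fare = trips.groupby("pickup_hour")["fare_amount"].mean()

# Plot with pyplot and save the chart as a PNG on the local file system.
fig, ax = plt.subplots()
avg_fare.plot(kind="bar", ax=ax, title="Average fare by pickup hour")
ax.set_xlabel("pickup hour")
ax.set_ylabel("avg fare (USD)")
fig.savefig("avg_fare_by_hour.png")
```

In the project itself, a step like this would be wrapped in a Dagster asset so the saved PNG is tracked as a materialization.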

The completed pipeline run, as seen in the dagster-webserver UI (screenshot: "Completed Pipeline from dagster-webserver").

## Project description: how Dagster was utilized

- Overall, I had two data sources and three data sinks. With Dagster, I defined all sources, sinks, and external connections as Dagster resources.
- I defined each task as an asset, so tasks can be materialized independently and combined into jobs.
- I built different jobs from different combinations of assets in the pipeline, so they can be scheduled at different intervals.
- I used sensors for event-driven orchestration: a particular job kicks off when a user uploads or modifies a file in the ./data/requests/ directory.
- I configured the sensors to run with different parameters depending on the contents of the uploaded file.
- I made the orchestration stateful by storing and retrieving tracking data via the cursor property provided by the sensor context.
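Setting the Dagster API itself aside, the stateful cursor idea behind the last two bullets can be sketched in plain Python (the function name and cursor format are illustrative assumptions):

```python
import os

def poll_requests_dir(directory, cursor=None):
    """Return files added or modified since the cursor, plus the new cursor.

    The cursor is the latest modification time seen so far, stored as a
    string -- mirroring how a Dagster sensor persists state between
    evaluations via the cursor on its context.
    """
    last_mtime = float(cursor) if cursor else 0.0
    changed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        mtime = os.path.getmtime(path)
        if mtime > last_mtime:
            changed.append(path)  # in Dagster, each of these would become a run request
    if changed:
        last_mtime = max(os.path.getmtime(p) for p in changed)
    return changed, str(last_mtime)
```

Each evaluation receives the cursor stored by the previous one, so a file triggers a run only once unless it is modified again; the per-file run parameters mentioned above would come from parsing the file's contents before emitting the run request.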