-
Hi. Have you found any recommendations for working with Parquet files?
-
I am excited to try out this feature, but I wanted to clear a few things up.

I am using partitioned Parquet datasets for everything currently. I can register the dataset as the base path, such as `abfs://container/dag_id/task_id/dag_version`, and not worry about specific files or partitions, correct? Some of my datasets have thousands of files within, and there are cleanup tasks which consolidate small files into one file per partition (so filenames change and are not consistent).

I am also curious whether there are plans for downstream datasets to conveniently pull XComs from upstream datasets. For example, if my dataset is 100 million rows, then I only want to process new/changed data. My `outlet` dataset can return an XCom from one of its tasks with the partitions I need to process, and then my `schedule` downstream dataset can use that as a filter. This isn't a dealbreaker, but it seems like it would be nice to refer to the dataset instead of the dag_id, task_id, etc. of the upstream dataset.
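For reference, here is roughly what I mean by registering the base path, as a minimal sketch using the `Dataset`/`outlets` API from Airflow 2.4+; the URI, DAG id, and task name below are made-up placeholders, not my real pipeline:

```python
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

# One Dataset registered against the partitioned Parquet directory's base path,
# not against individual files or partitions (placeholder URI).
tickets_dataset = Dataset("abfs://container/zendesk_ingest/write_tickets/v1")

@dag(schedule="@hourly", start_date=pendulum.datetime(2023, 9, 1), catchup=False)
def zendesk_ingest():
    @task(outlets=[tickets_dataset])
    def write_tickets():
        # ... write/consolidate Parquet files under the base path here ...
        # The return value becomes an XCom: the partition dates that actually
        # changed in this run.
        return ["2023-09-01", "2023-09-04", "2023-09-19", "2023-09-20"]

    write_tickets()

zendesk_ingest()
```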
Example

If I download data from the Zendesk Incremental Tickets API, I pull any tickets updated since the last schedule. These tickets could have been created months ago, which means I need to reprocess old partitions to ensure we reflect the latest version of each ticket id.

This particular dataset is large, around 100 million rows. However, based on the XCom, I know that the only data which has changed falls on a few dates (09/01, 09/04, 09/19, 09/20), so only those partitions need to be processed.
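The workaround I can see today is for the downstream, dataset-scheduled DAG to pull that XCom by hard-coding the upstream dag_id and task_id, which is exactly the coupling I'd rather avoid. A minimal sketch, again with placeholder names and assuming the producer DAG from the snippet above:

```python
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Same base-path Dataset that the producer declares in its outlets (placeholder URI).
tickets_dataset = Dataset("abfs://container/zendesk_ingest/write_tickets/v1")

@dag(schedule=[tickets_dataset], start_date=pendulum.datetime(2023, 9, 1), catchup=False)
def process_tickets():
    @task
    def reprocess_changed_partitions(**context):
        ti = context["ti"]
        # Cross-DAG XCom pull: I have to name the upstream dag_id/task_ids here
        # instead of just referring to the dataset that triggered this run.
        changed_dates = ti.xcom_pull(
            dag_id="zendesk_ingest",
            task_ids="write_tickets",
            include_prior_dates=True,
        ) or []
        for partition_date in changed_dates:
            # Only touch the partitions that actually changed (e.g. date=2023-09-01)
            # instead of rescanning all ~100 million rows.
            print(f"reprocessing partition date={partition_date}")

    reprocess_changed_partitions()

process_tickets()
```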