-
Hi. Have you found any recommendations for working with Parquet files?
-
I am excited to try out this feature, but I wanted to clear a few things up.

I am using partitioned Parquet datasets for everything currently. I can register the dataset as the base path, such as `abfs://container/dag_id/task_id/dag_version`, and not worry about specific files or partitions, correct? Some of my datasets have thousands of files within, and there are cleanup tasks which consolidate small files into one file per partition (so filenames change and are not consistent).

I am also curious whether there are plans for downstream datasets to conveniently pull XComs from upstream datasets. For example, if my dataset is 100 million rows, then I only want to process new/changed data. My `outlet` dataset can return an XCom from one of its tasks with the partitions I need to process, and then my `schedule` downstream dataset can use that as a filter. This isn't a dealbreaker, but it seems like it would be nice to refer to the dataset instead of the dag_id, task_id, etc. of the upstream dataset.
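For reference, here is roughly what I mean by registering the base path, as a minimal sketch using the `Dataset`/`outlets` API from Airflow 2.4+; the URI, DAG id, and task name below are made-up placeholders, not my real pipeline:

```python
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

# One Dataset registered against the partitioned Parquet directory's base path,
# not against individual files or partitions (placeholder URI).
tickets_dataset = Dataset("abfs://container/zendesk_ingest/write_tickets/v1")

@dag(schedule="@hourly", start_date=pendulum.datetime(2023, 9, 1), catchup=False)
def zendesk_ingest():
    @task(outlets=[tickets_dataset])
    def write_tickets():
        # ... write/consolidate Parquet files under the base path here ...
        # The return value becomes an XCom: the partition dates that actually
        # changed in this run.
        return ["2023-09-01", "2023-09-04", "2023-09-19", "2023-09-20"]

    write_tickets()

zendesk_ingest()
```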
Example

If I download data from the Zendesk Incremental Tickets API, I pull any tickets updated since the last schedule. These tickets could have been created months ago, which means I need to reprocess old partitions to ensure we reflect the latest version of each ticket id.

This particular dataset is large, around 100 million rows. However, based on the XCom, I know that the only data which has changed falls on a few dates (09/01, 09/04, 09/19, 09/20), so only those partitions need to be processed.
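The workaround I can see today is for the downstream, dataset-scheduled DAG to pull that XCom by hard-coding the upstream dag_id and task_id, which is exactly the coupling I'd rather avoid. A minimal sketch, again with placeholder names and assuming the producer DAG from the snippet above:

```python
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Same base-path Dataset that the producer declares in its outlets (placeholder URI).
tickets_dataset = Dataset("abfs://container/zendesk_ingest/write_tickets/v1")

@dag(schedule=[tickets_dataset], start_date=pendulum.datetime(2023, 9, 1), catchup=False)
def process_tickets():
    @task
    def reprocess_changed_partitions(**context):
        ti = context["ti"]
        # Cross-DAG XCom pull: I have to name the upstream dag_id/task_ids here
        # instead of just referring to the dataset that triggered this run.
        changed_dates = ti.xcom_pull(
            dag_id="zendesk_ingest",
            task_ids="write_tickets",
            include_prior_dates=True,
        ) or []
        for partition_date in changed_dates:
            # Only touch the partitions that actually changed (e.g. date=2023-09-01)
            # instead of rescanning all ~100 million rows.
            print(f"reprocessing partition date={partition_date}")

    reprocess_changed_partitions()

process_tickets()
```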