Pipelines / Transform functionality discussion #720

rufuspollock · 2020-11-11T07:27:54Z

rufuspollock
Nov 11, 2020
Maintainer

We're planning to have some kind of recommended pipeline / transform system(s) for Frictionless toolkit.

@roll I (Rufus) am opening this issue so we have a place to discuss plans for transform / pipelines functionality. I think this is something worth discussing a bit and maybe even speccing in an RFC.

It's something @rufuspollock has now been involved with implementing several times (including dataflows and a new system called AirCan https://github.com/datopian/aircan) - we may want to think about whether we directly reuse this or build anew. It is also an area where there is a lot of existing open source tooling, so it is worth thinking about what we reuse vs what we build ourselves.

Tasks

Summarize existing work
Identify the gaps / opportunities / challenges
Describe plan of action in e.g. RFC
Sign off

Analysis

Existing work (related to frictionless)

https://github.com/datahq/dataflows
frictionless-py - @roll perhaps you could summarize what is happening here, as I saw some issues, e.g. "Extend and mature transform functionality" (frictionless-py#389)
https://github.com/datopian/aircan - frictionless-oriented data pipeline system built on top of AirFlow (i.e. pipelines are DAGs in AirFlow). See also https://tech.datopian.com/flows/
(Deprecated) Data Factory in DataHub

Existing work (non frictionless) - see also https://tech.datopian.com/flows/research/

AirFlow
Luigi
Petl
...

Please preserve this line to notify @lwinfree (lead of this repository)

rufuspollock · 2020-11-11T07:28:16Z

rufuspollock
Nov 11, 2020
Maintainer Author

@roll @risenW ^^^

roll · 2020-11-11T07:40:23Z

roll
Nov 11, 2020
Maintainer

@rufuspollock @risenW
Frictionless for Python now supports:

core transform functionality powered by PETL (at the moment it wraps nearly 100 PETL operations)
a plugin that can run DataFlows pipelines

https://colab.research.google.com/drive/1C4dFWDExyxzGIwLUovrDQZghZK4JK2PD

rufuspollock · 2020-11-11T09:24:10Z

rufuspollock
Nov 11, 2020
Maintainer Author

@roll great, and I've read through that. What do you think of doing a bit of syncing / planning before we proceed much further?

Also some questions:

Looking at the code it looks like you export the resource to be used by PETL (which seems very sensible) rather than wrapping. Is that right?
Does it run in streaming or non streaming mode?

roll · 2020-11-12T06:15:03Z

roll
Nov 12, 2020
Maintainer

@rufuspollock
Yes, of course.

One more point: in my opinion, this shouldn't be driven by some kind of committee 😃. Honestly, I won't know whether it's a good enough solution until real people start using it in real projects and give us feedback.

For example, as far as I can remember, this project https://github.com/datasets/covid-19 was driven by DataFlows initially and it uses Pandas now. I guess you ran into some shortcomings of the pipeline and had to switch. It would also be very interesting to see whether we can prototype something like this using Frictionless Transform.

> Looking at the code it looks like you export the resource to be used by PETL (which seems very sensible) rather than wrapping. Is that right?

I use a pretty simple adapter that makes Resource compatible with the PETL table container interface - https://petl.readthedocs.io/en/stable/intro.html#conventions-table-containers-and-table-iterators. This allows us to re-use all of their battle-tested processors for the data and to write only the metadata updates in our own processors. That said, we fully wrap PETL as a project, so our users won't need to go to its documentation once we have finished ours.
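To make the adapter idea concrete, here is a minimal stdlib-only sketch of the table container convention described at the link above: an object whose `__iter__` returns a fresh iterator that yields the header row first and the data rows after it. This is not the actual frictionless-py code. `ResourceLike`, `TableContainerAdapter` and `AddField` are hypothetical names made up for illustration, and `AddField` only stands in for the kind of data processor that would normally come from PETL.

```python
import csv
import io


class ResourceLike:
    """Hypothetical stand-in for a Resource (NOT the real Frictionless class)."""

    def __init__(self, text, field_names):
        self.text = text                # raw CSV content
        self.field_names = field_names  # metadata: list of field names

    def read_rows(self):
        # Yield data rows (without the header) one at a time.
        reader = csv.reader(io.StringIO(self.text))
        next(reader)  # skip the header line
        yield from reader


class TableContainerAdapter:
    """Expose a ResourceLike as a table container: every call to __iter__
    starts a new pass that yields the header first, then the data rows."""

    def __init__(self, resource):
        self.resource = resource

    def __iter__(self):
        yield tuple(self.resource.field_names)
        for row in self.resource.read_rows():
            yield tuple(row)


class AddField:
    """Hypothetical data processor that itself follows the same convention,
    so processors can be chained."""

    def __init__(self, source, name, func):
        self.source = source
        self.name = name
        self.func = func

    def __iter__(self):
        rows = iter(self.source)
        header = next(rows)
        yield header + (self.name,)
        for row in rows:
            yield row + (self.func(dict(zip(header, row))),)


resource = ResourceLike("id,price\n1,10\n2,20\n", ["id", "price"])
table = TableContainerAdapter(resource)
doubled = AddField(table, "double", lambda r: int(r["price"]) * 2)

# The "metadata update" part is the only thing the wrapper has to add.
new_field_names = resource.field_names + ["double"]

for row in doubled:
    print(row)
```

The point of the design is the split of responsibilities: the data transformation works on any header-first iterable, while the wrapper only has to keep the metadata (here just the field names) in step with it.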

> Does it run in streaming or non streaming mode?

It's streaming
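As a rough illustration of what streaming means here (a sketch with plain Python generators, not the Frictionless or PETL implementation), each step below pulls rows from the previous one only on demand, so nothing is loaded into memory up front. The helper names `source`, `select` and `head` are assumptions made up for this example.

```python
import itertools


def source(n, produced):
    """Yield a header, then n rows, recording each row actually produced."""
    yield ("id", "value")
    for i in range(n):
        produced.append(i)
        yield (i, i * 10)


def select(table, predicate):
    """Lazily keep only the rows for which predicate(row_as_dict) is true."""
    rows = iter(table)
    header = next(rows)
    yield header
    for row in rows:
        if predicate(dict(zip(header, row))):
            yield row


def head(table, n):
    """Lazily take the header plus the first n data rows."""
    return itertools.islice(table, n + 1)


produced = []
pipeline = head(select(source(1_000_000, produced), lambda r: r["id"] % 2 == 0), 2)
result = list(pipeline)

print(result)    # header plus the first two rows with an even id
print(produced)  # only the source rows that were actually needed
```

Although the source could produce a million rows, only the first three are ever generated, because `head` stops asking for more once it has what it needs.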

lwinfree · 2022-03-28T18:32:12Z

lwinfree
Mar 28, 2022

hi @roll do you think we can close this? has it been resolved in frictionless-py?

1 reply

roll Mar 29, 2022
Maintainer

@lwinfree It's hard to say how close it is to being closed, as it feels more like a discussion, so I've moved this issue to the discussions.
