Apache Feather #355

ian-k · 2023-04-14T19:41:29Z

Hello,
I may be missing something, but after inspecting the code, it seems to me that the current implementation does not make use of Apache Arrow's zero-copy support when reading files. Am I mistaken? Can anyone comment?
Thank you

koperagen · 2023-04-15T00:54:49Z
Maintainer

Hi! It doesn't. At this point it's just the ability to read data from this format. To leverage its zero-copy potential, we'd have to replace the underlying data storage, switching from lists to Apache Arrow vectors, if I'm not mistaken. It's a big question how that should be done and what tradeoffs it implies, because Apache Arrow manages memory off-heap and would require end users to manage allocators to perform operations.
We're interested in this feature though and will get to it eventually. In the meantime, feedback about your uses for it is welcome :)

Edit: Actually, maybe zero-copy reading (not writing) can be done more easily, hm... For example, if a special type of value column backed by vectors is implemented, we can avoid allocations when reading a dataframe. But every operation will still allocate lists when it needs to create a new column. How does that sound?
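
To make the idea concrete, here is a minimal sketch in Java using only the standard library. It is not the DataFrame API and not the Arrow API: `IntColumn`, `ListIntColumn`, `BufferIntColumn` and `Columns.plus` are hypothetical names, and a direct `ByteBuffer` stands in for an Arrow vector. It assumes the buffer holds little-endian 32-bit ints starting at offset 0.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical column abstraction, for illustration only.
interface IntColumn {
    int size();
    int get(int index);
}

// Heap-backed column: the list-based storage model discussed above.
final class ListIntColumn implements IntColumn {
    private final List<Integer> values;

    ListIntColumn(List<Integer> values) {
        this.values = values;
    }

    public int size() { return values.size(); }
    public int get(int index) { return values.get(index); }
}

// Read-only view over a buffer of little-endian 32-bit ints.
// Assumption: a direct ByteBuffer stands in for an Arrow vector.
final class BufferIntColumn implements IntColumn {
    private final ByteBuffer buffer;

    BufferIntColumn(ByteBuffer source) {
        // asReadOnlyBuffer() shares the underlying bytes; nothing is copied.
        this.buffer = source.asReadOnlyBuffer().order(ByteOrder.LITTLE_ENDIAN);
    }

    public int size() { return buffer.limit() / Integer.BYTES; }
    public int get(int index) { return buffer.getInt(index * Integer.BYTES); }
}

final class Columns {
    // A derived column is materialized on the heap, so only reading
    // avoids the copy; every operation still allocates a list.
    static IntColumn plus(IntColumn column, int delta) {
        List<Integer> out = new ArrayList<>(column.size());
        for (int i = 0; i < column.size(); i++) {
            out.add(column.get(i) + delta);
        }
        return new ListIntColumn(out);
    }
}
```

In this sketch, reading through `BufferIntColumn` touches the original bytes directly, while `Columns.plus` returns a `ListIntColumn`. The lifetime of the underlying buffer is left out entirely, which is exactly the open question with off-heap memory.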

4 replies

ian-k Apr 15, 2023
Author

Currently, Apache Arrow is one of the best-performing formats, if not the best; even without zero copy the gains are very noticeable. So I think first-class support for Arrow is very important for the wider adoption of this DataFrame. (My use cases include reading/writing files to S3 and distributing data over the Hazelcast data grid; once we moved to Arrow we got 3- to 10-fold performance gains.) So my line of thinking is to provide a kind of DataFrame with special 'Arrow' column types, even if it means that some DataFrame capabilities will not be supported. Or perhaps a lazy transition to a regular DataFrame when such capabilities are requested.
So yes, it is almost exactly the same as you proposed. I'm not sure about the need to 'allocate lists', though, as new columns can also be created as the 'Arrow' type.

Jolanrensen Apr 17, 2023
Maintainer

However, as seen in #358, there are issues with Arrow and other large-scale processing libraries in terms of compatibility for all platforms. Hence we cannot "just switch" and need to be very careful. DataFrame mostly focusses on providing the best API possible in Kotlin first and large-scale processing comes second.
However, indeed, the speed benefits look very enticing :)

pacher Apr 17, 2023

@Jolanrensen but as seen in #24, kdf is not trying to support multiplatform and is JVM only. Does it really have much use on Android? Maybe it is worth having a separate dataframe-light for Android?

koperagen Apr 28, 2023
Maintainer

I think switching entirely to allocation in off-heap memory will force us to switch the model from eager computations, where each operator produces a dataframe, to lazy ones, where you need to call "collect" to get the disposable result, and end users will have to deal with resource management on their end. Without the lazy model, it's not clear how to manage memory allocated for intermediate results. That's why I thought about more "lightweight" support only for reading data. But those are just my initial thoughts on the matter. Maybe after more thorough research, we'll find a workaround for these problems.
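
As a rough illustration of the lazy model described here, the following Java sketch records operators without allocating anything and only produces a disposable result on `collect()`. Everything in it is hypothetical (`LazyInts`, `OffHeapInts`, `map`, `collect`); it is not the DataFrame or Arrow API. A direct `ByteBuffer` stands in for allocator-managed off-heap memory, and the source data is a plain array for brevity.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Hypothetical disposable result; the caller is responsible for closing it.
final class OffHeapInts implements AutoCloseable {
    private ByteBuffer buffer;
    private final int size;

    OffHeapInts(int size) {
        this.size = size;
        this.buffer = ByteBuffer.allocateDirect(size * Integer.BYTES);
    }

    int size() { return size; }
    int get(int i) { return live().getInt(i * Integer.BYTES); }
    void set(int i, int value) { live().putInt(i * Integer.BYTES, value); }
    boolean isClosed() { return buffer == null; }

    private ByteBuffer live() {
        if (buffer == null) throw new IllegalStateException("already closed");
        return buffer;
    }

    // Assumption: a real allocator would release the memory here;
    // this sketch only drops the reference.
    @Override
    public void close() { buffer = null; }
}

// Lazy pipeline: map() only records a step, collect() allocates once.
final class LazyInts {
    private final int[] source;
    private final List<IntUnaryOperator> steps;

    private LazyInts(int[] source, List<IntUnaryOperator> steps) {
        this.source = source;
        this.steps = steps;
    }

    static LazyInts of(int... source) {
        return new LazyInts(source.clone(), new ArrayList<>());
    }

    LazyInts map(IntUnaryOperator step) {
        List<IntUnaryOperator> next = new ArrayList<>(steps);
        next.add(step);
        return new LazyInts(source, next);
    }

    // No intermediate result is ever materialized, so there is
    // exactly one off-heap allocation to manage per pipeline.
    OffHeapInts collect() {
        OffHeapInts out = new OffHeapInts(source.length);
        for (int i = 0; i < source.length; i++) {
            int value = source[i];
            for (IntUnaryOperator step : steps) value = step.applyAsInt(value);
            out.set(i, value);
        }
        return out;
    }
}
```

A caller would typically wrap the result in try-with-resources, e.g. `try (OffHeapInts r = LazyInts.of(1, 2, 3).map(x -> x + 1).collect()) { ... }`, which is the resource management burden on end users mentioned above.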
