DataFrame.parse() performance issue for wide DF #849

Closed
Jolanrensen opened this issue Aug 29, 2024 · 0 comments · Fixed by #874
Jolanrensen commented Aug 29, 2024

We noticed this issue especially when parsing DataFrames with lots of String columns, such as a wide CSV file.

When you run DataFrame.parse(), the columns are parsed one at a time.

If a column has type String, tryParse goes through each parser in Parsers in order. As soon as one value in the column cannot be parsed, it gives up on that parser and tries the next one.

These parsers are ordered like Int -> Long -> Instant -> LocalDateTime -> ... -> Json -> String. This means that for "normal" string columns that need no parsing, at least one cell in the column goes through 17 parse attempts. Many of these attempts are wrapped in catchSilent {} blocks, which catch any thrown exception and return null instead. And exceptions are heavy: https://www.baeldung.com/java-exceptions-performance.
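To make the fallback behaviour concrete, here is a minimal, self-contained sketch of such a parser chain in plain Kotlin. It is not the library's actual code: `catchSilentSketch`, `NamedParser`, `parserChain` and `tryParseColumn` are hypothetical names, and the chain is shortened to five parsers instead of the full list.

```kotlin
import java.time.Instant
import java.time.LocalDateTime

// Hypothetical stand-in for catchSilent {}: swallow any exception, return null.
inline fun <T> catchSilentSketch(body: () -> T): T? =
    try { body() } catch (e: Exception) { null }

// Hypothetical parser wrapper: returns null when the value cannot be parsed.
class NamedParser(val name: String, val parse: (String) -> Any?)

// Shortened, illustrative chain; the real list is longer.
val parserChain = listOf(
    NamedParser("Int") { it.toIntOrNull() },    // fails without an exception
    NamedParser("Long") { it.toLongOrNull() },  // fails without an exception
    NamedParser("Instant") { s -> catchSilentSketch { Instant.parse(s) } },             // fails by throwing
    NamedParser("LocalDateTime") { s -> catchSilentSketch { LocalDateTime.parse(s) } }, // fails by throwing
    NamedParser("String") { it },               // always succeeds
)

// Returns the name of the first parser that accepts every value,
// together with the number of failed parse attempts along the way.
fun tryParseColumn(values: List<String>): Pair<String, Int> {
    var failedAttempts = 0
    for (parser in parserChain) {
        var allParsed = true
        for (value in values) {
            if (parser.parse(value) == null) {
                failedAttempts++
                allParsed = false
                break // one bad value is enough to move on to the next parser
            }
        }
        if (allParsed) return parser.name to failedAttempts
    }
    error("unreachable: the String parser accepts everything")
}

fun main() {
    println(tryParseColumn(listOf("1", "2", "3")))     // (Int, 0)
    println(tryParseColumn(listOf("apple", "banana"))) // (String, 4)
}
```

In this sketch a plain string column fails every parser before String once, and two of those four failures construct an exception. With the full chain the same column pays that cost for every exception-based parser, and a wide DataFrame pays it again for every string column.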

This is easy to measure by creating two wide DataFrames, one with columns that can be parsed to ints and another with columns that cannot be parsed and must remain strings:

image

We can see that parsing the wide string DF takes considerably more time.

image

This is mostly due to Instant.parse, toLocalDateTimeOrNull, etc., and, most importantly, all the fillInStackTrace calls at the top of the graph, i.e. the exceptions thrown by the parsers. We might be able to improve this :)

image

Looking at the parsers there are some interesting observations and possible solutions:

  • Kotlin's Instant.parse() is a lot slower than Java's. We should use Java's and convert the result to a Kotlin Instant.
  • Many parsers are duplicates, like Java's LocalDateTime and Kotlin's. If a String can be parsed as a date-time, Kotlin's parser wins every time. If it cannot, both the Kotlin and the Java parser fail, and the second failure creates a useless exception. We should drop the Java duplicates.
  • Exceptions are heavy. toIntOrNull and toLongOrNull are so fast that their time doesn't even show up in the profile. If a library offers a canParse() function, we should use it.
  • We should try to parallelize the parsing, since columns don't depend on each other. Parsing is built on convert to, which in turn is built on replace with, so that's where the parallelization should occur. Relevant issue: Parallel computations #723
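
As a rough illustration of the last two points, the sketch below uses only the JDK. java.time has no canParse(), but DateTimeFormatter.parseUnresolved reports failures through a ParsePosition instead of throwing, so it can serve as a cheap pre-check. The names `canParseLocalDateTime`, `toLocalDateTimeOrNullSketch`, `parseColumn` and `parseColumnsInParallel` are hypothetical, and a Map of string lists stands in for a DataFrame. This is not how the library does or should implement it.

```kotlin
import java.text.ParsePosition
import java.time.DateTimeException
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.stream.Collectors

// Syntax-only pre-check: parseUnresolved signals failure via a null result
// or ParsePosition.errorIndex rather than a DateTimeParseException.
fun canParseLocalDateTime(text: String): Boolean {
    val position = ParsePosition(0)
    val parsed = DateTimeFormatter.ISO_LOCAL_DATE_TIME.parseUnresolved(text, position)
    return parsed != null && position.errorIndex < 0 && position.index == text.length
}

// The exception path is only reached for well-formed but invalid input,
// such as February 30th, which the syntax check cannot reject.
fun String.toLocalDateTimeOrNullSketch(): LocalDateTime? =
    if (!canParseLocalDateTime(this)) {
        null
    } else {
        try { LocalDateTime.parse(this) } catch (e: DateTimeException) { null }
    }

// Tiny two-step chain: Int, then LocalDateTime, otherwise keep the strings.
fun parseColumn(values: List<String>): List<Any> {
    val ints = values.mapNotNull { it.toIntOrNull() }
    if (ints.size == values.size) return ints
    val dateTimes = values.mapNotNull { it.toLocalDateTimeOrNullSketch() }
    if (dateTimes.size == values.size) return dateTimes
    return values
}

// Columns are independent, so each one can be parsed on its own thread.
fun parseColumnsInParallel(columns: Map<String, List<String>>): Map<String, List<Any>> =
    columns.entries.parallelStream()
        .map { entry -> entry.key to parseColumn(entry.value) }
        .collect(Collectors.toList())
        .toMap()

fun main() {
    val wide = mapOf(
        "ints" to listOf("1", "2"),
        "dates" to listOf("2024-08-29T10:15:30"),
        "strings" to listOf("apple", "banana"),
    )
    println(parseColumnsInParallel(wide))
}
```

Whether the pre-check pays off depends on how often values are well-formed. For the all-string columns described above it avoids constructing an exception for almost every failed attempt.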
@Jolanrensen Jolanrensen added the performance Something related to how fast the library can handle data label Aug 29, 2024
@Jolanrensen Jolanrensen added this to the 0.15.0 milestone Aug 29, 2024
@Jolanrensen Jolanrensen self-assigned this Aug 29, 2024