[FEA] Revisit multi-get_json_object now that CUDF has better performance #11879

Open
revans2 opened this issue Dec 16, 2024 · 0 comments
Labels
performance A performance related task/issue

Comments

Collaborator

revans2 commented Dec 16, 2024

Is your feature request related to a problem? Please describe.
We put a lot of work into get_json_object and we were able to speed up a specific customer query by over 3x from the original GPU version we tested, and over 4x from the CPU version.

After that, together with the CUDF team, we optimized from_json and JSON Scan significantly. I think it is time for us to revisit multi-get_json_object. If I rewrite this customer query to use from_json where possible, we are able to speed up the current CPU implementation by an additional 1.45x, making the total GPU speedup closer to 5x than 3x.

Describe the solution you'd like
This is mostly an experiment. We could try to write custom code that uses the tokens from the cudf JSON tokenizer to process multiple JSON paths in parallel, similar to what we do today with multi-get_json_object. We could also just rewrite the query so that the parts we feel confident doing with from_json are done that way. Or we could decide that we are at a good point and stay there. Either way, we need to make an informed decision, ideally based on more than one benchmark/query.
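
To make the trade-off concrete, here is a minimal pure-Python sketch of the underlying idea: extracting one path per call re-parses each row for every path, while a batched call parses each row once and serves all paths from that single parse. This is only an illustration. The helper names (`parse_path`, `lookup`, `get_one`, `get_many`) are hypothetical, only simple `$.a.b` paths are handled, and it uses Python's `json` module rather than the cudf tokenizer or any plugin code. It also does not try to reproduce the exact semantics of get_json_object or from_json.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def parse_path(path: str) -> list[str]:
    """Split a simple '$.a.b' path into field names (no arrays or wildcards)."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    return path[2:].split(".")


def lookup(doc: Any, fields: list[str]) -> Optional[str]:
    """Walk nested dicts; return the value as a string, or None if absent."""
    cur = doc
    for name in fields:
        if not isinstance(cur, dict) or name not in cur:
            return None
        cur = cur[name]
    if cur is None:
        return None
    # Strings come back unquoted; everything else comes back as JSON text.
    return cur if isinstance(cur, str) else json.dumps(cur)


def get_one(row: str, path: str) -> Optional[str]:
    """One path per call: the row is parsed again for every path."""
    try:
        doc = json.loads(row)
    except json.JSONDecodeError:
        return None
    return lookup(doc, parse_path(path))


def get_many(row: str, paths: list[str]) -> list[Optional[str]]:
    """All paths per call: the row is parsed once and the result is shared."""
    try:
        doc = json.loads(row)
    except json.JSONDecodeError:
        return [None] * len(paths)
    return [lookup(doc, parse_path(p)) for p in paths]


rows = ['{"a": {"b": 1, "c": "x"}, "d": [1, 2]}', "not json", '{"a": {"b": null}}']
paths = ["$.a.b", "$.a.c", "$.d"]

per_path = [[get_one(r, p) for p in paths] for r in rows]  # 3 parses per row
batched = [get_many(r, paths) for r in rows]               # 1 parse per row
assert per_path == batched
```

Both variants return the same values, so any difference between them is purely in how much parsing work is repeated. That repeated work is the cost a tokenizer-based or from_json-based approach would aim to avoid.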

@revans2 added the "? - Needs Triage" (Need team to review and classify) and "performance" (A performance related task/issue) labels Dec 16, 2024
@mattahrens removed the "? - Needs Triage" (Need team to review and classify) label Dec 19, 2024