Automatic sorting type is slow #21

Open
51-code opened this issue Jun 18, 2024 · 3 comments
Labels: bug (Something isn't working), feature (Existing feature)

Comments


51-code commented Jun 18, 2024

Describe the bug

Using the automatic sorting type in the sort command significantly increases query time. The culprit seems to be the numericalStringCheck() function, which should be reimplemented with performance in mind.

Expected behavior

The automatic sorting shouldn't increase the query time too much.

How to reproduce

Run sort first with default sorting:

%dpl
index=crud earliest=-3y | spath | sort elapsed

The query took 4 min 22 sec for me.

Then run sort with the auto sorting:

%dpl
index=crud earliest=-3y | spath | sort auto(elapsed)

The query took 7 min 39 sec for me, an increase of about 75%.

sort can also take multiple columns. Using auto sorting on two columns increases the query time again, to close to 11 minutes.

Software version

DPF-02 version 3.0.0
PTH-10 version 5.3.0-7-ge44d00e9


Additional context

Auto sorting is very useful in many cases because some PTH-10 commands change the datatype of columns to String, as they use Spark's User Defined Functions (UDFs), which can only return a single datatype. The downside is that this breaks sorting of numerical values, which is the problem auto sorting deals with.
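To illustrate why String columns break numerical sorting, here is a minimal Python sketch. It is not the PTH-10 implementation: the function names `sort_as_strings` and `sort_auto` and the sample values are made up for this example, and the "auto" behaviour is only an assumption about what an automatic datatype check has to decide.

```python
def sort_as_strings(values):
    """Plain string sort: compares character by character (lexicographic)."""
    return sorted(values)


def sort_auto(values):
    """Hypothetical 'auto' sort: numeric if every value parses as a number.

    Falls back to the plain string sort as soon as one value is not numeric.
    """
    try:
        return sorted(values, key=float)
    except ValueError:
        return sorted(values)


# Values as they might look after a command has turned the column to String.
elapsed = ["10", "9", "100", "2.5"]

print(sort_as_strings(elapsed))  # ['10', '100', '2.5', '9']
print(sort_auto(elapsed))        # ['2.5', '9', '10', '100']
```

Note that the numeric attempt has to look at every value before it can decide, which is the kind of per-row work that makes an automatic check more expensive than a plain sort.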

For example, PTH-10 issue #256 adds default sorting for chart and stats, but they suffer from the same performance issue if auto sorting is used to fix the problem of running e.g. spath before those commands (spath uses UDFs and changes everything in the dataset to String).

Matching numbers with a regex in numericalStringCheck() was already tried, but it didn't improve performance.


51-code commented Jun 18, 2024

Other possible solutions that came to mind for PTH-10:

  1. Fix the underlying issue of UDF commands returning String (this might very well be impossible and has been tried before).
  2. Flag the commands that turn columns to String with a new CommandProperty, so that auto sorting is applied only when it is really needed, avoiding the performance issue in most cases.
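The second option could look roughly like the following Python sketch. Everything in it is hypothetical: the `Command` class, the `stringifies_columns` flag, and the `pick_sort_type` helper are names invented for illustration and do not describe how CommandProperty actually works in PTH-10.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    name: str
    # Hypothetical flag: True if the command turns columns into String,
    # for example because it is implemented with a UDF.
    stringifies_columns: bool = False


def pick_sort_type(commands_before_sort):
    """Use the expensive auto sort only if an earlier command needs it."""
    if any(cmd.stringifies_columns for cmd in commands_before_sort):
        return "auto"
    return "default"


with_spath = [Command("search"), Command("spath", stringifies_columns=True)]
without_spath = [Command("search"), Command("eval")]

print(pick_sort_type(with_spath))     # auto
print(pick_sort_type(without_spath))  # default
```

With a flag like this, queries that never pass through a String-producing command would keep the default sort and its query time.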

@StrongestNumber9

Does this scale in O(n), as in does 10x input result in 10x processing time? Our internal datasets are relatively small, so it might be a good idea to verify whether there's more design work to do.


51-code commented Jun 18, 2024

I tested with the same query and varying amounts of data. The original query (100% in this case) returned 539071 records.

Summed up:
The relative overhead of auto sorting grows with the dataset size, but it seems to level off at around 90-100%, which suggests that applying the automatic datatype check is O(n).

Below are the results.

Results with 185% of the dataset (query time increase of 91%):

  • default sort: 10min 56sec
  • auto sort: 20min 55sec

Results with 100% of the dataset (query time increase of 75%):

  • default sort: 4min 22sec
  • auto sort: 7min 39sec

Results with 53% of the dataset (query time increase of 98%):

  • default sort: 1min 22sec
  • auto sort: 2min 43sec

Results with 38% of the dataset (query time increase of 53%):

  • default sort: 1min 13sec
  • auto sort: 1min 52sec

Results with 22% of the dataset (query time increase of 27%):

  • default sort: 55 sec
  • auto sort: 1min 10sec
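The percentages above can be recomputed from the timings with a short Python sketch. The helper names are made up for this example; the timings are the ones listed above, and the increase is rounded down to a whole percent.

```python
def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) timing to seconds."""
    return minutes * 60 + seconds


def overhead_percent(default_s, auto_s):
    """Query time increase of auto sort over default sort, rounded down."""
    return (auto_s - default_s) * 100 // default_s


# (share of dataset in %, default sort timing, auto sort timing)
MEASUREMENTS = [
    (185, (10, 56), (20, 55)),
    (100, (4, 22), (7, 39)),
    (53, (1, 22), (2, 43)),
    (38, (1, 13), (1, 52)),
    (22, (0, 55), (1, 10)),
]

for share, default, auto in MEASUREMENTS:
    increase = overhead_percent(to_seconds(*default), to_seconds(*auto))
    print(f"{share}% of the dataset: +{increase}%")
```

This prints increases of 91%, 75%, 98%, 53% and 27%, matching the figures reported above.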
