Querying diffs is very slow on moderately large repositories #124

Open
mplanchard opened this issue Oct 2, 2024 · 2 comments
Labels
enhancement New feature or request

mplanchard commented Oct 2, 2024

Describe the bug

Queries on diffs for even moderately large repositories are incredibly slow. Our repository at work has ~5,500 commits.

The following query for the diff with the most deletions took ~27 minutes:

❯ time .cargo/bin/gitql --query 'select * from diffs order by deletions desc limit 1'
╭──────────────────────────────────────────┬───────────────────┬───────────────────────┬────────────┬───────────┬───────────────┬─────────────────────────┬───────────────────────────────────╮
│ commit_id                                ┆ name              ┆ email                 ┆ insertions ┆ deletions ┆ files_changed ┆ datetime                ┆ repo                              │
╞══════════════════════════════════════════╪═══════════════════╪═══════════════════════╪════════════╪═══════════╪═══════════════╪═════════════════════════╪═══════════════════════════════════╡
│ 8b685201464c3027afe9105bb5ed9b40a1befce7 ┆ Matthew Planchard ┆ [email protected] ┆ 3284       ┆ 41552     ┆ 212           ┆ 2024-08-15 18:15:45.000 ┆ /home/matthew/s/spec/.git         │
╰──────────────────────────────────────────┴───────────────────┴───────────────────────┴────────────┴───────────┴───────────────┴─────────────────────────┴───────────────────────────────────╯

________________________________________________________
Executed in   27.37 mins    fish           external
   usr time   27.25 mins  569.00 micros   27.25 mins
   sys time    0.04 mins    0.00 micros    0.04 mins

During the entire time, a single thread was pretty much pegged. I can get this same result using git and awk in a fraction (1/270th, 0.37%) of the time:

❯ time git log --pretty="@%h" --shortstat | tr "\n" " " | tr "@" "\n" | awk '{if ($7 > deletions) { deletions = $7; commit = $1 }}; END { print commit; print deletions }' 
8b6852014
41720

________________________________________________________
Executed in    6.01 secs    fish           external
   usr time    5.41 secs    0.00 millis    5.41 secs
   sys time    0.63 secs    1.78 millis    0.63 secs
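
One caveat on the awk one-liner above: `git log --shortstat` omits the insertions or deletions clause when its count is zero, so `$7` holds the deletion count only for commits that have both. The sketch below parses the same `git log --pretty="@%h" --shortstat` text by looking for the "deletion" clause instead of a fixed field position. The function name `max_deletions` and the sample log are made up for illustration; this is not how GitQL computes the `diffs` table.

```rust
/// Returns the (abbreviated hash, deletions) pair with the most deletions,
/// parsed from text shaped like `git log --pretty="@%h" --shortstat` output.
fn max_deletions(log: &str) -> Option<(String, u64)> {
    let mut current: Option<&str> = None;
    let mut best: Option<(String, u64)> = None;
    for line in log.lines() {
        let line = line.trim();
        // Lines starting with '@' carry the commit hash (from --pretty="@%h").
        if let Some(hash) = line.strip_prefix('@') {
            current = Some(hash);
            continue;
        }
        // Stat line, e.g. "2 files changed, 3 insertions(+), 1 deletion(-)".
        // Pick the comma-separated clause that mentions deletions, if any.
        let deletions = line
            .split(',')
            .map(str::trim)
            .find(|part| part.contains("deletion"))
            .and_then(|part| part.split_whitespace().next())
            .and_then(|n| n.parse::<u64>().ok());
        if let (Some(hash), Some(d)) = (current, deletions) {
            if best.as_ref().map_or(true, |(_, b)| d > *b) {
                best = Some((hash.to_string(), d));
            }
        }
    }
    best
}

fn main() {
    // Hypothetical sample; the second commit has deletions but no insertions,
    // which a fixed `$7` field lookup would miss.
    let sample = "@aaa1111\n 3 files changed, 10 insertions(+), 5 deletions(-)\n\
                  @bbb2222\n 1 file changed, 40 deletions(-)\n\
                  @ccc3333\n 1 file changed, 7 insertions(+)\n";
    match max_deletions(sample) {
        Some((hash, deletions)) => println!("{hash}\n{deletions}"),
        None => println!("no deletions found"),
    }
}
```

To use it on a real repository, `main` could read the piped `git log` output from standard input instead of the embedded sample.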

Queries on commits seem to run in a more reasonable amount of time, e.g.:

❯ time .cargo/bin/gitql --query "select count(author_name) from commits where author_name like '%matthew%'"
╭──────────╮
│ column_2 │
╞══════════╡
│ 1001     │
╰──────────╯

________________________________________________________
Executed in  357.45 millis    fish           external
   usr time  351.94 millis    0.00 micros  351.94 millis
   sys time    4.62 millis  641.00 micros    3.98 millis

To Reproduce

  1. Check out any large repo
  2. Run the example command above

Expected behavior
Speed is at least within an order of magnitude of git/awk

GQL (please complete the following information):
GitQL version 0.28.0


@AmrDeveloper AmrDeveloper added the enhancement New feature or request label Oct 3, 2024
@AmrDeveloper
Owner

Hello @mplanchard,

I totally agree that the diffs table should be faster, and this can be fixed in several ways:

  • Further optimisation of the diff provider code.
  • Finishing the logical plan and planner.
  • Support for calculating diffs across multiple threads.

For now, though, I plan to work step by step on general optimisations and broader feature coverage before optimising specific cases.

Once those features are in place, I think we will be able to run more customisable and faster queries.

Thank you,
Amr
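
The multi-threaded option mentioned in the reply above can be sketched with scoped threads from the Rust standard library. Everything here is an assumption made for illustration: `DiffStat`, `max_deletions_parallel`, and the `compute` callback (a stand-in for the expensive per-commit diff) are hypothetical names, not GitQL's actual types or API. Each worker keeps only its local maximum, so `order by deletions desc limit 1` needs neither a full sort nor the full result set in memory.

```rust
use std::thread;

/// Hypothetical per-commit statistics; fields loosely mirror the `diffs` table.
#[derive(Debug, Clone, PartialEq)]
struct DiffStat {
    commit_id: String,
    insertions: u64,
    deletions: u64,
}

/// Splits the commits into at most `workers` chunks, computes the stats for
/// each chunk on its own thread, and returns the stat with the most deletions.
/// `compute` stands in for the expensive per-commit diff calculation.
fn max_deletions_parallel<F>(commit_ids: &[String], workers: usize, compute: F) -> Option<DiffStat>
where
    F: Fn(&str) -> DiffStat + Sync,
{
    if commit_ids.is_empty() {
        return None;
    }
    let workers = workers.max(1);
    // Ceiling division so every commit lands in some chunk.
    let chunk_size = (commit_ids.len() + workers - 1) / workers;
    thread::scope(|scope| {
        // Spawn all workers first, then join them.
        let handles: Vec<_> = commit_ids
            .chunks(chunk_size)
            .map(|chunk| {
                let compute = &compute;
                scope.spawn(move || {
                    // Local maximum for this chunk only.
                    chunk
                        .iter()
                        .map(|id| compute(id.as_str()))
                        .max_by_key(|stat| stat.deletions)
                })
            })
            .collect();
        // Reduce the per-thread maxima to a single global maximum.
        handles
            .into_iter()
            .filter_map(|handle| handle.join().expect("worker thread panicked"))
            .max_by_key(|stat| stat.deletions)
    })
}

fn main() {
    // Fake commit ids and a fake workload; a real provider would diff each commit.
    let ids: Vec<String> = (0..1000u64).map(|i| i.to_string()).collect();
    let best = max_deletions_parallel(&ids, 4, |id| DiffStat {
        commit_id: id.to_string(),
        insertions: 0,
        deletions: id.parse::<u64>().unwrap_or(0),
    });
    println!("{best:?}");
}
```

Whether this helps in practice depends on whether the underlying Git library allows concurrent reads from the same repository, which the sketch does not address.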

@AmrDeveloper
Owner

GitQL 0.34.0 is now 50% faster and adds more functionality for diff content:

https://github.com/AmrDeveloper/GQL/releases/tag/0.34.0
