
Different approach with a few benefits #230

Open
notEvil opened this issue Aug 12, 2023 · 25 comments

notEvil commented Aug 12, 2023

Hi,

I've been tinkering with line_profiler and implemented a different approach to line profiling:

https://github.com/notEvil/line_profiler/tree/sub

Essentially, it keeps track of "calls" (PyTrace_CALL and PyTrace_RETURN are used for generators and async as well) and by doing so records "total time": time spent within a function call, generator or async step, excluding time spent in line profiled functions that were called. From there it calculates "cumulative time".

This approach has multiple benefits. The potential downside is increased overhead, but in my testing it wasn't much slower.
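The call/return bookkeeping described above can be sketched in pure Python with `sys.setprofile` (this is my own illustration of the idea, not code from the fork; the class and attribute names are made up):

```python
import sys
import time

# Illustrative sketch only: on every 'call' push a frame record, on every
# 'return' pop it, crediting the callee's inclusive time to the caller.
# This yields both exclusive "total" time (own work, excluding profiled
# callees) and inclusive "cumulative" time per function.
class MiniProfiler:
    def __init__(self):
        self.exclusive = {}  # name -> time spent in the function itself
        self.inclusive = {}  # name -> time including called functions
        self._stack = []     # [name, entry timestamp, time in callees]

    def _event(self, frame, event, arg):
        now = time.perf_counter()
        if event == "call":
            self._stack.append([frame.f_code.co_name, now, 0.0])
        elif event == "return" and self._stack:
            name, entry, child_time = self._stack.pop()
            total = now - entry
            self.inclusive[name] = self.inclusive.get(name, 0.0) + total
            self.exclusive[name] = (
                self.exclusive.get(name, 0.0) + total - child_time)
            if self._stack:
                self._stack[-1][2] += total  # credit time to the caller

    def run(self, func, *args):
        sys.setprofile(self._event)
        try:
            return func(*args)
        finally:
            sys.setprofile(None)


def g():
    time.sleep(0.05)

def f():
    time.sleep(0.05)
    g()  # g's time counts toward f's inclusive, not exclusive, time

prof = MiniProfiler()
prof.run(f)
# prof.inclusive["f"] is roughly 0.1 s, prof.exclusive["f"] roughly 0.05 s
```

The real implementation works in Cython on PyTrace events (including generator and async steps), but the stack-based attribution is the same idea.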

I've also created a Neovim plugin (not published yet) to see the differences:

import time


@profile
def f():
    time.sleep(0.1)
    (g(1), g(2))
    g(3)


@profile
def g(n):
    if n == 0:
        return

    time.sleep(0.1)
    g(n - 1)


f()

Original: [Screenshot from 2023-08-12 13-35-01]
New cumulative: [Screenshot from 2023-08-12 13-35-21]
New total: [Screenshot from 2023-08-12 13-35-37]

Tested on a real application (40s runtime, no async)

Erotemic (Member) commented:

It would help to see the `python -m line_profiler <script>.lprof` results in addition to the Neovim (heatmap?) visualization, to get a better sense of what is happening and what the difference is.

notEvil (Author) commented Aug 12, 2023

There you go

Original

Timer unit: 1e-06 s

Total time: 0.70161 s
File: /home/arappold/git/line_profiler/sub.py
Function: f at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def f():
     6         1     100060.9 100060.9     14.3      time.sleep(0.1)
     7         1     300288.5 300288.5     42.8      (g(1), g(2))
     8         1     301260.9 301260.9     42.9      g(3)

Total time: 0.601441 s
File: /home/arappold/git/line_profiler/sub.py
Function: g at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def g(n):
    13         9          6.2      0.7      0.0      if n == 0:
    14         3          0.9      0.3      0.0          return
    15                                           
    16         6     601357.4 100226.2    100.0      time.sleep(0.1)
    17         6         76.5     12.8      0.0      g(n - 1)

New cumulative

Timer unit: 1e-06 s

Total time: 0.701555 s
File: /home/arappold/git/line_profiler/sub.py
Function: f at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def f():
     6         1     100061.0 100061.0     14.3      time.sleep(0.1)
     7         1     300265.7 300265.7     42.8      (g(1), g(2))
     8         1     301228.6 301228.6     42.9      g(3)

Total time: 1.00325 s
File: /home/arappold/git/line_profiler/sub.py
Function: g at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def g(n):
    13         1          7.2      7.2      0.0      if n == 0:
    14         1          3.6      3.6      0.0          return
    15                                           
    16         1     601360.9 601360.9     59.9      time.sleep(0.1)
    17         1     401881.4 401881.4     40.1      g(n - 1)

The prototype doesn't yet count the number of hits (not difficult to add)

Theelx (Collaborator) commented Aug 12, 2023

That looks promising, but the tracing seems to have a 4x overhead based on the return function in the base case (assuming it was hit 3 times in your cumulative version). While that's not too significant in this case, it can be very significant in tighter loops, and it would be beneficial to create a tight loop to test this in.

notEvil (Author) commented Aug 12, 2023

How do you infer a 4x overhead from that output? I just tested

@profile
def f():
    for _ in range(int(1e7)):
        g()


@profile
def g():
    pass


f()

and it takes 22.37s with v4.0.3 and 18.53s with the new approach. The prototype runs the new approach in addition to the original one, so I temporarily disabled the original part for this crude benchmark.

Theelx (Collaborator) commented Aug 12, 2023

Ah, I looked at the return statement in the base case and it looked like it took 1.2us per hit in your version whereas the original took 0.3us per hit. Usually, very simple lines like a plain return statement are virtually instant in Python, and almost all the time taken is in the profiler.

Theelx (Collaborator) commented Aug 12, 2023

Here's a benchmark that shows around an 11x slowdown with the current line_profiler vs unprofiled code (total time ~500ms unprofiled vs ~5.4s profiled):

Time Elapsed tight: 494.96ms
Time Elapsed tight: 5390.15ms
Slowdown from profiling: 4895.19ms, 10.89x
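Reconstructed from the profiler listing, tight.py is essentially:

```python
import time

# tight.py, as reconstructed from the profiler output in this comment:
# a 20-million-iteration loop with a trivial body, so per-line profiler
# overhead dominates the runtime
def main():
    start = time.perf_counter()
    for x in range(20000000):
        y = x
    elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
    print(f"Time Elapsed tight: {elapsed_ms}ms")
    return elapsed_ms

if __name__ == "__main__":
    main()
```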

Total time: 5.39021 s
File: tight.py
Function: main at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def main():
     4         1          3.1      3.1      0.0      start = time.perf_counter()
     5  20000001    2628245.8      0.1     48.8      for x in range(20000000):
     6  20000000    2761886.3      0.1     51.2          y = x
     7         1         31.5     31.5      0.0      elapsed_ms = round((time.perf_counter()-start)*1000, 2)
     8         1         42.2     42.2      0.0      print(f"Time Elapsed tight: {elapsed_ms}ms")
     9         1          1.3      1.3      0.0      return elapsed_ms

Your code gives these results:

Time Elapsed tight: 498.77ms
Time Elapsed tight: 18367.76ms
Slowdown from profiling: 17868.989999999998ms, 36.83x
Wrote profile results to tight.py.lprof
Timer unit: 1e-06 s

Total time: 5.91982 s
File: /home/theel/Coding/line_profiler/line_profiler_cumulative/tight.py
Function: main at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def main():
     4         1          1.7      1.7      0.0      start = time.perf_counter()
     5  20000001    2961819.0      0.1     50.0      for x in range(20000000):
     6  20000000    2957938.7      0.1     50.0          y = x
     7         1         18.6     18.6      0.0      elapsed_ms = round((time.perf_counter()-start)*1000, 2)
     8         1         39.2     39.2      0.0      print(f"Time Elapsed tight: {elapsed_ms}ms")
     9         1          0.6      0.6      0.0      return elapsed_ms

I believe the total time elapsed shown by line_profiler is incorrect, as I definitely waited at least 14 seconds for yours to finish (based off watching the clock). This indicates that your code has a ~37x slowdown, which is ~3x more overhead than the current line_profiler.

Theelx (Collaborator) commented Aug 12, 2023

The thing to remember (which I forgot in my first comment here!) is that line_profiler's measurements of time taken in the output aren't actually the overhead: line_profiler calls hpTimer() to measure its own overhead and eliminate it from the output.
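The compensation idea can be sketched like this (my own illustration, not line_profiler's actual hpTimer code; the helper name is made up):

```python
import time

# Estimate the cost of one timestamp call by timing many of them.
def estimate_timer_overhead(samples=100_000):
    start = time.perf_counter()
    for _ in range(samples):
        time.perf_counter()
    return (time.perf_counter() - start) / samples

overhead = estimate_timer_overhead()

# Bracket the measured work with two timestamps, then subtract their
# estimated cost so the report excludes most of the profiler's own time.
t0 = time.perf_counter()
total = sum(range(1000))  # the "line" being measured
t1 = time.perf_counter()
measured = max(0.0, (t1 - t0) - 2 * overhead)
```

So the reported per-line times can be much smaller than the wall-clock slowdown the profiler actually causes.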

notEvil (Author) commented Aug 12, 2023

I can't reproduce your results. Could you run them again with the new commit? I changed a few things so you can get a fair comparison with your benchmark.

Theelx (Collaborator) commented Aug 12, 2023

This version is significantly better, but still about 1.5x slower:

Time Elapsed tight: 513.24ms
Time Elapsed tight: 9631.74ms
Slowdown from profiling: 9118.5ms, 18.77x
Wrote profile results to tight.py.lprof
Timer unit: 1e-06 s

Total time: 4.17773 s
File: /home/theel/Coding/line_profiler/line_profiler_cumulative/tight.py
Function: main at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def main():
     4         1          1.7      1.7      0.0      start = time.perf_counter()
     5         1    2005221.9 2005221.9     48.0      for x in range(20000000):
     6         1    2172444.6 2172444.6     52.0          y = x
     7         1         21.0     21.0      0.0      elapsed_ms = round((time.perf_counter()-start)*1000, 2)
     8         1         38.0     38.0      0.0      print(f"Time Elapsed tight: {elapsed_ms}ms")
     9         1          0.7      0.7      0.0      return elapsed_ms

notEvil (Author) commented Aug 12, 2023

The weird thing is, my results are the complete opposite: v4.0.3 gives me "Time Elapsed tight: 12030.22ms" while the new commit gives "Time Elapsed tight: 6169.69ms" (triple-checked). My process:

  • checkout tag/branch
  • pipenv --rm
  • pipenv run pip install /path/to/repo (Python 3.11.3)
  • pipenv run kernprof -l ...

Theelx (Collaborator) commented Aug 12, 2023

Ohhhhh, the issue is that I was using my fork from #203 instead of the current one. I can confirm that yours is faster than the version of line_profiler in this repo, but it would be slower if #203 were to be merged.

notEvil (Author) commented Aug 12, 2023

Great to hear :) I think most improvements can be carried over. I'll give it a try sometime next week.

Theelx (Collaborator) commented Aug 12, 2023

See notEvil#1

notEvil (Author) commented Aug 12, 2023

Nice, thanks! So if I understand correctly, performance is fine.

Regarding the differences from the current approach: the implementation at the bottom of _line_profiler.pyx is supposed to be readable. Just ignore the parts about call. If it's not clear, I can add comments pointing out the details.

notEvil (Author) commented Aug 13, 2023

The latest commit now counts the number of line hits and provides the times in a backwards compatible way.

Timer unit: 1e-06 s

Total time: 0.700474 s
File: /home/arappold/git/line_profiler/sub.py
Function: f at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def f():
     6         1     100057.6 100057.6     14.3      time.sleep(0.1)
     7         1     300195.1 300195.1     42.9      (g(1), g(2))
     8         1     300221.3 300221.3     42.9      g(3)

Total time: 1.00071 s
File: /home/arappold/git/line_profiler/sub.py
Function: g at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def g(n):
    13         9          2.4      0.3      0.0      if n == 0:
    14         3          0.5      0.2      0.0          return
    15                                           
    16         6     600365.7 100060.9     60.0      time.sleep(0.1)
    17         6     400342.9  66723.8     40.0      g(n - 1)

"Total time" reported for g is wrong because it's summing up cumulative times, which don't add up for recursive functions. Total times would, but the default output should remain cumulative times. So maybe remove the "% Time" column if it's a recursive function (this can be inferred from call stats)?
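The double counting is easy to reproduce with a little arithmetic. Assuming each non-base call of g sleeps exactly 0.1 s (as in the example), the per-line cumulative times in the listing above sum like this:

```python
def cumulative(n):
    # inclusive ("cumulative") time of one call g(n): n sleeps of 0.1 s
    return 0.0 if n == 0 else 0.1 + cumulative(n - 1)

# f() invokes g(1), g(2) and g(3); the recursive-call line is hit once
# per non-base activation, each time attributing the callee's cumulative
# time, so its column sums to roughly 0.4 s:
line_sum = sum(cumulative(n - 1) for top in (1, 2, 3)
               for n in range(top, 0, -1))

# whereas the actual work (all six sleeps) is only about 0.6 s:
sleep_sum = sum(cumulative(top) for top in (1, 2, 3))

# naively summing both line columns yields the inflated "Total time" of ~1.0 s
print(round(line_sum, 3), round(sleep_sum, 3))
```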

notEvil (Author) commented Aug 13, 2023

An example of what call stats would allow: graph.dot.pdf
created with https://github.com/notEvil/line_profiler/blob/sub/graph.py
edit: keep in mind that generators create "calls" on every next/yield, so calls can appear to go directly into sub-functions

notEvil (Author) commented Aug 16, 2023

I'm happy to announce the next iteration :) Getting useful numbers in case of recursive code was a lot more difficult than anticipated, but here they are nonetheless. The last commit

  • adds block_times: how much time is spent in each function,
  • adds call_times: how much time is spent in each called function,
  • and fixes "cumulative time" of lines

It is impossible to derive block_times and call_times from the line timings or from call stats, since in the presence of recursion cumulative times don't add up.

Now the output for the initial example is

Timer unit: 1e-06 s

Total time: 0.700534 s
File: /home/arappold/git/line_profiler/example1.py
Function: f at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def f():
     6         1     100061.6 100061.6     14.3      time.sleep(0.1)
     7         1     300238.6 300238.6     42.9      (g(1), g(2))
     8         1     300233.5 300233.5     42.9      g(3)

Total time: 0.600456 s
File: /home/arappold/git/line_profiler/example1.py
Function: g at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def g(n):
    13         9          3.6      0.4      0.0      if n == 0:
    14         3          0.8      0.3      0.0          return
    15                                           
    16         6     600380.3 100063.4    100.0      time.sleep(0.1)
    17         6     300267.2  50044.5     50.0      g(n - 1)
Tested with the following code:
import time


def wait(n):
    time.sleep(n * 0.01)


@profile
def f(n):
    f1(n)
    f2(n)
    f3(n)
    f4(n)
    f5(n)
    f6(n)


@profile
def f1(n):
    if n == 0:
        return

    wait(1)
    f1(n - 1)


@profile
def f2(n):
    if n == 0:
        return

    wait(1)
    (f2(n - 1), f2(n // 2))


@profile
def f3(n):
    if n == 0:
        return

    wait(1)
    f3_2(n - 1)


@profile
def f3_2(n):
    if n == 0:
        return

    wait(1)
    f3(n - 1)


@profile
def f4(n):
    if n == 0:
        return

    wait(1)
    (f4_2(n - 1), f4_2(n // 2))


@profile
def f4_2(n):
    if n == 0:
        return

    wait(1)
    (f4(n - 1), f4(n // 2))


@profile
def f5(n):
    if n == 0:
        return

    wait(1)
    f5_2(n - 1)


@profile
def f5_2(n):
    if n == 0:
        return

    wait(1)
    f5_3(n - 1)


@profile
def f5_3(n):
    if n == 0:
        return

    wait(1)
    f5(n - 1)


@profile
def f6(n):
    if n == 0:
        return

    wait(1)
    (f6_2(n - 1), f6_2(n // 2))


@profile
def f6_2(n):
    if n == 0:
        return

    wait(1)
    (f6_3(n - 1), f6_3(n // 2))


@profile
def f6_3(n):
    if n == 0:
        return

    wait(1)
    (f6(n - 1), f6(n // 2))


f(6)

graph.dot.pdf

Output
Timer unit: 1e-06 s

Total time: 0.754708 s
File: /home/arappold/git/line_profiler/sub.py
Function: f at line 8

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           @profile
     9                                           def f(n):
    10         1      60377.5  60377.5      8.0      f1(n)
    11         1     191180.4 191180.4     25.3      f2(n)
    12         1      60361.0  60361.0      8.0      f3(n)
    13         1     191203.2 191203.2     25.3      f4(n)
    14         1      60369.8  60369.8      8.0      f5(n)
    15         1     191216.3 191216.3     25.3      f6(n)

Total time: 0.0603742 s
File: /home/arappold/git/line_profiler/sub.py
Function: f1 at line 18

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    18                                           @profile
    19                                           def f1(n):
    20         7          1.3      0.2      0.0      if n == 0:
    21         1          0.1      0.1      0.0          return
    22                                           
    23         6      60350.0  10058.3    100.0      wait(1)
    24         6      50312.4   8385.4     83.3      f1(n - 1)

Total time: 0.191177 s
File: /home/arappold/git/line_profiler/sub.py
Function: f2 at line 27

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    27                                           @profile
    28                                           def f2(n):
    29        39          4.8      0.1      0.0      if n == 0:
    30        20          1.1      0.1      0.0          return
    31                                           
    32        19     191067.2  10056.2     99.9      wait(1)
    33        19     181120.1   9532.6     94.7      (f2(n - 1), f2(n // 2))

Total time: 0.0603579 s
File: /home/arappold/git/line_profiler/sub.py
Function: f3 at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    36                                           @profile
    37                                           def f3(n):
    38         4          0.6      0.1      0.0      if n == 0:
    39         1          0.1      0.1      0.0          return
    40                                           
    41         3      30169.0  10056.3     50.0      wait(1)
    42         3      50301.2  16767.1     83.3      f3_2(n - 1)

Total time: 0.0502975 s
File: /home/arappold/git/line_profiler/sub.py
Function: f3_2 at line 45

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    45                                           @profile
    46                                           def f3_2(n):
    47         3          0.4      0.1      0.0      if n == 0:
    48                                                   return
    49                                           
    50         3      30168.7  10056.2     60.0      wait(1)
    51         3      40241.4  13413.8     80.0      f3(n - 1)

Total time: 0.191199 s
File: /home/arappold/git/line_profiler/sub.py
Function: f4 at line 54

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    54                                           @profile
    55                                           def f4(n):
    56        21          2.8      0.1      0.0      if n == 0:
    57        12          0.7      0.1      0.0          return
    58                                           
    59         9      90512.6  10057.0     47.3      wait(1)
    60         9     181143.0  20127.0     94.7      (f4_2(n - 1), f4_2(n // 2))

Total time: 0.181137 s
File: /home/arappold/git/line_profiler/sub.py
Function: f4_2 at line 63

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    63                                           @profile
    64                                           def f4_2(n):
    65        18          2.2      0.1      0.0      if n == 0:
    66         8          0.4      0.0      0.0          return
    67                                           
    68        10     100576.7  10057.7     55.5      wait(1)
    69        10     161017.3  16101.7     88.9      (f4(n - 1), f4(n // 2))

Total time: 0.060366 s
File: /home/arappold/git/line_profiler/sub.py
Function: f5 at line 72

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    72                                           @profile
    73                                           def f5(n):
    74         3          0.4      0.1      0.0      if n == 0:
    75         1          0.1      0.1      0.0          return
    76                                           
    77         2      20113.6  10056.8     33.3      wait(1)
    78         2      50308.6  25154.3     83.3      f5_2(n - 1)

Total time: 0.0503043 s
File: /home/arappold/git/line_profiler/sub.py
Function: f5_2 at line 81

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    81                                           @profile
    82                                           def f5_2(n):
    83         2          0.3      0.2      0.0      if n == 0:
    84                                                   return
    85                                           
    86         2      20115.2  10057.6     40.0      wait(1)
    87         2      40247.2  20123.6     80.0      f5_3(n - 1)

Total time: 0.0402436 s
File: /home/arappold/git/line_profiler/sub.py
Function: f5_3 at line 90

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    90                                           @profile
    91                                           def f5_3(n):
    92         2          0.3      0.1      0.0      if n == 0:
    93                                                   return
    94                                           
    95         2      20113.8  10056.9     50.0      wait(1)
    96         2      30186.7  15093.3     75.0      f5(n - 1)

Total time: 0.191212 s
File: /home/arappold/git/line_profiler/sub.py
Function: f6 at line 99

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    99                                           @profile
   100                                           def f6(n):
   101        13          1.8      0.1      0.0      if n == 0:
   102         6          0.3      0.0      0.0          return
   103                                           
   104         7      70401.5  10057.4     36.8      wait(1)
   105         7     181155.2  25879.3     94.7      (f6_2(n - 1), f6_2(n // 2))

Total time: 0.181149 s
File: /home/arappold/git/line_profiler/sub.py
Function: f6_2 at line 108

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   108                                           @profile
   109                                           def f6_2(n):
   110        14          1.7      0.1      0.0      if n == 0:
   111         8          0.5      0.1      0.0          return
   112                                           
   113         6      60347.1  10057.8     33.3      wait(1)
   114         6     161035.2  26839.2     88.9      (f6_3(n - 1), f6_3(n // 2))

Total time: 0.161023 s
File: /home/arappold/git/line_profiler/sub.py
Function: f6_3 at line 117

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   117                                           @profile
   118                                           def f6_3(n):
   119        12          1.6      0.1      0.0      if n == 0:
   120         6          0.3      0.1      0.0          return
   121                                           
   122         6      60343.6  10057.3     37.5      wait(1)
   123         6     120793.6  20132.3     75.0      (f6(n - 1), f6(n // 2))

Theelx (Collaborator) commented Aug 16, 2023

Thanks for this great work! I'm going to rebase my PR to your fork on this latest work, could you take a look at it once I do that? After this work of yours is merged, I can start working on adding CPython 3.12 support to line_profiler.

notEvil (Author) commented Aug 16, 2023

Thanks :)
I would still consider it work in progress. The call stats feature is somewhat out of scope for line profiling, the naming breaks with former conventions (and differs from the more common "total" vs "cumulative" convention), some parts are pretty ugly, the default CLI output could make better use of the available information, ...
And there is performance, which has to be worse than before (when both are equally optimized).
It would be great if someone more experienced with the existing code base could review the fork and propose changes.

Theelx (Collaborator) commented Aug 16, 2023

I just tested the performance, and it is indeed significantly worse than before. Previously, notEvil#1 was able to pull it down to 13x overhead over no profiling in the tight loop case from above, whereas now, with the same optimizations applied, it'd be 16x overhead over no profiling. That being said, I believe I can optimize it further, as some of your data structure choices may be suboptimal performance-wise.


tmm1 commented Aug 21, 2023

@notEvil thanks for sharing this! Many years ago I wrote a line profiler for another dynamic language runtime, and I think I used a similar call/return stacking approach.

If you get a chance, would you mind taking a look at the merge conflicts from pulling in the latest main after the recent autoprofiler changes? I would love to try the two together.

notEvil (Author) commented Oct 17, 2023

@tmm1 sorry for not responding! I will check the conflicts soon, I promise ^^

I've pushed a new commit* which should improve confidence in the solution. The previous version was manually tested on a few examples with manual inspection of the output. graph.py was very helpful, but obviously we'd want something automated with a lot more variation. So I started the journey of implementing strategies for generating code with Hypothesis**, implementing a reference profiler and using it to validate the numbers. And it took a while to work as I wanted it to (statements spanning multiple lines, recursions, generators, async, ...), to look reasonably good and eventually for the tests to pass. So my apologies for keeping this private and not providing any updates.

The testing is not perfect:

  • It might generate lots of useless examples
  • It uses busy loops to simulate work
    • context switches introduce inaccuracies
    • busy loops are more accurate than time.sleep
    • inaccurate timings can and will eventually produce false positives
  • It doesn't validate call profiles (hit counts and cumulative time) and call stats
    • a "call" (CALL RETURN pair) might be a function call or a generator/coroutine continuation
    • it is not easy to add generator/coroutine "call"s to the reference profiler
    • I don't think it makes sense to further complicate the Cython code to distinguish them
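A busy loop of the kind described can look like this (the helper name is my own, not from the fork):

```python
import time

# Spin until a deadline instead of sleeping: unlike time.sleep, a busy
# loop is not subject to the scheduler's wakeup granularity, so short
# simulated workloads land much closer to the requested duration.
def busy_wait(seconds):
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass

start = time.perf_counter()
busy_wait(0.01)
elapsed = time.perf_counter() - start  # just over 0.01 s
```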

What I like:

  • The generated examples look very diverse and complex
  • The reference profiler is very simple

So now that it "works" and won't get any more complex, I'd say it's a good time to review and benchmark.

https://github.com/notEvil/line_profiler/tree/sub

* changes

  • added counting of hits for blocks and calls
  • added counting of primitive hits for lines, blocks and calls
  • added randomized tests
  • improved graph.py to show the additional information
  • renamed structs LineTime and Timing to CLineProfile and LineProfile respectively
  • added structs CBlockProfile and CCallProfile, and classes BlockProfile and CallProfile
  • fixed minor regressions

** If you don't know Hypothesis yet, I'd definitely recommend checking it out!

notEvil (Author) commented Oct 19, 2023

@tmm1 just merged main and, except for trivial conflicts, it went smoothly. The randomized tests still pass :) Beware though that line_profiler.py isn't adjusted yet and might show unexpected things, especially aggregates.

@Theelx you created a PR back then for better performance and comparisons. I didn't merge it because I didn't want to mix the two proposals, and it looked like it was based on #203 but without a git reference (merge). Would you consider the PR mature enough to merge with sub on a test branch?

notEvil (Author) commented Nov 3, 2023

Just a quick update: I tried to measure and improve performance and got some interesting results. The benchmark is https://gitlab.com/notEvil/code_collapse applied to itself, which spends most of the time in Black and the remaining time in nested generators (tree traversal) and LibCST.

| name | Python 3.11 | vs baseline | Python 3.12 | vs baseline |
| --- | --- | --- | --- | --- |
| baseline | 16.1 | | 34.3 | |
| v4.1.2 | 27.2 | 168.9% | 178.6 | 520.7% |
| sub | 28.2 | 175.2% | 167.1 | 487.2% |
| key opt | 28 | 173.9% | 161.6 | 471.1% |
| tss opt* | 27.6 | 171.4% | | |
| preshed opt* | 27.2 | 168.9% | | |
| map opt* | 26.3 | 163.4% | | |
| time opt* | 25.8 | 160.3% | | |
| all opt | 25.2 | 156.5% | 150.9 | 439.9% |
  • key opt: use address of bytecode << 24 + line number as key (with lower bits of address as offset for some randomness)
    • not worth it imo, but the first one I tried
  • tss opt: store thread id in thread local storage to save some threading.get_ident()
  • preshed opt: use Preshed for entry check
  • map opt: use parallel_flat_hash_map and flat_hash_map from parallel-hashmap instead of unordered_map
  • time opt: check time just after entry check instead of asap
  • all opt: all combined, see https://github.com/notEvil/line_profiler/tree/subopt
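A Python rendering of the key opt scheme (hypothetical sketch; the real implementation operates on C-level addresses inside the Cython code):

```python
# Combine the bytecode object's address with the line number into one
# integer key; identical bytecode in different code objects still yields
# distinct keys, and the low address bits perturb the hash distribution.
def line_key(code, lineno):
    return (id(code) << 24) + lineno
```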

As you can see, I wasn't able to significantly improve performance of sub. Maybe time opt is worth the tradeoff (creates some bias), what do you think?

I will try to figure out why Python 3.12 is performing so badly compared to 3.11 on my project. Regarding the bad performance when line profiling on 3.12, can you reproduce this? e.g. jsonpickle

Update 2023-11-04

Python 3.12 is a bit slower than 3.11 in general, but the large difference is due to Black not compiling for 3.12 (because of a bug in Mypyc). So the benchmark is flawed insofar as it presents a measure of slowdown that is not representative (most time is spent in compiled code, which doesn't add any overhead). Maybe we could run pyperformance somehow.

Also regarding key opt: it turns out it is not uncommon for short functions to have the exact same bytecode (get_metadata in pyperf and at least two in re) and therefore the same hash. So sub is currently broken (https://github.com/notEvil/line_profiler/blob/f0c17516e1012dd99e9e161a6a5057a97b60a02b/line_profiler/_line_profiler.pyx#L593) and the current line_profiler is affected if the line ranges of such functions intersect. key opt would solve this.
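The collision is easy to demonstrate: operands such as attribute names are encoded as indices into per-code-object tables (co_names), so two distinct short functions can share byte-identical bytecode:

```python
def get_a(obj):
    return obj.a

def get_b(obj):
    return obj.b

# same opcodes, same operand indices -> byte-for-byte identical bytecode ...
assert get_a.__code__.co_code == get_b.__code__.co_code
# ... even though the functions are observably different
assert get_a.__code__.co_names != get_b.__code__.co_names
```

Any key derived purely from the bytecode contents therefore cannot tell such functions apart, which is what the address-based key fixes.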

* either directly from #203 or inspired by

notEvil (Author) commented Nov 5, 2023

Good news I guess:

| name | baseline | v4.1.2 | sub | preshed opt | key opt | tss opt | time opt | map opt | subopt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| generators | 57.18 | 52.70 | 58.95 | 56.27 | 56.43 | 53.36 | 53.65 | 52.26 | 48.58 |
| deltablue | 4.46 | 36.44 | 32.43 | 30.90 | 30.04 | 28.51 | 28.67 | 28.87 | 25.98 |
| richards | 54.57 | 34.22 | 28.39 | 26.44 | 25.54 | 23.47 | 24.36 | 23.68 | 23.72 |
| richards_super | 66.02 | 31.38 | 26.71 | 24.68 | 23.29 | 21.88 | 22.82 | 22.03 | 21.61 |
| raytrace | 331.44 | 26.21 | 25.36 | 24.02 | 22.78 | 21.72 | 21.95 | 21.25 | 20.72 |
| scimark_sor | 135.53 | 26.55 | 24.00 | 23.27 | 22.11 | 20.61 | 21.08 | 20.71 | 20.41 |
| scimark_lu | 128.57 | 24.43 | 21.93 | 21.37 | 20.67 | 19.62 | 19.61 | 19.04 | 19.37 |
| hexiom | 7.08 | 24.97 | 22.51 | 21.27 | 20.02 | 18.92 | 19.62 | 18.90 | 18.42 |
| go | 154.55 | 31.02 | 23.95 | 21.68 | 21.09 | 19.64 | 20.74 | 19.70 | 18.35 |
| spectral_norm | 116.12 | 19.70 | 17.70 | 17.10 | 16.44 | 15.51 | 15.74 | 15.54 | 15.18 |
| logging_silent | 0.00 | 24.76 | 19.56 | 16.20 | 16.40 | 14.26 | 16.67 | 14.63 | 13.42 |
| scimark_monte_carlo | 75.35 | 23.64 | 16.81 | 15.02 | 14.47 | 13.15 | 14.32 | 13.40 | 12.67 |
| pyflate | 484.89 | 18.75 | 14.83 | 13.62 | 13.20 | 11.84 | 12.91 | 12.24 | 11.91 |
| float | 90.14 | 16.91 | 13.15 | 12.60 | 12.27 | 11.48 | 11.83 | 11.47 | 11.33 |
| deepcopy_memo | 0.04 | 25.66 | 17.43 | 13.73 | 14.40 | 12.50 | 14.69 | 13.26 | 11.31 |
| chaos | 75.99 | 17.22 | 13.79 | 12.68 | 12.63 | 11.41 | 12.09 | 11.56 | 11.24 |
| unpack_sequence | 0.00 | 19.93 | 12.92 | 11.42 | 11.29 | 10.12 | 11.33 | 10.69 | 10.10 |
| deepcopy_reduce | 0.00 | 18.36 | 13.14 | 10.96 | 11.23 | 9.99 | 11.39 | 10.40 | 9.33 |
| deepcopy | 0.40 | 18.00 | 12.54 | 10.02 | 10.59 | 9.40 | 11.00 | 9.37 | 8.31 |
| pprint_pformat | 1645.84 | 15.44 | 10.61 | 9.34 | 9.27 | 8.11 | 9.24 | 8.54 | 7.90 |
| logging_simple | 0.01 | 13.87 | 10.63 | 8.89 | 8.88 | 7.92 | 8.84 | 8.01 | 7.59 |
| pprint_safe_repr | 804.19 | 15.28 | 10.22 | 8.56 | 8.73 | 7.59 | 8.81 | 8.08 | 7.54 |
| logging_format | 0.01 | 12.97 | 10.10 | 8.41 | 8.42 | 7.58 | 8.42 | 7.61 | 7.25 |
| nbody | 112.98 | 14.13 | 8.85 | 7.67 | 7.67 | 6.83 | 7.76 | 7.32 | 6.74 |
| comprehensions | 0.03 | 7.25 | 7.35 | 7.12 | 7.03 | 6.73 | 6.31 | 6.52 | 6.30 |
| fannkuch | 426.18 | 14.32 | 8.48 | 7.33 | 7.16 | 6.16 | 7.23 | 7.00 | 6.18 |
| tomli_loads | 2337.04 | 6.11 | 6.12 | 6.78 | 6.14 | 6.03 | 4.79 | 5.57 | 5.60 |
| coroutines | 28.96 | 5.29 | 5.58 | 5.71 | 5.71 | 5.61 | 4.92 | 5.37 | 5.05 |
| genshi_text | 25.17 | 5.47 | 6.08 | 5.88 | 6.00 | 5.99 | 4.88 | 5.44 | 4.99 |
| scimark_sparse_mat_mult | 4.77 | 11.09 | 6.65 | 5.73 | 5.69 | 5.06 | 5.73 | 5.26 | 4.95 |
| scimark_fft | 358.69 | 10.09 | 6.39 | 5.61 | 5.46 | 4.83 | 5.54 | 5.17 | 4.79 |
| genshi_xml | 57.81 | 5.76 | 5.91 | 5.51 | 5.64 | 5.42 | 4.77 | 5.12 | 4.72 |
| chameleon | 7.54 | 5.41 | 5.66 | 5.63 | 5.57 | 5.37 | 4.40 | 5.11 | 4.67 |
| nqueens | 93.59 | 7.13 | 6.04 | 5.60 | 5.55 | 5.28 | 5.12 | 5.22 | 4.67 |
| meteor_contest | 119.01 | 9.93 | 5.96 | 5.21 | 5.19 | 4.56 | 5.23 | 4.88 | 4.39 |
| async_generators | 409.54 | 5.05 | 4.87 | 4.81 | 4.77 | 4.66 | 4.34 | 4.39 | 4.29 |
| crypto_pyaes | 82.58 | 8.73 | 5.38 | 4.89 | 4.69 | 4.27 | 4.68 | 4.50 | 4.21 |
| sqlglot_parse | 1.58 | 4.52 | 4.67 | 4.66 | 4.66 | 4.68 | 3.99 | 4.30 | 4.19 |
| coverage | 82.92 | 4.06 | 4.06 | 4.08 | 4.08 | 4.05 | 4.10 | 4.03 | 4.05 |
async_tree_none_tg 536.83 7.00 5.52 4.77 4.86 4.14 4.53 4.18 3.96
sqlglot_transpile 1.92 4.31 4.43 4.43 4.41 4.44 3.77 4.05 3.94
mako 11.94 4.36 4.44 4.38 4.45 4.41 3.82 4.11 3.91
xml_etree_process 62.94 4.72 4.57 4.76 4.64 4.55 3.70 4.29 3.88
async_tree_memoization_tg 718.20 6.29 4.99 4.41 4.43 3.98 4.48 3.99 3.75
sqlglot_normalize 123.56 3.98 4.17 4.31 4.11 4.08 3.55 3.77 3.69
async_tree_none 603.23 6.02 4.83 4.20 4.20 3.84 4.17 3.85 3.69
async_tree_memoization 765.38 5.85 4.93 4.11 4.16 3.80 4.27 3.79 3.64
async_tree_io_tg 1354.10 5.92 4.79 4.20 4.24 3.82 4.18 3.85 3.57
async_tree_io 1407.59 5.66 4.64 4.01 4.07 3.70 4.02 3.70 3.51
sqlglot_optimize 60.38 3.86 3.97 4.00 3.99 3.94 3.43 3.55 3.48
html5lib 67.36 4.46 4.00 3.85 3.86 3.77 3.45 3.66 3.38
async_tree_cpu_io_mixed_tg 823.45 5.33 4.33 3.72 3.75 3.41 3.71 3.40 3.24
async_tree_cpu_io_mixed 850.96 5.24 4.14 3.59 3.62 3.33 3.60 3.32 3.17
xml_etree_generate 91.35 3.78 3.61 3.89 3.65 3.63 3.00 3.41 3.12
mdp 3100.88 4.11 3.64 3.41 3.39 3.24 3.14 3.15 3.01
json_dumps 13.07 5.30 3.89 3.24 3.36 2.99 3.38 3.10 2.84
sqlite_synth 0.00 3.98 3.18 3.01 3.01 2.81 2.95 2.88 2.78
telco 7.70 5.50 3.72 3.07 3.03 2.72 3.08 2.95 2.77
sympy_expand 541.44 2.84 2.96 3.05 2.96 2.92 2.51 2.76 2.76
sympy_str 334.32 2.75 2.88 3.01 2.89 2.86 2.47 2.68 2.71
sympy_integrate 23.89 2.81 2.95 3.10 2.98 2.95 2.44 2.74 2.70
typing_runtime_protocols 0.54 2.83 2.82 2.94 2.76 2.76 2.51 2.66 2.63
pathlib 20.81 3.95 3.37 2.94 3.03 2.74 2.99 2.79 2.62
dulwich_log 82.35 3.18 2.96 2.87 2.87 2.76 2.64 2.63 2.59
sqlalchemy_imperative 22.06 2.80 2.75 2.67 2.70 2.70 2.45 2.49 2.45
sympy_sum 196.00 2.39 2.48 2.54 2.47 2.45 2.17 2.31 2.30
create_gc_cycles 1.29 2.66 1.95 2.87 2.48 2.33 1.91 2.38 2.27
tornado_http 146.44 2.48 2.43 2.60 2.36 2.33 2.18 2.25 2.26
sqlalchemy_declarative 154.12 2.38 2.39 2.47 2.37 2.35 2.12 2.22 2.24
docutils 3014.27 2.41 2.45 2.46 2.43 2.41 2.14 2.32 2.20
xml_etree_iterparse 126.73 2.22 2.29 2.39 2.32 2.25 1.94 2.18 2.05
regex_v8 23.23 2.22 1.79 1.70 1.72 1.62 1.70 1.63 1.58
json_loads 0.03 1.88 1.65 1.52 1.53 1.47 1.55 1.46 1.43
pidigits 204.68 1.55 1.46 1.43 1.41 1.38 1.39 1.39 1.38
gc_traversal 4.39 1.52 1.35 1.45 1.50 1.31 1.28 1.38 1.35
asyncio_tcp 969.36 nan nan nan nan 1.40 1.36 1.33 1.30
xml_etree_parse 182.39 1.24 1.25 1.31 1.25 1.26 1.17 1.22 1.21
regex_effbot 3.18 1.26 1.18 1.14 1.14 1.13 1.15 1.13 1.12
asyncio_tcp_ssl 2924.68 nan nan nan nan 1.12 1.12 1.11 1.10
2to3 318.27 0.97 0.95 0.98 0.98 0.98 0.96 0.93 1.06
pickle 0.01 1.19 1.11 1.09 1.08 1.06 1.09 1.08 1.06
unpickle 0.01 1.13 1.07 1.05 1.05 1.04 1.05 1.06 1.05
pickle_list 0.00 1.12 1.07 1.05 1.05 1.04 1.05 1.05 1.04
unpickle_list 0.01 1.09 1.05 1.04 1.04 1.05 1.04 1.02 1.03
asyncio_websockets 643.37 1.06 1.05 1.05 nan 1.00 1.01 1.01 1.01
regex_dna 187.26 0.99 1.00 0.99 0.99 0.99 0.99 0.99 1.00
pickle_dict 0.03 1.02 1.01 1.00 1.00 1.00 1.00 1.00 1.00
python_startup 11.26 0.98 0.95 1.00 0.97 0.96 0.95 0.95 0.98
python_startup_no_site 8.62 0.98 0.95 0.95 0.99 0.95 0.95 0.96 0.97
bench_mp_pool 14.86 nan nan nan nan nan nan nan nan
bench_thread_pool 1.22 nan nan nan nan nan nan nan nan
dask 500.97 nan nan nan nan nan nan nan nan
django_template 37.73 nan nan nan nan nan nan nan nan
pickle_pure_python 0.34 nan nan nan nan nan nan nan nan
regex_compile 150.42 nan nan nan nan nan nan nan nan
unpickle_pure_python 0.26 nan nan nan nan nan nan nan nan

Average factor (geometric mean, complete case only)

subopt         4.251947
map opt        4.499286
time opt       4.535526
tss opt        4.547399
key opt        4.908084
preshed opt    4.987763
sub            5.329236
v4.1.2         6.309396
dtype: float64
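For reference, the "average factor" above is a geometric mean of the per-benchmark slowdown factors, taken over the benchmarks where every variant produced a result (complete case). A minimal sketch of that computation, with illustrative values rather than the full result set:

```python
import math


def geometric_mean(factors):
    # exp of the mean of logs; the appropriate average for ratios,
    # since a 2x and an 8x slowdown should average to 4x, not 5x
    return math.exp(sum(math.log(f) for f in factors) / len(factors))


# e.g. slowdown factors of one variant on two benchmarks
assert abs(geometric_mean([2.0, 8.0]) - 4.0) < 1e-9
```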

I managed to adjust pyperformance so it uses line_profiler (with autoprofile) and ran it*. See https://github.com/notEvil/line_profiler/tree/sub/pyperformance and https://github.com/notEvil/pyperformance/tree/line_profiler if you want to do the same. Set up the Pipenv environment and run prepare.py to prepare an environment. Then run pyperformance in that environment.

The column baseline is the average time in ms; the others are scaling factors (e.g. 1.5 means 1.5 * baseline). Some benchmarks like 2to3 spawn subprocesses, which explains some of the ~1 factors. Didn't investigate the factors < 1 yet. Nevermind, I should've used pyperf to load the JSON from the start.
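To make the table semantics concrete: each cell is the ratio of profiled runtime to baseline runtime, so the absolute profiled time is recovered by multiplying. A small illustration using the richards row from above:

```python
baseline_ms = 54.57  # "baseline" column: mean time in ms without profiling
factor = 34.22       # "v4.1.2" column: slowdown factor under line_profiler

profiled_ms = factor * baseline_ms  # absolute profiled runtime in ms
assert abs(profiled_ms - 1867.3854) < 1e-6
```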

Update 2023-11-06

Added the opt variants and sorted by average factor. Interestingly, the success of time opt points to benchmarks where the entry check fails a lot. However, map opt is on par and doesn't add bias. So overall my take is: preshed opt is not worth it**, time opt might be, and map opt, tss opt and key opt are. What is still strange is that sub and key opt differ so much although they use a very similar approach. (sub used id() while key opt uses a type cast; with the type cast, sub is a lot closer to key opt now.)

* Run on an Intel 4570S with 3 benchmarks in parallel (1 idle core). Most nan values are due to conflicts (e.g. binding to a port).
** not better than key opt
