Bench/new test (#41)

* add: benchmark
shenxiangzhuang · Apr 29, 2024 · b0fb3f4 · b0fb3f4
1 parent 5d1cbe6
commit b0fb3f4
Show file tree

Hide file tree

Showing 12 changed files with 468 additions and 43 deletions.
diff --git a/README.md b/README.md
@@ -53,44 +53,73 @@ print(results)
 ```
 
 ## Benchmark
+
+**TLDR: We got more than 10x speedup when the corpus size beyond 100K**
+
+<p align="center">
+  <img src="./benchmark/bench.png" alt="Benchmark" width="400" height="300">
+</p>
+
 We use the demo data shown in quick start to do this simple benchmark.
 You can check the [benchmark/simple](./benchmark/simple) for the benchmark source code.
 
-- Benchmark1: bleuscore
-- Benchmark2: huggingface evaluate bleu algorithm in **local**
-- Benchmark3: sacrebleu
+- `rs_bleuscore`: bleuscore python library
+- `local_hf_bleu`: huggingface evaluate bleu algorithm in **local**
+- `sacre_bleu`: sacrebleu
   - Note that we got different result with sacrebleu in the simple demo data and all the rests have same result
-- Benchmark4: huggingface evaluate bleu algorithm with **evaluate** package
+- `hf_evaluate`: huggingface evaluate bleu algorithm with **evaluate** package
 
 
 The `N` is used to enlarge the predictions/references size by simply duplication the demo data as shown before.
-
 We can see that as `N` increase, the bleuscore gets better performance.
+You can navigate [benchmark](./benchmark/README.md) for more benchmark details.
 
-### N=1
 
-<div style="text-align: center;">
-    <img width="80%" src="asset/benchmark/n_1.png">
-</div>
 
 ### N=100
-We will only test the bleuscore and evaluate **local** results from here, 
-because the other two methods are too slow to test quickly.
-
-<div style="text-align: center;">
-    <img width="80%" src="asset/benchmark/n_100.png">
-</div>
 
-### N=10,000
+```bash
+hyhyperfine --warmup 5 --runs 10   \
+  "python simple/rs_bleuscore.py 100" \
+  "python simple/local_hf_bleu.py 100" \
+  "python simple/sacre_bleu.py 100"   \
+  "python simple/hf_evaluate.py 100"
+
+Benchmark 1: python simple/rs_bleuscore.py 100
+  Time (mean ± σ):      19.0 ms ±   2.6 ms    [User: 17.8 ms, System: 5.3 ms]
+  Range (min … max):    14.8 ms …  23.2 ms    10 runs
+
+Benchmark 2: python simple/local_hf_bleu.py 100
+  Time (mean ± σ):      21.5 ms ±   2.2 ms    [User: 19.0 ms, System: 2.5 ms]
+  Range (min … max):    16.8 ms …  24.1 ms    10 runs
+
+Benchmark 3: python simple/sacre_bleu.py 100
+  Time (mean ± σ):      45.9 ms ±   2.2 ms    [User: 38.7 ms, System: 7.1 ms]
+  Range (min … max):    43.5 ms …  50.9 ms    10 runs
+
+Benchmark 4: python simple/hf_evaluate.py 100
+  Time (mean ± σ):      4.504 s ±  0.429 s    [User: 0.762 s, System: 0.823 s]
+  Range (min … max):    4.163 s …  5.446 s    10 runs
+
+Summary
+  python simple/rs_bleuscore.py 100 ran
+    1.13 ± 0.20 times faster than python simple/local_hf_bleu.py 100
+    2.42 ± 0.35 times faster than python simple/sacre_bleu.py 100
+  237.68 ± 39.88 times faster than python simple/hf_evaluate.py 100
+```
 
-<div style="text-align: center;">
-    <img width="80%" src="asset/benchmark/n_10000.png">
-</div>
+### N = 1K ~ 1M
 
-### N=100,000
+| Command                                  |       Mean [ms] | Min [ms] | Max [ms] |        Relative |
+|:-----------------------------------------|----------------:|---------:|---------:|----------------:|
+| `python simple/rs_bleuscore.py 1000`     |      20.3 ± 1.3 |     18.2 |     21.4 |            1.00 |
+| `python simple/local_hf_bleu.py 1000`    |      45.8 ± 1.2 |     44.2 |     47.5 |     2.26 ± 0.16 |
+| `python simple/rs_bleuscore.py 10000`    |      37.8 ± 1.5 |     35.9 |     39.5 |     1.87 ± 0.14 |
+| `python simple/local_hf_bleu.py 10000`   |     295.0 ± 5.9 |    288.6 |    304.2 |    14.55 ± 0.98 |
+| `python simple/rs_bleuscore.py 100000`   |     219.6 ± 3.3 |    215.3 |    224.0 |    10.83 ± 0.72 |
+| `python simple/local_hf_bleu.py 100000`  |   2781.4 ± 42.2 |   2723.1 |   2833.0 |   137.13 ± 9.10 |
+| `python simple/rs_bleuscore.py 1000000`  |   2048.8 ± 31.4 |   2013.2 |   2090.3 |   101.01 ± 6.71 |
+| `python simple/local_hf_bleu.py 1000000` | 28285.3 ± 100.9 |  28182.1 |  28396.1 | 1394.51 ± 90.21 |
 
-<div style="text-align: center;">
-    <img width="80%" src="asset/benchmark/n_100000.png">
-</div>
 
 
diff --git a/benchmark/README.md b/benchmark/README.md
@@ -0,0 +1,124 @@
+# Benchmark Result
+
+## N=100
+
+```bash
+hyhyperfine --warmup 5 --runs 10   \
+  "python simple/rs_bleuscore.py 100" \
+  "python simple/local_hf_bleu.py 100" \
+  "python simple/sacre_bleu.py 100"   \
+  "python simple/hf_evaluate.py 100"
+
+Benchmark 1: python simple/rs_bleuscore.py 100
+  Time (mean ± σ):      19.0 ms ±   2.6 ms    [User: 17.8 ms, System: 5.3 ms]
+  Range (min … max):    14.8 ms …  23.2 ms    10 runs
+
+Benchmark 2: python simple/local_hf_bleu.py 100
+  Time (mean ± σ):      21.5 ms ±   2.2 ms    [User: 19.0 ms, System: 2.5 ms]
+  Range (min … max):    16.8 ms …  24.1 ms    10 runs
+
+Benchmark 3: python simple/sacre_bleu.py 100
+  Time (mean ± σ):      45.9 ms ±   2.2 ms    [User: 38.7 ms, System: 7.1 ms]
+  Range (min … max):    43.5 ms …  50.9 ms    10 runs
+
+Benchmark 4: python simple/hf_evaluate.py 100
+  Time (mean ± σ):      4.504 s ±  0.429 s    [User: 0.762 s, System: 0.823 s]
+  Range (min … max):    4.163 s …  5.446 s    10 runs
+
+Summary
+  python simple/rs_bleuscore.py 100 ran
+    1.13 ± 0.20 times faster than python simple/local_hf_bleu.py 100
+    2.42 ± 0.35 times faster than python simple/sacre_bleu.py 100
+  237.68 ± 39.88 times faster than python simple/hf_evaluate.py 100
+```
+
+## N = 1K ~ 1M
+
+| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
+|:---|---:|---:|---:|---:|
+| `python simple/rs_bleuscore.py 1000` | 20.3 ± 1.3 | 18.2 | 21.4 | 1.00 |
+| `python simple/local_hf_bleu.py 1000` | 45.8 ± 1.2 | 44.2 | 47.5 | 2.26 ± 0.16 |
+| `python simple/rs_bleuscore.py 10000` | 37.8 ± 1.5 | 35.9 | 39.5 | 1.87 ± 0.14 |
+| `python simple/local_hf_bleu.py 10000` | 295.0 ± 5.9 | 288.6 | 304.2 | 14.55 ± 0.98 |
+| `python simple/rs_bleuscore.py 100000` | 219.6 ± 3.3 | 215.3 | 224.0 | 10.83 ± 0.72 |
+| `python simple/local_hf_bleu.py 100000` | 2781.4 ± 42.2 | 2723.1 | 2833.0 | 137.13 ± 9.10 |
+| `python simple/rs_bleuscore.py 1000000` | 2048.8 ± 31.4 | 2013.2 | 2090.3 | 101.01 ± 6.71 |
+| `python simple/local_hf_bleu.py 1000000` | 28285.3 ± 100.9 | 28182.1 | 28396.1 | 1394.51 ± 90.21 |
+
+
+
+
+### N=1000
+```bash
+hyhyperfine --warmup 5 --runs 10   \
+  "python simple/rs_bleuscore.py 1000" \
+  "python simple/local_hf_bleu.py 1000" \
+
+Benchmark 1: python simple/rs_bleuscore.py 1000
+  Time (mean ± σ):      20.8 ms ±   2.3 ms    [User: 31.5 ms, System: 11.3 ms]
+  Range (min … max):    17.4 ms …  24.1 ms    10 runs
+
+Benchmark 2: python simple/local_hf_bleu.py 1000
+  Time (mean ± σ):      44.8 ms ±   2.1 ms    [User: 41.3 ms, System: 3.4 ms]
+  Range (min … max):    41.2 ms …  48.0 ms    10 runs
+
+Summary
+  python simple/rs_bleuscore.py 1000 ran
+    2.15 ± 0.26 times faster than python simple/local_hf_bleu.py 1000
+```
+
+### N=10,000
+```bash
+hyhyperfine --warmup 5 --runs 10   \
+  "python simple/rs_bleuscore.py 10000" \
+  "python simple/local_hf_bleu.py 10000" \
+
+Benchmark 1: python simple/rs_bleuscore.py 10000
+  Time (mean ± σ):      38.3 ms ±   1.5 ms    [User: 132.4 ms, System: 102.2 ms]
+  Range (min … max):    36.3 ms …  40.3 ms    10 runs
+
+Benchmark 2: python simple/local_hf_bleu.py 10000
+  Time (mean ± σ):     298.9 ms ±   6.2 ms    [User: 295.5 ms, System: 3.1 ms]
+  Range (min … max):   289.6 ms … 312.2 ms    10 runs
+
+Summary
+  python simple/rs_bleuscore.py 10000 ran
+    7.79 ± 0.35 times faster than python simple/local_hf_bleu.py 10000
+```
+
+### N=100,000
+```bash
+hyhyperfine --warmup 5 --runs 10   \
+  "python simple/rs_bleuscore.py 100000" \
+  "python simple/local_hf_bleu.py 100000" \
+
+Benchmark 1: python simple/rs_bleuscore.py 100000
+  Time (mean ± σ):     222.2 ms ±   3.5 ms    [User: 962.7 ms, System: 1107.9 ms]
+  Range (min … max):   216.0 ms … 228.7 ms    10 runs
+
+Benchmark 2: python simple/local_hf_bleu.py 100000
+  Time (mean ± σ):      2.784 s ±  0.048 s    [User: 2.773 s, System: 0.010 s]
+  Range (min … max):    2.738 s …  2.896 s    10 runs
+
+Summary
+  python simple/rs_bleuscore.py 100000 ran
+   12.53 ± 0.29 times faster than python simple/local_hf_bleu.py 100000
+```
+
+### N=1,000,000
+```bash
+hyhyperfine --warmup 1 --runs 3   \
+  "python simple/rs_bleuscore.py 1000000" \
+  "python simple/local_hf_bleu.py 1000000"
+Benchmark 1: python simple/rs_bleuscore.py 1000000
+  Time (mean ± σ):      2.018 s ±  0.042 s    [User: 10.077 s, System: 10.327 s]
+  Range (min … max):    1.988 s …  2.066 s    3 runs
+
+Benchmark 2: python simple/local_hf_bleu.py 1000000
+  Time (mean ± σ):     27.760 s ±  0.373 s    [User: 27.687 s, System: 0.065 s]
+  Range (min … max):   27.335 s … 28.036 s    3 runs
+
+Summary
+  python simple/rs_bleuscore.py 1000000 ran
+   13.76 ± 0.34 times faster than python simple/local_hf_bleu.py 1000000
+```
diff --git a/benchmark/bench.png b/benchmark/bench.png
diff --git a/benchmark/bench.sh b/benchmark/bench.sh
@@ -1,5 +1,16 @@
-hyperfine --warmup 5 --runs 100   \
-  "python simple/rs_bleuscore.py" \
-  "python simple/local_hf_bleu.py" \
-  "python simple/sacre_bleu.py"   \
-  "python simple/hf_evaluate.py"
+hyperfine --warmup 5 --runs 10   \
+  "python simple/rs_bleuscore.py 100" \
+  "python simple/local_hf_bleu.py 100" \
+  "python simple/sacre_bleu.py 100"   \
+  "python simple/hf_evaluate.py 100"
+
+
+hyperfine --warmup 1 --runs 5 --export-markdown result.md --export-json result.json  \
+  "python simple/rs_bleuscore.py 1000" \
+  "python simple/local_hf_bleu.py 1000" \
+  "python simple/rs_bleuscore.py 10000" \
+  "python simple/local_hf_bleu.py 10000" \
+  "python simple/rs_bleuscore.py 100000" \
+  "python simple/local_hf_bleu.py 100000" \
+  "python simple/rs_bleuscore.py 1000000" \
+  "python simple/local_hf_bleu.py 1000000"
diff --git a/benchmark/bench_plot.py b/benchmark/bench_plot.py
@@ -0,0 +1,45 @@
+import json
+import matplotlib.pyplot as plt
+
+
+if __name__ == "__main__":
+
+    # Load the data
+    with open('result.json') as f:
+        data = json.load(f)
+
+    # Extract the mean times for each N
+    N_values = [1000, 10000, 100000, 1000000]
+    local_hf_bleu_times = []
+    rs_bleuscore_times = []
+    for result in data['results']:
+        N = int(result['command'].split()[-1])
+        if N in N_values:
+            if 'local_hf_bleu' in result['command']:
+                local_hf_bleu_times.append((N, result['mean']))
+            elif 'rs_bleuscore' in result['command']:
+                rs_bleuscore_times.append((N, result['mean']))
+    # Calculate the speedup multiples
+    speed_up_multiples = []
+    for i in range(len(N_values)):
+        local_hf_bleu_time = local_hf_bleu_times[i][1]
+        rs_bleuscore_time = rs_bleuscore_times[i][1]
+        speed_up_multiple = local_hf_bleu_time / rs_bleuscore_time
+        speed_up_multiples.append(speed_up_multiple)
+
+    # Plot the bar plot
+    plt.bar(N_values, [t[1] for t in local_hf_bleu_times], label='local_hf_bleu', width=N_values, color='#3498db')
+    plt.bar(N_values, [t[1] for t in rs_bleuscore_times], label='rs_bleuscore', width=N_values, color='#f1c40f')
+    plt.xscale('log')  # log scale for x-axis
+    plt.xlabel('Corpus Size (N)', fontsize=14)
+    plt.ylabel('Mean Time (Second)', fontsize=14)
+    plt.title('Performance Comparison', fontsize=14)
+    plt.legend(fontsize=14)
+    plt.tight_layout()
+
+    # Add the speedup multiples as annotations
+    for i, N in enumerate(N_values):
+        plt.annotate(f'x{speed_up_multiples[i]:.2f}', (N, local_hf_bleu_times[i][1]),
+                     ha='center', va='bottom', fontsize=12)
+
+    plt.savefig("bench.png", dpi=300)