Updated radix-sort benchmark with CUDA8 RC results
Bulat-Ziganshin committed Jun 18, 2016
1 parent 1d362aa commit 2595dc6
Showing 6 changed files with 252 additions and 144 deletions.
4 changes: 2 additions & 2 deletions algo_st/README.md
@@ -9,8 +9,8 @@ Further optimizations:

Use some combination of the following ideas to shave off remaining times over 65 ms
- overload pre/post-sorting procedures and RLE compression with memcpy
- process only 4-byte elements at last sorting stages, and simultaneously copy-in next block to process -
4-byte sorting should also be faster than sorting of 4+4 (key+value) bytes (43 ms)
- process only 4-byte elements at last sorting stages, and simultaneously copy-in next block to process -
4-byte sorting should also be faster than sorting of 4+4 (key+value) bytes (43 ms total instead of 65 ms)
- use zero-copy memory instead of copy in/out

So, after all optimizations, ST4 should become more than 3x faster!
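
The "simultaneously copy-in next block to process" idea in the list above is ordinary double buffering. Below is a minimal host-side sketch of that shape only, not the project's GPU code: `copy_in` and `sort_blocks_overlapped` are hypothetical names, `std::sort` stands in for the sorting stage, and `std::async` stands in for what would be an asynchronous host-to-device copy on a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-in for the "copy-in" step: copies one block of the input.
std::vector<unsigned> copy_in(const std::vector<unsigned>& input,
                              std::size_t offset, std::size_t block) {
    std::size_t end = std::min(input.size(), offset + block);
    return std::vector<unsigned>(input.begin() + offset, input.begin() + end);
}

// Sorts every block independently while the next block is fetched in the
// background, so at most one fetch is in flight during each sort.
std::vector<unsigned> sort_blocks_overlapped(const std::vector<unsigned>& input,
                                             std::size_t block) {
    assert(block > 0);
    std::vector<unsigned> output;
    output.reserve(input.size());
    if (input.empty()) return output;

    // Prefetch the first block before entering the loop.
    auto next = std::async(std::launch::async, copy_in,
                           std::cref(input), std::size_t{0}, block);
    for (std::size_t offset = 0; offset < input.size(); offset += block) {
        std::vector<unsigned> current = next.get();   // wait for the pending fetch
        if (offset + block < input.size())             // start fetching the following block
            next = std::async(std::launch::async, copy_in,
                              std::cref(input), offset + block, block);
        std::sort(current.begin(), current.end());     // "process" the current block
        output.insert(output.end(), current.begin(), current.end());
    }
    return output;
}
```

Calling `sort_blocks_overlapped(data, 4)` on six elements sorts the first four and the last two separately. Whether the overlap hides the copy time depends on which of the two steps is slower, which is exactly what the timings above would have to confirm.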
3 changes: 3 additions & 0 deletions app_bslab/README.md
@@ -6,6 +6,7 @@
[results-cpu.txt]: results-cpu.txt
[results-cuda.txt]: results-cuda.txt
[profile.txt]: profile.txt
[bench.cmd]: bench.cmd


BSLab stands for the block-sorting laboratory.
@@ -70,6 +71,8 @@ Also, we see that
- [results-cuda.txt] are my CUDA GPU results
- [profile.txt] are my CUDA GPU profiling report (only MTF kernels are included)

See [bench.cmd] for benchmarking/profiling cmdlines.


### x64: enwik9 results on Haswell i7-4770
```
1 change: 1 addition & 0 deletions app_bslab/bench.cmd
@@ -1,2 +1,3 @@
for %a in (boost e8 e9 100m 1g 1g.tor3) do for %x in (-x64-avx2.exe -x64.exe -avx2.exe .exe) do for %c in (icl clang gcc msvc) do bslab-%c%x z:\%a
for %a in (boost e8 e9 100m 1g 1g.tor3) do for %e in (bslab-cuda-x64.exe bslab-cuda.exe) do %e -nogpu z:\%a
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvprof.exe" --events all --metrics all --log-file profile --replay-mode application bslab-cuda-x64.exe -bwt11 -lzp3 -mtf-1-4 z:\e8
2 changes: 1 addition & 1 deletion app_bslab/bslab.cpp
@@ -127,7 +127,7 @@ int main (int argc, char **argv)
bufsize <<= 20; // if value is small enough, consider it as mebibytes

if (!(argc==2 || argc==3) || error) {
printf ("BSL: the block-sorting lab. Part of https://github.com/Bulat-Ziganshin/Compression-Research\n"
printf ("BSL: the block-sorting lab 1.0 (June 18 2016). Part of https://github.com/Bulat-Ziganshin/Compression-Research\n"
"Usage: bsl [options] infile [outfile]\n"
" -bN buffer N (mebi)bytes (default %d MiB - reduce if program fails)\n"
" -nogpu skip GPU name output\n"
87 changes: 44 additions & 43 deletions app_radix_sort/README.md
@@ -10,52 +10,53 @@ First column has the format "N/K" for sorting K-byte keys by N bytes.
It has format "N/K+V" for sorting with extra V-byte values attached to the keys.


### Current x86 results with CUDA 7.5 and CUB 1.5.2
### x64 results with CUDA 8.0RC and CUB 1.5.2

(x64 version is a few percents slower due to need to manage larger pointers)
(x64 version is a few percents slower than x86 due to need to manage larger pointers).
For full results see [results.txt](results.txt).

```
GeForce GTX 560 Ti, CC 2.1. VRAM 1.0 GB, 2004 MHz * 256-bit = 128 GB/s. 8 SM * 48 alu * 1800 MHz * 2 = 1.38 TFLOPS
Sorting 16M elements:
1/4 : Throughput = 3630.096 MElements/s, Time = 4.622 ms
2/4 : Throughput = 1807.721 MElements/s, Time = 9.281 ms
3/4 : Throughput = 1325.778 MElements/s, Time = 12.655 ms
4/4 : Throughput = 941.682 MElements/s, Time = 17.816 ms
1/8 : Throughput = 2033.248 MElements/s, Time = 8.251 ms
2/8 : Throughput = 1013.995 MElements/s, Time = 16.546 ms
3/8 : Throughput = 729.117 MElements/s, Time = 23.010 ms
4/8 : Throughput = 525.525 MElements/s, Time = 31.925 ms
5/8 : Throughput = 442.132 MElements/s, Time = 37.946 ms
6/8 : Throughput = 361.177 MElements/s, Time = 46.452 ms
7/8 : Throughput = 305.574 MElements/s, Time = 54.904 ms
8/8 : Throughput = 271.861 MElements/s, Time = 61.712 ms
1/4+4: Throughput = 2345.812 MElements/s, Time = 7.152 ms
2/4+4: Throughput = 1173.353 MElements/s, Time = 14.299 ms
3/4+4: Throughput = 874.986 MElements/s, Time = 19.174 ms
4/4+4: Throughput = 609.576 MElements/s, Time = 27.523 ms
1/4+8: Throughput = 1737.907 MElements/s, Time = 9.654 ms
2/4+8: Throughput = 869.434 MElements/s, Time = 19.297 ms
3/4+8: Throughput = 577.189 MElements/s, Time = 29.067 ms
4/4+8: Throughput = 428.730 MElements/s, Time = 39.132 ms
1/8+4: Throughput = 1483.201 MElements/s, Time = 11.311 ms
2/8+4: Throughput = 743.381 MElements/s, Time = 22.569 ms
3/8+4: Throughput = 517.357 MElements/s, Time = 32.429 ms
4/8+4: Throughput = 378.284 MElements/s, Time = 44.351 ms
5/8+4: Throughput = 312.690 MElements/s, Time = 53.654 ms
6/8+4: Throughput = 258.132 MElements/s, Time = 64.995 ms
7/8+4: Throughput = 220.360 MElements/s, Time = 76.135 ms
8/8+4: Throughput = 194.696 MElements/s, Time = 86.171 ms
1/8+8: Throughput = 1261.976 MElements/s, Time = 13.294 ms
2/8+8: Throughput = 630.888 MElements/s, Time = 26.593 ms
3/8+8: Throughput = 421.866 MElements/s, Time = 39.769 ms
4/8+8: Throughput = 315.385 MElements/s, Time = 53.196 ms
5/8+8: Throughput = 256.326 MElements/s, Time = 65.453 ms
6/8+8: Throughput = 213.615 MElements/s, Time = 78.539 ms
7/8+8: Throughput = 183.989 MElements/s, Time = 91.186 ms
8/8+8: Throughput = 161.552 MElements/s, Time = 103.851 ms
1/4 : Throughput = 3532.966 MElements/s, Time = 4.749 ms
2/4 : Throughput = 1765.983 MElements/s, Time = 9.500 ms
3/4 : Throughput = 1298.415 MElements/s, Time = 12.921 ms
4/4 : Throughput = 921.279 MElements/s, Time = 18.211 ms
1/8 : Throughput = 1976.709 MElements/s, Time = 8.487 ms
2/8 : Throughput = 988.398 MElements/s, Time = 16.974 ms
3/8 : Throughput = 715.334 MElements/s, Time = 23.454 ms
4/8 : Throughput = 515.346 MElements/s, Time = 32.555 ms
5/8 : Throughput = 434.126 MElements/s, Time = 38.646 ms
6/8 : Throughput = 354.241 MElements/s, Time = 47.361 ms
7/8 : Throughput = 299.286 MElements/s, Time = 56.057 ms
8/8 : Throughput = 266.688 MElements/s, Time = 62.909 ms
1/4+4: Throughput = 2346.395 MElements/s, Time = 7.150 ms
2/4+4: Throughput = 1170.342 MElements/s, Time = 14.335 ms
3/4+4: Throughput = 868.724 MElements/s, Time = 19.312 ms
4/4+4: Throughput = 606.442 MElements/s, Time = 27.665 ms
1/4+8: Throughput = 1731.703 MElements/s, Time = 9.688 ms
2/4+8: Throughput = 868.007 MElements/s, Time = 19.328 ms
3/4+8: Throughput = 574.224 MElements/s, Time = 29.217 ms
4/4+8: Throughput = 425.807 MElements/s, Time = 39.401 ms
1/8+4: Throughput = 1447.258 MElements/s, Time = 11.592 ms
2/8+4: Throughput = 725.238 MElements/s, Time = 23.133 ms
3/8+4: Throughput = 506.463 MElements/s, Time = 33.126 ms
4/8+4: Throughput = 370.368 MElements/s, Time = 45.299 ms
5/8+4: Throughput = 306.365 MElements/s, Time = 54.762 ms
6/8+4: Throughput = 252.914 MElements/s, Time = 66.336 ms
7/8+4: Throughput = 215.716 MElements/s, Time = 77.774 ms
8/8+4: Throughput = 190.831 MElements/s, Time = 87.917 ms
1/8+8: Throughput = 1255.114 MElements/s, Time = 13.367 ms
2/8+8: Throughput = 627.455 MElements/s, Time = 26.739 ms
3/8+8: Throughput = 418.790 MElements/s, Time = 40.061 ms
4/8+8: Throughput = 312.513 MElements/s, Time = 53.685 ms
5/8+8: Throughput = 254.506 MElements/s, Time = 65.921 ms
6/8+8: Throughput = 212.192 MElements/s, Time = 79.066 ms
7/8+8: Throughput = 183.198 MElements/s, Time = 91.579 ms
8/8+8: Throughput = 160.342 MElements/s, Time = 104.634 ms
```
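
The two columns in the table are linked by simple arithmetic, which makes it easy to sanity-check a row or compare two runs. The sketch below assumes that "16M elements" means 2^24 = 16,777,216 elements and that "MElements" means 10^6 elements. Both are inferences from the printed numbers, not something the README states, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Throughput in millions of elements per second for a run that sorted
// `elements` items in `time_ms` milliseconds:
//   elements / (time_ms / 1000) / 1e6  ==  elements / (time_ms * 1000)
double throughput_melements_per_s(std::uint64_t elements, double time_ms) {
    assert(time_ms > 0.0);
    return static_cast<double>(elements) / (time_ms * 1000.0);
}

// Ratio of two timings for the same row; a value above 1 means `new_ms` is slower.
double time_ratio(double new_ms, double old_ms) {
    assert(old_ms > 0.0);
    return new_ms / old_ms;
}
```

For the `1/4` row, `throughput_melements_per_s(1ull << 24, 4.749)` lands within rounding error of the printed 3532.966 MElements/s. The first block of rows appears to be the removed x86/CUDA 7.5 results and the second the added x64/CUDA 8.0RC results. Under that reading, `time_ratio(18.211, 17.816)` for the `4/4` row is about 1.022, in line with the "few percents slower" remark.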