Updated radix-sort benchmark with CUDA8 RC results

Bulat-Ziganshin · Jun 18, 2016 · 2595dc6
1 parent 1d362aa

Showing 6 changed files with 252 additions and 144 deletions.
diff --git a/algo_st/README.md b/algo_st/README.md
@@ -9,8 +9,8 @@ Further optimizations:
 
 Use some combination of the following ideas to shave off remaining times over 65 ms
 - overload pre/post-sorting procedures and RLE compression with memcpy
-- process only 4-byte elements at last sorting stages, and simultaneously copy-in next block to process - 
-4-byte sorting should also be faster than sorting of 4+4 (key+value) bytes (43 ms)
+- process only 4-byte elements at last sorting stages, and simultaneously copy-in next block to process -
+4-byte sorting should also be faster than sorting of 4+4 (key+value) bytes (43 ms total instead of 65 ms)
 - use zero-copy memory instead of copy in/out
 
 So, after all optimizations, ST4 should become more than 3x faster!
diff --git a/app_bslab/README.md b/app_bslab/README.md
@@ -6,6 +6,7 @@
 [results-cpu.txt]:   results-cpu.txt
 [results-cuda.txt]:  results-cuda.txt
 [profile.txt]:       profile.txt
+[bench.cmd]:         bench.cmd
 
 
 BSLab stands for the block-sorting laboratory.
@@ -70,6 +71,8 @@ Also, we see that
 - [results-cuda.txt] are my CUDA GPU results
 - [profile.txt] are my CUDA GPU profiling report (only MTF kernels are included)
 
+See [bench.cmd] for benchmarking/profiling cmdlines.
+
 
 ### x64: enwik9 results on Haswell i7-4770
 ```

diff --git a/app_bslab/bench.cmd b/app_bslab/bench.cmd
@@ -1,2 +1,3 @@
 for %a in (boost e8 e9 100m 1g 1g.tor3) do for %x in (-x64-avx2.exe -x64.exe -avx2.exe .exe) do for %c in (icl clang gcc msvc) do bslab-%c%x z:\%a
 for %a in (boost e8 e9 100m 1g 1g.tor3) do for %e in (bslab-cuda-x64.exe bslab-cuda.exe) do %e -nogpu z:\%a
+"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvprof.exe" --events all --metrics all  --log-file profile --replay-mode application bslab-cuda-x64.exe -bwt11 -lzp3 -mtf-1-4 z:\e8
diff --git a/app_bslab/bslab.cpp b/app_bslab/bslab.cpp
@@ -127,7 +127,7 @@ int main (int argc, char **argv)
         bufsize <<= 20;  // if value is small enough, consider it as mebibytes
 
     if (!(argc==2 || argc==3) || error) {
-        printf ("BSL: the block-sorting lab.  Part of https://github.com/Bulat-Ziganshin/Compression-Research\n"
+        printf ("BSL: the block-sorting lab 1.0 (June 18 2016).  Part of https://github.com/Bulat-Ziganshin/Compression-Research\n"
                 "Usage: bsl [options] infile [outfile]\n"
                 "  -bN      buffer N (mebi)bytes (default %d MiB - reduce if program fails)\n"
                 "  -nogpu   skip GPU name output\n"

diff --git a/app_radix_sort/README.md b/app_radix_sort/README.md
@@ -10,52 +10,53 @@ First column has the format "N/K" for sorting K-byte keys by N bytes.
 It has format "N/K+V" for sorting with extra V-byte values attached to the keys.
 
 
-### Current x86 results with CUDA 7.5 and CUB 1.5.2
+### x64 results with CUDA 8.0RC and CUB 1.5.2
 
-(x64 version is a few percents slower due to need to manage larger pointers)
+(x64 version is a few percents slower than x86 due to need to manage larger pointers). 
+For full results see [results.txt](results.txt).
 
 ```
 GeForce GTX 560 Ti, CC 2.1.  VRAM 1.0 GB, 2004 MHz * 256-bit = 128 GB/s.  8 SM * 48 alu * 1800 MHz * 2 = 1.38 TFLOPS
 Sorting 16M elements:
-1/4  : Throughput = 3630.096 MElements/s, Time = 4.622 ms
-2/4  : Throughput = 1807.721 MElements/s, Time = 9.281 ms
-3/4  : Throughput = 1325.778 MElements/s, Time = 12.655 ms
-4/4  : Throughput =  941.682 MElements/s, Time = 17.816 ms
-
-1/8  : Throughput = 2033.248 MElements/s, Time = 8.251 ms
-2/8  : Throughput = 1013.995 MElements/s, Time = 16.546 ms
-3/8  : Throughput =  729.117 MElements/s, Time = 23.010 ms
-4/8  : Throughput =  525.525 MElements/s, Time = 31.925 ms
-5/8  : Throughput =  442.132 MElements/s, Time = 37.946 ms
-6/8  : Throughput =  361.177 MElements/s, Time = 46.452 ms
-7/8  : Throughput =  305.574 MElements/s, Time = 54.904 ms
-8/8  : Throughput =  271.861 MElements/s, Time = 61.712 ms
-
-1/4+4: Throughput = 2345.812 MElements/s, Time = 7.152 ms
-2/4+4: Throughput = 1173.353 MElements/s, Time = 14.299 ms
-3/4+4: Throughput =  874.986 MElements/s, Time = 19.174 ms
-4/4+4: Throughput =  609.576 MElements/s, Time = 27.523 ms
-
-1/4+8: Throughput = 1737.907 MElements/s, Time = 9.654 ms
-2/4+8: Throughput =  869.434 MElements/s, Time = 19.297 ms
-3/4+8: Throughput =  577.189 MElements/s, Time = 29.067 ms
-4/4+8: Throughput =  428.730 MElements/s, Time = 39.132 ms
-
-1/8+4: Throughput = 1483.201 MElements/s, Time = 11.311 ms
-2/8+4: Throughput =  743.381 MElements/s, Time = 22.569 ms
-3/8+4: Throughput =  517.357 MElements/s, Time = 32.429 ms
-4/8+4: Throughput =  378.284 MElements/s, Time = 44.351 ms
-5/8+4: Throughput =  312.690 MElements/s, Time = 53.654 ms
-6/8+4: Throughput =  258.132 MElements/s, Time = 64.995 ms
-7/8+4: Throughput =  220.360 MElements/s, Time = 76.135 ms
-8/8+4: Throughput =  194.696 MElements/s, Time = 86.171 ms
-
-1/8+8: Throughput = 1261.976 MElements/s, Time = 13.294 ms
-2/8+8: Throughput =  630.888 MElements/s, Time = 26.593 ms
-3/8+8: Throughput =  421.866 MElements/s, Time = 39.769 ms
-4/8+8: Throughput =  315.385 MElements/s, Time = 53.196 ms
-5/8+8: Throughput =  256.326 MElements/s, Time = 65.453 ms
-6/8+8: Throughput =  213.615 MElements/s, Time = 78.539 ms
-7/8+8: Throughput =  183.989 MElements/s, Time = 91.186 ms
-8/8+8: Throughput =  161.552 MElements/s, Time = 103.851 ms
+1/4  : Throughput = 3532.966 MElements/s, Time = 4.749 ms
+2/4  : Throughput = 1765.983 MElements/s, Time = 9.500 ms
+3/4  : Throughput = 1298.415 MElements/s, Time = 12.921 ms
+4/4  : Throughput =  921.279 MElements/s, Time = 18.211 ms
+
+1/8  : Throughput = 1976.709 MElements/s, Time = 8.487 ms
+2/8  : Throughput =  988.398 MElements/s, Time = 16.974 ms
+3/8  : Throughput =  715.334 MElements/s, Time = 23.454 ms
+4/8  : Throughput =  515.346 MElements/s, Time = 32.555 ms
+5/8  : Throughput =  434.126 MElements/s, Time = 38.646 ms
+6/8  : Throughput =  354.241 MElements/s, Time = 47.361 ms
+7/8  : Throughput =  299.286 MElements/s, Time = 56.057 ms
+8/8  : Throughput =  266.688 MElements/s, Time = 62.909 ms
+
+1/4+4: Throughput = 2346.395 MElements/s, Time = 7.150 ms
+2/4+4: Throughput = 1170.342 MElements/s, Time = 14.335 ms
+3/4+4: Throughput =  868.724 MElements/s, Time = 19.312 ms
+4/4+4: Throughput =  606.442 MElements/s, Time = 27.665 ms
+
+1/4+8: Throughput = 1731.703 MElements/s, Time = 9.688 ms
+2/4+8: Throughput =  868.007 MElements/s, Time = 19.328 ms
+3/4+8: Throughput =  574.224 MElements/s, Time = 29.217 ms
+4/4+8: Throughput =  425.807 MElements/s, Time = 39.401 ms
+
+1/8+4: Throughput = 1447.258 MElements/s, Time = 11.592 ms
+2/8+4: Throughput =  725.238 MElements/s, Time = 23.133 ms
+3/8+4: Throughput =  506.463 MElements/s, Time = 33.126 ms
+4/8+4: Throughput =  370.368 MElements/s, Time = 45.299 ms
+5/8+4: Throughput =  306.365 MElements/s, Time = 54.762 ms
+6/8+4: Throughput =  252.914 MElements/s, Time = 66.336 ms
+7/8+4: Throughput =  215.716 MElements/s, Time = 77.774 ms
+8/8+4: Throughput =  190.831 MElements/s, Time = 87.917 ms
+
+1/8+8: Throughput = 1255.114 MElements/s, Time = 13.367 ms
+2/8+8: Throughput =  627.455 MElements/s, Time = 26.739 ms
+3/8+8: Throughput =  418.790 MElements/s, Time = 40.061 ms
+4/8+8: Throughput =  312.513 MElements/s, Time = 53.685 ms
+5/8+8: Throughput =  254.506 MElements/s, Time = 65.921 ms
+6/8+8: Throughput =  212.192 MElements/s, Time = 79.066 ms
+7/8+8: Throughput =  183.198 MElements/s, Time = 91.579 ms
+8/8+8: Throughput =  160.342 MElements/s, Time = 104.634 ms
 ```