CIS565-Fall-2021 · sona1111 · Sep 22, 2021 · Sep 22, 2021
diff --git a/README.md b/README.md
@@ -3,12 +3,111 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Paul (San) Jewell
+  * [LinkedIn](https://www.linkedin.com/in/paul-jewell-2aba7379), [work website](
+    https://www.biociphers.org/paul-jewell-lab-member), [personal website](https://gitlab.com/inklabapp), [twitter](https://twitter.com/inklabapp), etc.
+* Tested on: (TODO) Linux pop-os 5.11.0-7614-generic, i7-9750H CPU @ 2.60GHz 32GB, GeForce GTX 1650 Mobile / Max-Q 4GB 
 
-### (TODO: Your README)
+### GPU scan and stream compaction demonstration
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This repository contains some comparisons of basic algorithms compared between a CPU serial implementation, my GPU CUDA implementation, 
+and similar functions provided by the 'thrust' library. 
 
+I demonstrate usage of the naive, buffered, work-efficient, and chunked work-efficient scan algorithms as described in 
+[this link](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda).
+The implementation is also broken down into a number of sub-functions for easier comprehension by the curious reader. 
+
+### Performance comparison overview
+
+After manually testing a number of block sizes, I've found that a size of 64 works best for the efficient scan 
+implementation. (same as for boids)and 512 (the largest I can do) works best for the naive implementation. (this 
+pretty much makes sense as it is a largely processor-count-limited algorithm). Thrust does not allow specifying a 
+manual block size that I could find so I didn't do any testing there. 
+
+![boidstats](img/chart1.png)
+
+For a first iteration, I was unable to run larger sizes without causing OOM on my local machine
+ (desktop completely crashed), so I tested up to 1mil only. Clearly something seems to have goofed in either 
+ a single blocking line in my efficient implementation, or the measurement code location. I don't believe a simple
+ inefficiency would lead to the huge discrepancy between naive and efficient implementations here. The no-chunking
+ implementation should do the same amount of operations for the up and down sweep as the chunking implementation (ideal),
+ however, it will get the incorrect result. There is something very inefficient with the chunking that I've not yet been 
+ able to work out.
+
+```
+  ****************
+  ** SCAN TESTS **
+  ****************
+      [  29  18  29  17  11   3  19  31   3  49  25  31  31 ...  14   0 ]
+  ==== cpu scan, power-of-two ====
+     elapsed time: 0.000686ms    (std::chrono Measured)
+      [  29  47  76  93 104 107 126 157 160 209 234 265 296 ... 23987 23987 ]
+  ==== cpu scan, non-power-of-two ====
+     elapsed time: 0.000377ms    (std::chrono Measured)
+      [  29  47  76  93 104 107 126 157 160 209 234 265 296 ... 23950 23960 ]
+      passed 
+  ==== naive scan, power-of-two ====
+     elapsed time: 0.060032ms    (CUDA Measured)
+      [  29  47  76  93 104 107 126 157 160 209 234 265 296 ... 23987 23987 ]
+  ==== naive scan, non-power-of-two ====
+     elapsed time: 0.055296ms    (CUDA Measured)
+      [  29  47  76  93 104 107 126 157 160 209 234 265 296 ... 23950 23960 ]
+      passed 
+  ==== work-efficient scan no chunk, power-of-two ====
+     elapsed time: 0.087648ms    (CUDA Measured)
+      [  29  47  76  93 104 107 126 157 160 209 234 265 296 ... 23987 23987 ]
+      passed 
+  ==== work-efficient scan no chunk, non-power-of-two ====
+     elapsed time: 0.041152ms    (CUDA Measured)
+      [  29  47  76  93 104 107 126 157 160 209 234 265 296 ... 23950 23960 ]
+      passed 
+  ==== work-efficient scan, power-of-two ====
+     elapsed time: 0.04512ms    (CUDA Measured)
+      [  29  47  76  93 104 107 126 157 160 209 234 265 296 ... 23987 23987 ]
+      passed 
+  ==== work-efficient scan, non-power-of-two ====
+     elapsed time: 0.058752ms    (CUDA Measured)
+      [  29  47  76  93 104 107 126 157 160 209 234 265 296 ... 23950 23960 ]
+      passed 
+  ==== thrust scan, power-of-two ====
+     elapsed time: 0.022528ms    (CUDA Measured)
+      [  29  47  76  93 104 107 126 157 160 209 234 265 296 ... 23987 23987 ]
+      passed 
+  ==== thrust scan, non-power-of-two ====
+     elapsed time: 0.021024ms    (CUDA Measured)
+      [  29  47  76  93 104 107 126 157 160 209 234 265 296 ... 23950 23960 ]
+      passed 
+
+  *****************************
+  ** STREAM COMPACTION TESTS **
+  *****************************
+      [   1   2   1   3   1   1   3   1   3   3   3   3   3 ...   0   0 ]
+      [   1   2   1   3   1   1   3   1   3   3   3   3   3 ...   2   2 ]
+  ==== cpu compact without scan, power-of-two ====
+     elapsed time: 0.002584ms    (std::chrono Measured)
+      [   1   2   1   3   1   1   3   1   3   3   3   3   3 ...   2   3 ]
+      passed 
+  ==== cpu compact without scan, non-power-of-two ====
+     elapsed time: 0.001953ms    (std::chrono Measured)
+      [   1   2   1   3   1   1   3   1   3   3   3   3   3 ...   2   2 ]
+      passed 
+  ==== cpu compact with scan ====
+     elapsed time: 0.004194ms    (std::chrono Measured)
+      [   1   2   1   3   1   1   3   1   3   3   3   3   3 ...   2   3 ]
+      passed 
+  ==== cpu compact with scan, non-power-of-two ====
+     elapsed time: 0.003411ms    (std::chrono Measured)
+      [   1   2   1   3   1   1   3   1   3   3   3   3   3 ...   2   2 ]
+      passed 
+  ==== work-efficient compact, power-of-two ====
+     elapsed time: 0.065216ms    (CUDA Measured)
+      [   1   2   1   3   1   1   3   1   3   3   3   3   3 ...   2   3 ]
+      passed 
+  ==== work-efficient compact, non-power-of-two ====
+     elapsed time: 0.063552ms    (CUDA Measured)
+      [   1   2   1   3   1   1   3   1   3   3   3   3   3 ...   2   2 ]
+      passed 
+  sh: 1: pause: not found
+
+  Process finished with exit code 0
+```
diff --git a/img/chart1.png b/img/chart1.png
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,7 +13,9 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+//const int SIZE = 1 << 8; // feel free to change the size of array
+//const int SIZE = 10000; // feel free to change the size of array
+const int SIZE = 1000;
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -29,6 +31,7 @@ int main(int argc, char* argv[]) {
 
     genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
     a[SIZE - 1] = 0;
+
     printArray(SIZE, a, true);
 
     // initialize b using StreamCompaction::CPU::scan you implement
@@ -51,48 +54,102 @@ int main(int argc, char* argv[]) {
     printDesc("naive scan, power-of-two");
     StreamCompaction::Naive::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
-    printCmpResult(SIZE, b, c);
-
-    /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
-    onesArray(SIZE, c);
-    printDesc("1s array for finding bugs");
-    StreamCompaction::Naive::scan(SIZE, c, a);
-    printArray(SIZE, c, true); */
+    printArray(SIZE, c, true);
 
-    zeroArray(SIZE, c);
+    zeroArray(NPOT, c);
     printDesc("naive scan, non-power-of-two");
     StreamCompaction::Naive::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(NPOT, c, true);
+    printCmpResult(NPOT, b, c);
+
+
+     // For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
+
+//  onesArray(SIZE, a);
+//  zeroArray(SIZE, c);
+//  printDesc("1s array for finding bugs");
+//  StreamCompaction::Naive::scan(NPOT, c, a);
+//  printArray(NPOT, c);
+
+//  onesArray(NPOT, a);
+//  zeroArray(NPOT, c);
+//  printDesc("1s array for finding bugs");
+//  StreamCompaction::Naive::scan(NPOT, c, a);
+//  printArray(NPOT, c);
+//
+//  onesArray(SIZE, a);
+//  zeroArray(SIZE, c);
+//  printDesc("1s array for finding bugs");
+//  StreamCompaction::Efficient::scan(SIZE, c, a);
+//  printArray(SIZE, c);
+
+
+//  onesArray(3, a);
+//  a[0] = 4;
+//  a[1] = 4;
+//  a[2] = 1;
+//  zeroArray(3, c);
+//  printDesc("1s array for finding bugs");
+//  StreamCompaction::Efficient::scanplain(3, c, a);
+//  printArray(3, c);
+//
+//     onesArray(SIZE, a);
+//    zeroArray(SIZE, c);
+//    printDesc("1s array for finding bugs");
+//    StreamCompaction::Efficient::scan(SIZE, c, a);
+//    printArray(SIZE, c);
+//
+//  onesArray(SIZE, a);
+//  zeroArray(SIZE, c);
+//  printDesc("1s array for finding bugs");
+//  StreamCompaction::Efficient::scan(SIZE, c, a);
+//  printArray(SIZE, c);
+
+    zeroArray(SIZE, c);
+    printDesc("work-efficient scan no chunk, power-of-two");
+    StreamCompaction::Efficient::scanplain(SIZE, c, a);
+    exclusive2inclusive(SIZE, c, a[SIZE-1]);
+    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    printArray(SIZE, c, true);
+    printCmpResult(SIZE, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("work-efficient scan no chunk, non-power-of-two");
+    StreamCompaction::Efficient::scanplain(NPOT, c, a);
+    exclusive2inclusive(NPOT, c, a[NPOT-1]);
+    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, power-of-two");
     StreamCompaction::Efficient::scan(SIZE, c, a);
+    exclusive2inclusive(SIZE, c, a[SIZE-1]);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, non-power-of-two");
     StreamCompaction::Efficient::scan(NPOT, c, a);
+    exclusive2inclusive(NPOT, c, a[NPOT-1]);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, power-of-two");
     StreamCompaction::Thrust::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, non-power-of-two");
     StreamCompaction::Thrust::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     printf("\n");
@@ -104,7 +161,16 @@ int main(int argc, char* argv[]) {
 
     genArray(SIZE - 1, a, 4);  // Leave a 0 at the end to test that edge case
     a[SIZE - 1] = 0;
+  //a[NPOT - 1] = 0;
+//
+//  a[0] = 0;
+//  a[1] = 2;
+//  a[2] = 1;
+//  a[3] = 2;
+//  a[4] = 1;
+
     printArray(SIZE, a, true);
+    printArray(NPOT, a, true);
 
     int count, expectedCount, expectedNPOT;
 
@@ -133,18 +199,25 @@ int main(int argc, char* argv[]) {
     printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
+    zeroArray(SIZE, c);
+    printDesc("cpu compact with scan, non-power-of-two");
+    count = StreamCompaction::CPU::compactWithScan(NPOT, c, a);
+    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+    printArray(count, c, true);
+    printCmpLenResult(count, expectedNPOT, b, c);
+
     zeroArray(SIZE, c);
     printDesc("work-efficient compact, power-of-two");
     count = StreamCompaction::Efficient::compact(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
-    zeroArray(SIZE, c);
+    zeroArray(NPOT, c);
     printDesc("work-efficient compact, non-power-of-two");
     count = StreamCompaction::Efficient::compact(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
     system("pause"); // stop Win32 console from closing on exit

diff --git a/stream_compaction/common.h b/stream_compaction/common.h
@@ -30,6 +30,13 @@ inline int ilog2ceil(int x) {
     return x == 1 ? 0 : ilog2(x - 1) + 1;
 }
 
+inline void exclusive2inclusive(int n, int* scanned, int lastvalue){
+  for(int i=0; i<n-1; i++){
+    scanned[i] = scanned[i+1];
+  }
+  scanned[n-1] = scanned[n-2] + lastvalue;
+}
+
 namespace StreamCompaction {
     namespace Common {
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata);