Project 2: Stephen Lee #7

Open · wants to merge 9 commits into base: main
147 changes: 141 additions & 6 deletions README.md
@@ -3,12 +3,147 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Stephen Lee
* [LinkedIn](https://www.linkedin.com/in/stephen-lee-bb5a40163/)
* Tested on: Windows 10, i7-9750H @2.60GHz, RTX 2060 6GB (personal laptop)

### (TODO: Your README)
# Project Overview
The goal of this project was to analyze the performance of some useful parallel algorithms, chief among them `scan`, which computes a prefix sum on an input array. More about prefix sums, as well as the algorithms presented in this project, can be read [here](https://en.wikipedia.org/wiki/Prefix_sum). We iterated on this scan implementation, creating a serial CPU version, a naive parallel GPU version, and a work-efficient parallel GPU version. We then applied `scan` to compute other useful array operations such as `compaction`, which is described in more detail [here](https://stackoverflow.com/questions/34059753/cuda-stream-compaction-algorithm).
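
As a quick illustration of what these two operations compute, consider a small hypothetical input (these numbers are not from the test program):

```
input:                     [ 3  1  7  0  4  1  6  3 ]
exclusive scan:            [ 0  3  4 11 11 15 16 22 ]
compaction (drop zeros):   [ 3  1  7  4  1  6  3 ]
```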

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### Features
* CPU `scan` and `compaction`
* Naive GPU implementation of `scan`
* Work-efficient GPU implementation of `scan` and `compaction`
* Wrapper for testing Thrust's `scan` implementation

# Performance Analysis
Performance was measured using the provided GPU and CPU timing classes, which report each algorithm's runtime in milliseconds (ms); a brief sketch of this style of GPU timing follows the abbreviation list below. Performance analysis began by determining the optimal block size to use for the rest of the testing. A few abbreviations are used in the graphs that follow to keep them readable:

* WE = work-efficient
* POT = input size that is a power of 2
* NPOT = input size that is not a power of 2
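
The GPU numbers come from `getGpuElapsedTimeForPreviousOperation()` on the provided `PerformanceTimer`. A minimal sketch of how event-based GPU timing of this kind typically works (an assumption about the provided class, not code from this repository):

```
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... launch the kernels being timed ...
cudaEventRecord(stop);
cudaEventSynchronize(stop);               // wait for the timed work to finish

float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```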


### Finding an optimal block size
Block size refers to the number of threads per block in the launch parameters of the GPU kernels, so only the GPU algorithms implemented in this project (naive and work-efficient) were tested here. A fixed input size of ~64,000,000 elements was used as a reasonably large input that should favor the GPU implementations. Performance was measured for block sizes that are powers of 2, to map well onto the underlying hardware, ranging from 4 to 1024 threads per block. The results are summarized in the graph below:

<img src="img/blocks.PNG">

From this analysis, it was found that a block size of 128 threads per block made the most sense for all of our GPU algorithms on this test setup.
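
For reference, the kernel launch parameters are derived from this block size by rounding the grid size up so that every element gets a thread. A minimal sketch (`blockSize` is the 128-thread constant defined in `common.h`; the kernel name in the comment is just a placeholder):

```
// Presumably every kernel in this project is launched with this pattern:
dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);   // ceil(n / blockSize) blocks
// kernSomeKernel<<<fullBlocksPerGrid, blockSize>>>(n, ...);
```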

### Comparing scan algorithms
Once an optimal block size was determined, it was used to collect data on how the naive and work-efficient GPU implementations measure up against both the CPU version and the Thrust library, which has its own GPU implementation. The results for `scan` are summarized in the graph below:

<img src="img/scan.PNG">

It was found that input sizes below 1,000,000 elements tended to favor the CPU implementation over any of the GPU implementations. This is sensible, since the benefits of GPU programming only appear once inputs get large enough to leverage many parallel computations. As such, I have only plotted data for input sizes above 1,000,000, since smaller inputs were not particularly interesting. The naive implementation of `scan` tended to perform better than the work-efficient implementation up until 32,000,000 elements, at which point the work-efficient method took over. In both cases, Thrust's implementation is far superior, taking much less time than either of my implementations.

This difference in performance can likely be attributed to a memory I/O bottleneck: both the naive and work-efficient implementations do many reads and writes to global memory, and there is little other computation to hide that latency. Regarding the differences observed between the naive and work-efficient implementations, this is likely due to their asymptotic behavior: the naive scan performs on the order of n·log n additions while the work-efficient scan performs on the order of n, but the work-efficient version launches roughly twice as many kernel passes (up-sweep plus down-sweep), so its advantage only shows up at larger input sizes.
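
To make the global-memory traffic concrete, here is a rough sketch of a single pass of a naive (Hillis-Steele) scan. This is illustrative only; the actual `StreamCompaction::Naive::scan` is not shown in this diff and may differ in details such as buffer ping-ponging and the final shift that makes the result exclusive:

```
__global__ void kernNaiveScanPass(int n, int offset, int *odata, const int *idata) {
    int index = threadIdx.x + (blockDim.x * blockIdx.x);
    if (index >= n) return;

    // Every element is read from and written to global memory on every pass.
    if (index >= offset) {
        odata[index] = idata[index - offset] + idata[index];
    } else {
        odata[index] = idata[index];
    }
}

// Host side: log2(n) passes, swapping the input and output buffers between passes.
// for (int offset = 1; offset < n; offset *= 2) {
//     kernNaiveScanPass<<<fullBlocksPerGrid, blockSize>>>(n, offset, dev_out, dev_in);
//     std::swap(dev_in, dev_out);
// }
```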

### Comparing compaction algorithms
After analyzing the performance of `scan`, the `compaction` algorithm was used as an application of what we had just implemented. In this case, we compare the performance of a straightforward CPU approach, a CPU approach that uses `scan` as an intermediate computation, and a GPU approach that uses our work-efficient implementation. As with the scan comparison, the GPU did not really pull ahead of the CPU until about 1,000,000 elements, so the graph below starts at that point.

<img src="img/compaction.PNG">

Here we see similar trends: the gap between the CPU versions and the work-efficient GPU version grew with increasing input size (i.e., the lines diverge at larger inputs). Since this algorithm is largely an extension of `scan` with a few extra kernels, it makes sense that we see a very similar trend here.
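
For reference, the work-efficient GPU compaction is a map -> scan -> scatter pipeline built from the kernels added in `common.cu`. A rough sketch of how `StreamCompaction::Efficient::compact` presumably ties them together (device buffer names are illustrative, error checking is omitted, and the real implementation may handle host/device copies differently):

```
dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);

// 1) Map each element to a 0/1 keep flag.
kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bools, dev_idata);

// 2) Exclusive scan of the flags gives each kept element its output index.
scan(n, dev_indices, dev_bools);

// 3) Scatter kept elements to their scanned destinations.
kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);

// Compacted length = last scan value + last flag (copied back to the host).
```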

Here's a sample output with 131072 elements to be processed:
```
****************
** SCAN TESTS **
****************
[ 8 20 33 38 32 34 23 5 38 7 22 39 15 ... 24 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.1906ms (std::chrono Measured)
[ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206154 3206178 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.1897ms (std::chrono Measured)
[ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206075 3206103 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.346464ms (CUDA Measured)
[ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206154 3206178 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.249376ms (CUDA Measured)
[ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 0 0 ]
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.535904ms (CUDA Measured)
[ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206154 3206178 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.49104ms (CUDA Measured)
[ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206075 3206103 ]
passed
==== thrust scan, power-of-two ====
elapsed time: 0.4896ms (CUDA Measured)
[ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206154 3206178 ]
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.416352ms (CUDA Measured)
[ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206075 3206103 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 2 1 2 2 0 1 3 0 1 0 3 1 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.3009ms (std::chrono Measured)
[ 2 2 1 2 2 1 3 1 3 1 2 3 3 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.2984ms (std::chrono Measured)
[ 2 2 1 2 2 1 3 1 3 1 2 3 3 ... 2 3 ]
passed
==== cpu compact with scan ====
elapsed time: 0.7242ms (std::chrono Measured)
[ 2 2 1 2 2 1 3 1 3 1 2 3 3 ... 2 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.915008ms (CUDA Measured)
[ 2 2 1 2 2 1 3 1 3 1 2 3 3 ... 2 3 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.89984ms (CUDA Measured)
[ 2 2 1 2 2 1 3 1 3 1 2 3 3 ... 2 3 ]
passed
```

# Bloopers
The greatest issue I ran into while working on this project was related to the down-sweep in my work-efficient scan. The bounds on the loop controlling the down-sweep kernel invocations were initially wrong:

```
cudaMemset(&dev_data[pow2 - 1], 0, sizeof(int));
for (int d = maxDepth; d >= 0; d--) { // this should be d = maxDepth - 1 instead
kernDownSweep << <fullBlocksPerGrid, blockSize >> > (pow2, d, dev_data);
}
```
This caused illegal memory accesses in the kernel, which I found surprising since I didn't learn about the error until the kernels had finished and I went to free the memory. I initially hadn't checked for errors after freeing, so the problem would spill over into the next call's malloc. Even weirder, when I ran the code in the Nsight debugger, it would set an automatic breakpoint in the kernel and the program would then compute the correct result. I'm not really sure why this happened, but there's definitely something I'm missing here.
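
For completeness, here is roughly what the corrected loop looks like. This assumes `maxDepth` is the log2 of the padded array size `pow2` and that `kernDownSweep` at depth `d` touches indices of the form `k + 2^(d+1) - 1` (the standard down-sweep formulation); the kernel itself is not part of this diff, so treat this as a sketch:

```
cudaMemset(&dev_data[pow2 - 1], 0, sizeof(int));   // seed the root with the identity
for (int d = maxDepth - 1; d >= 0; d--) {
    // At d = maxDepth, the stride 2^(d+1) would reach index 2 * pow2 - 1,
    // past the end of the pow2-sized buffer, which explains the illegal access.
    kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(pow2, d, dev_data);
}
```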

Here's the output from the buggy run:
```
****************
** SCAN TESTS **
****************
[ 34 24 8 17 15 39 31 25 19 9 26 19 26 ... 8 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.7537ms (std::chrono Measured)
[ 0 34 58 66 83 98 137 168 193 212 221 247 266 ... 12833463 12833471 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.7565ms (std::chrono Measured)
[ 0 34 58 66 83 98 137 168 193 212 221 247 266 ... 12833421 12833452 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.510144ms (CUDA Measured)
[ 0 34 58 66 83 98 137 168 193 212 221 247 266 ... 12833463 12833471 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.507712ms (CUDA Measured)
[ 0 34 58 66 83 98 137 168 193 212 221 247 266 ... 0 0 ]
passed
==== work-efficient scan, power-of-two ====
CUDA error: cudaFree failed on dev_data!: an illegal memory access was encountered
```

Binary file added data.xlsx
Binary file not shown.
Binary file added img/blocks.PNG
Binary file not shown.
Binary file added img/compaction.PNG
Binary file not shown.
Binary file added img/scan.PNG
Binary file not shown.
18 changes: 9 additions & 9 deletions src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 17; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -51,7 +51,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -64,35 +64,35 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
@@ -137,14 +137,14 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
12 changes: 9 additions & 3 deletions stream_compaction/common.cu
@@ -1,4 +1,5 @@
#include "common.h"
#include "device_launch_parameters.h"

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
cudaError_t err = cudaGetLastError();
@@ -23,7 +24,10 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockDim.x * blockIdx.x);
if (index >= n) return;

bools[index] = idata[index] != 0 ? 1 : 0;
}

/**
@@ -32,8 +36,10 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
}
int index = threadIdx.x + (blockDim.x * blockIdx.x);
if (index >= n) return;

if (bools[index] == 1) odata[indices[index]] = idata[index];
}
}
}
11 changes: 11 additions & 0 deletions stream_compaction/common.h
@@ -12,6 +12,7 @@

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
#define blockSize 128

/**
* Check for CUDA errors; print and exit if there was a problem.
@@ -108,6 +109,16 @@ namespace StreamCompaction {
return prev_elapsed_time_gpu_milliseconds;
}

bool getCpuTimerStarted()
{
return cpu_timer_started;
}

bool getGpuTimerStarted()
{
return gpu_timer_started;
}

// remove copy and move functions
PerformanceTimer(const PerformanceTimer&) = delete;
PerformanceTimer(PerformanceTimer&&) = delete;
41 changes: 34 additions & 7 deletions stream_compaction/cpu.cu
@@ -1,4 +1,5 @@
#include <cstdio>
#include <stdio.h>
#include "cpu.h"

#include "common.h"
@@ -18,9 +19,18 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
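// Only manage the CPU timer here when no outer call already started it;
// compactWithScan starts the timer itself and then calls scan as a subroutine.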
bool localStartCall = false;
if (!timer().getCpuTimerStarted()) {
timer().startCpuTimer();
localStartCall = true;
}

odata[0] = 0; // identity for exclusive scan
for (int i = 1; i < n; i++) {
odata[i] = odata[i - 1] + idata[i - 1];
}

if (localStartCall) timer().endCpuTimer();
}

/**
@@ -30,9 +40,15 @@
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int numElts = 0;
for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[numElts] = idata[i];
numElts++;
}
}
timer().endCpuTimer();
return -1;
return numElts;
}

/**
@@ -42,9 +58,20 @@
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int *boolArray = new int[n];
int *scanResult = new int[n];
int numElts = 0;

for (int i = 0; i < n; i++) boolArray[i] = (idata[i] != 0) ? 1 : 0;
scan(n, scanResult, boolArray);
for (int i = 0; i < n; i++) {
if (boolArray[i] == 1) {
odata[scanResult[i]] = idata[i];
numElts++;
}
}
delete[] boolArray;   // free the temporary host buffers (fixes a small leak)
delete[] scanResult;
timer().endCpuTimer();
return -1;
return numElts;
}
}
}