Project 2: Aditya Hota #6

Open, wants to merge 6 commits into main
101 changes: 95 additions & 6 deletions README.md
@@ -3,12 +3,101 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Aditya Hota
* [LinkedIn](https://www.linkedin.com/in/aditya-hota)
* Tested on: Windows 11, i7-8750H @ 2.20 GHz, 20 GB RAM, GTX 1050 Ti with Max-Q Design 6 GB (personal laptop)

# Overview
This project involved implementing two algorithms, scan and stream compaction, in parallel on the GPU. This allows us to process a large set of data across many GPU threads at once, rather than relying on the sequential nature of the CPU.

## Features
* A naive parallel scan algorithm implemented in CUDA
* A work-efficient parallel scan algorithm implemented in CUDA
* A stream compaction algorithm which leverages the scan algorithm to remove null data elements from an array
* CPU implementations of the scan and stream compaction algorithms to provide a baseline for comparison

# Sample Output
The program generates two arrays: one with a power-of-two length and one slightly smaller. The scan algorithm is then run on both arrays. Lastly, the arrays are compacted by removing any zeroes; the scan results give the indices at which the non-zero values should be written in the final output array.
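To make the index bookkeeping concrete, below is a minimal host-side sketch of scan-then-scatter compaction on a tiny, made-up array. It mirrors the description above rather than the project's CUDA path; the array contents and names are for illustration only.

```
#include <cstdio>

int main() {
    const int n = 8;
    int idata[n] = { 3, 0, 5, 0, 0, 2, 7, 0 };
    int bools[n];    // 1 where the element should be kept
    int indices[n];  // exclusive scan of bools = output position of each kept element
    int odata[n];

    // Map to booleans: non-zero -> 1, zero -> 0.
    for (int i = 0; i < n; i++) {
        bools[i] = (idata[i] != 0) ? 1 : 0;
    }

    // Exclusive prefix sum (scan) over the booleans.
    indices[0] = 0;
    for (int i = 1; i < n; i++) {
        indices[i] = indices[i - 1] + bools[i - 1];
    }

    // Scatter: each kept element goes to the slot named by its scan result.
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (bools[i] == 1) {
            odata[indices[i]] = idata[i];
            count++;
        }
    }

    // Prints "3 5 2 7 (count = 4)".
    for (int i = 0; i < count; i++) {
        printf("%d ", odata[i]);
    }
    printf("(count = %d)\n", count);
    return 0;
}
```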

Below is the sample output for a size 1024 array.

```
****************
** SCAN TESTS **
****************
[ 35 34 18 10 18 0 9 37 11 41 35 17 1 ... 2 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.011ms (std::chrono Measured)
[ 0 35 69 87 97 115 115 124 161 172 213 248 265 ... 100120 100122 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0109ms (std::chrono Measured)
[ 0 35 69 87 97 115 115 124 161 172 213 248 265 ... 100031 100079 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.562432ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.48176ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.32256ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.32768ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.663552ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.059392ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 3 3 2 2 2 0 1 3 3 1 1 0 ... 1 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.0173ms (std::chrono Measured)
[ 2 3 3 2 2 2 1 3 3 1 1 2 2 ... 2 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.021ms (std::chrono Measured)
[ 2 3 3 2 2 2 1 3 3 1 1 2 2 ... 2 2 ]
passed
==== cpu compact with scan ====
elapsed time: 0.0583ms (std::chrono Measured)
[ 2 3 3 2 2 2 1 3 3 1 1 2 2 ... 2 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.646144ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.377856ms (CUDA Measured)
passed
```

# Performance Analysis
Several versions of the scan algorithm were implemented: CPU, naive, work-efficient, and Thrust. The CPU implementation sequentially performs the scan, and a scatter step then removes any zero elements. The naive GPU implementation adds pairs of partial sums at increasing offsets over log n passes until every prefix sum is complete. The work-efficient algorithm uses an up-sweep (reduction) followed by a down-sweep, which performs asymptotically less work and should theoretically result in improved performance.
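The naive scan source is not part of the diff shown further down, so here is a hedged sketch of one pass of a naive (Hillis-Steele style) scan kernel; the kernel name, signature, and the ping-pong buffering described in the comments are assumptions rather than this repository's exact code.

```
// One pass of a naive (Hillis-Steele) scan: each element adds the partial
// sum `offset` positions to its left. The kernel name and signature are
// assumptions, not necessarily what naive.cu in this project uses.
__global__ void kernNaiveScanPass(int n, int offset, int *odata, const int *idata) {
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (index >= n)
    {
        return;
    }

    if (index >= offset)
    {
        odata[index] = idata[index - offset] + idata[index];
    }
    else
    {
        odata[index] = idata[index];
    }
}

// Host side (sketch): launch ilog2ceil(n) passes with offset = 1, 2, 4, ...,
// swapping (ping-ponging) the input and output buffers between passes. This
// yields an inclusive scan; shifting the result right by one element and
// inserting a leading zero converts it to the exclusive scan reported above.
```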

The best performance was achieved using a block size of 128 on my GPU; this gave me the smallest runtime for both the naive and work-efficient implementations of the scan algorithms. I tried powers of 2 ranging from 64 to 1024.

From the graphs shown below, we can see that we did not attain the improvement from the work-efficient algorithm that we had hoped for. In fact, the work-efficient scan performs only marginally better than the naive algorithm and worse than the CPU implementation. The Thrust implementation performs the best, but this is expected because it applies many optimizations, such as keeping otherwise-idle threads busy and using shared memory. Please note that the graphs are log-log plots, and that `npo2` in the legend refers to the non-power-of-two test case.

<br>
<img src="img/ScanResults.png" width="750">

<br>
<img src="img/StreamResults.png" width="750">

The performance hit for the naive and work-efficient algorithms is largely due to global memory accesses in the up-sweep and down-sweep steps. Every step reads and writes the global data arrays passed into the kernels, which is very inefficient because global memory is the slowest memory to access. An optimization would be to use shared memory (similar to the matrix-multiply example) along with data prefetching. Since we are performing only addition operations, the performance hit is not due to compute instructions.
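As a sketch of that suggestion, the kernel below scans a single block-sized tile entirely in shared memory, following the textbook Blelloch pattern (GPU Gems 3) rather than code from this repository. It assumes the tile length `n` is a power of two, that each thread handles two elements, and it omits bank-conflict padding as well as the extra pass needed to stitch per-tile results together for larger arrays.

```
// Work-efficient scan of one tile held in shared memory (Blelloch pattern).
// Illustrative only: not part of this repository, and limited to a single
// tile of power-of-two length n processed by n / 2 threads.
__global__ void kernScanTileShared(int n, int *odata, const int *idata) {
    extern __shared__ int temp[];  // one tile staged in shared memory
    int tid = threadIdx.x;
    int offset = 1;

    // Each thread loads two elements from global memory once.
    temp[2 * tid] = idata[2 * tid];
    temp[2 * tid + 1] = idata[2 * tid + 1];

    // Up sweep (reduction) entirely in shared memory.
    for (int d = n >> 1; d > 0; d >>= 1)
    {
        __syncthreads();
        if (tid < d)
        {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

    // Zero the root, then down sweep to form the exclusive scan.
    if (tid == 0)
    {
        temp[n - 1] = 0;
    }
    for (int d = 1; d < n; d *= 2)
    {
        offset >>= 1;
        __syncthreads();
        if (tid < d)
        {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    // Each thread writes its two scanned elements back once.
    odata[2 * tid] = temp[2 * tid];
    odata[2 * tid + 1] = temp[2 * tid + 1];
}
```

A single tile would be launched as `kernScanTileShared<<<1, n / 2, n * sizeof(int)>>>(n, dev_out, dev_in)`; a full array would additionally need a scan over the per-tile sums.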

We can see the impacts of parallelism beginning to manifest. For small arrays, the CPU scan is a much better option than the GPU; however, as the array size increases, the large number of threads on the GPU allows more computation to be performed at once. On the graph, this shows up as the CPU runtimes (in blue and orange) increasing at a much faster rate than the GPU runtimes. The Thrust implementation highlights this the most, because its runtime grows much more slowly as a function of array size than the CPU's. In theory, my GPU should ultimately reach better performance than the CPU, but I was not able to reach that point because the maximum array size supported was `2^27`. I believe part of the reason for the weaker GPU performance is that I have the lowest-tier Pascal GPU. One test I would like to run in the future is the scan algorithm on a GTX 1070 eGPU that I can access.

## Nsight Analysis
To get a better idea of why the Thrust library runs more quickly than the work-efficient implementation, we can trace the execution in Nsight to see which components take the longest to run. Below is the timeline comparing the work-efficient GPU scan (non-power-of-two) to the Thrust scan (non-power-of-two). The work-efficient scan runs from 9 ms to about 10.62 ms, and the Thrust scan runs from 10.67 ms to 12.15 ms.

<br>
<img src="img/CudaUtilization.png" width="750">


The bottom row shows the CUDA usage: green is copying data from host to device, red is copying data from device to host, and blue is kernel execution. As we can see, the green and red segments are about the same length, since the data needs to travel to the GPU and back in both cases. However, our work-efficient implementation executes kernels for a longer time. This leads me to believe that the computation itself takes longer in our algorithm, although I am not sure whether this also includes global memory reads. Either way, the GPU is busy for a longer period of time. The Thrust library makes optimizations that reduce memory access and compute time, which lets it achieve much better performance.
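One way to sanity-check this breakdown outside Nsight would be to bracket the copies and the kernel region with CUDA events. The sketch below is illustrative only: the kernel is a stand-in rather than the project's scan, and error checking is omitted.

```
#include <cstdio>

// Stand-in kernel so the sketch is self-contained; a real measurement
// would wrap the scan kernels instead.
__global__ void kernStandIn(int n, int *data) {
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (index < n)
    {
        data[index] += 1;
    }
}

static float elapsedMs(cudaEvent_t a, cudaEvent_t b) {
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, a, b);
    return ms;
}

int main() {
    const int n = 1 << 20;
    int *host = new int[n]();
    int *dev = nullptr;
    cudaMalloc((void**)&dev, n * sizeof(int));

    cudaEvent_t e0, e1, e2, e3;
    cudaEventCreate(&e0); cudaEventCreate(&e1);
    cudaEventCreate(&e2); cudaEventCreate(&e3);

    cudaEventRecord(e0);
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);   // "green" segment
    cudaEventRecord(e1);
    kernStandIn<<<(n + 127) / 128, 128>>>(n, dev);                    // "blue" segment
    cudaEventRecord(e2);
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);   // "red" segment
    cudaEventRecord(e3);
    cudaEventSynchronize(e3);

    printf("H2D %.3f ms | kernel %.3f ms | D2H %.3f ms\n",
           elapsedMs(e0, e1), elapsedMs(e1, e2), elapsedMs(e2, e3));

    cudaEventDestroy(e0); cudaEventDestroy(e1);
    cudaEventDestroy(e2); cudaEventDestroy(e3);
    cudaFree(dev);
    delete[] host;
    return 0;
}
```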
Binary file added img/CudaUtilization.png
Binary file added img/ScanResults.png
Binary file added img/StreamResults.png
20 changes: 17 additions & 3 deletions stream_compaction/common.cu
@@ -23,7 +23,13 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
{
return;
}

bools[index] = (idata[index] == 0) ? 0 : 1;
}

/**
@@ -32,8 +38,16 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
{
return;
}

if (bools[index] == 1)
{
odata[indices[index]] = idata[index];
}
}
}
}
75 changes: 70 additions & 5 deletions stream_compaction/cpu.cu
@@ -19,7 +19,18 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// DONE-Part 1
if (n <= 0)
{
timer().endCpuTimer();
return;
}

odata[0] = 0;
for (int i = 1; i < n; i++)
{
odata[i] = odata[i - 1] + idata[i - 1];
}
timer().endCpuTimer();
}

@@ -30,9 +41,26 @@
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// DONE-Part 1
if (n <= 0)
{
timer().endCpuTimer();
return 0;
}

int count = 0;
for (int i = 0; i < n; i++)
{
int data = idata[i];
if (data != 0)
{
odata[count] = data;
count++;
}
}

timer().endCpuTimer();
return count;
}

/**
@@ -42,9 +70,46 @@
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// DONE-Part 1
if (n <= 0)
{
timer().endCpuTimer();
return 0;
}

// Transform input into binary array
int* valid_indices = new int[n];
for (int i = 0; i < n; i++)
{
valid_indices[i] = (idata[i] == 0) ? 0 : 1;
}

// Run scan algorithm on valid index data
// Please note that I could not call the scan function as it starts
// the timer again and causes the program to crash.
int *output_indices = new int[n];
output_indices[0] = 0;
for (int i = 1; i < n; i++)
{
output_indices[i] = output_indices[i - 1] + valid_indices[i - 1];
}

// Write valid data to output based on indices computed in scan
int count = 0;
for (int i = 0; i < n; i++)
{
if (valid_indices[i] == 1)
{
odata[output_indices[i]] = idata[i];
count++;
}
}

delete[] valid_indices;
delete[] output_indices;

timer().endCpuTimer();
return count;
}
}
}
130 changes: 126 additions & 4 deletions stream_compaction/efficient.cu
@@ -3,6 +3,8 @@
#include "common.h"
#include "efficient.h"

#define BLOCK_SIZE 128

namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +14,77 @@
return timer;
}

/**
* Performs an in-place up sweep on the data.
*/
__global__ void kernUpSweep(int N, int d, int *data) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= N)
{
return;
}

if (index % (1 << (d + 1)) == 0)
{
data[index + (1 << (d + 1)) - 1] += data[index + (1 << d) - 1];
}
}

/**
* Performs an in-place down sweep on the data.
*/
__global__ void kernDownSweep(int N, int d, int *data) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= N)
{
return;
}

if (index % (1 << (d + 1)) == 0)
{
int t = data[index + (1 << d) - 1];
data[index + (1 << d) - 1] = data[index + (1 << (d + 1)) - 1];
data[index + (1 << (d + 1)) - 1] += t;
}
}

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int* odata, const int* idata) {
// Done - Part 3.1
int tree_depth = ilog2ceil(n);
int data_length = 1 << tree_depth;
int data_bytes = data_length * sizeof(int);

int* dev_data;
cudaMalloc((void**)&dev_data, data_bytes);
cudaMemset(dev_data, 0, data_bytes);
cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);

timer().startGpuTimer();
dim3 fullBlocksPerGrid((data_length + BLOCK_SIZE - 1) / BLOCK_SIZE);

// Perform up sweep
for (int d = 0; d <= tree_depth - 1; d++)
{
kernUpSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(data_length, d, dev_data);
}

// Zero out the root
cudaMemset(dev_data + data_length - 1, 0, sizeof(int));

// Perform down sweep
for (int d = tree_depth - 1; d >= 0; d--)
{
kernDownSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(data_length, d, dev_data);
}

timer().endGpuTimer();

// Copy scanned array to host
cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_data);
}

/**
@@ -31,10 +97,66 @@
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
// Done-Part 3.2
int tree_depth = ilog2ceil(n);
int data_length = 1 << tree_depth;
int data_bytes = data_length * sizeof(int);
int num_elts = 0;

int *dev_orig_data;
cudaMalloc((void**)&dev_orig_data, data_bytes);
cudaMemset(dev_orig_data, 0, data_bytes);
cudaMemcpy(dev_orig_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);

int *dev_output;
cudaMalloc((void**)&dev_output, n * sizeof(int));

int *dev_valid_indices;
cudaMalloc((void**)&dev_valid_indices, data_bytes);

int *dev_scanned_indices;
cudaMalloc((void**)&dev_scanned_indices, data_bytes);

timer().startGpuTimer();
dim3 fullBlocksPerGrid((data_length + BLOCK_SIZE - 1) / BLOCK_SIZE);

// Transform input into binary array
StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, BLOCK_SIZE>>>(data_length, dev_valid_indices, dev_orig_data);
cudaMemcpy(dev_scanned_indices, dev_valid_indices, data_bytes, cudaMemcpyDeviceToDevice);

// Perform up sweep
for (int d = 0; d <= tree_depth - 1; d++)
{
kernUpSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(data_length, d, dev_scanned_indices);
}

// Zero out the root
cudaMemset(dev_scanned_indices + data_length - 1, 0, sizeof(int));

// Perform down sweep
for (int d = tree_depth - 1; d >= 0; d--)
{
kernDownSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(data_length, d, dev_scanned_indices);
}

// Write valid data to output based on indices computed in scan (scatter)
StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, BLOCK_SIZE>>>(data_length, dev_output, dev_orig_data, dev_valid_indices, dev_scanned_indices);

timer().endGpuTimer();

// Copy filtered data to host
cudaMemcpy(odata, dev_output, n * sizeof(int), cudaMemcpyDeviceToHost);

// Copy number of elements
cudaMemcpy(&num_elts, dev_scanned_indices + data_length - 1, sizeof(int), cudaMemcpyDeviceToHost);

// Free all buffers
cudaFree(dev_orig_data);
cudaFree(dev_valid_indices);
cudaFree(dev_scanned_indices);
cudaFree(dev_output);

return num_elts;
}
}
}