Project 2: Shubham Sharma #19

Open · wants to merge 19 commits into base: main
48 changes: 42 additions & 6 deletions README.md
@@ -3,12 +3,48 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Shubham Sharma
* [LinkedIn](https://www.linkedin.com/in/codeshubham), [personal website](https://shubhvr.com/).
* Tested on: Windows 10, i7-9750H @ 2.26GHz, 16GB, GTX 1660ti 6GB (Personal Computer).
* GPU Compute Capability: 7.5

### (TODO: Your README)
## Stream Compaction
This project implements:
- a CPU version of Scan,
- a CPU version of Compact without the Scan algorithm,
- a CPU version of Compact with Scan,
- a naive GPU version of Scan,
- a work-efficient GPU version of Scan, and
- a work-efficient GPU version of Compact that reuses the work-efficient Scan.

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
The three CPU implementations are fully serial; no multi-threading is used. The simple CPU scan and compaction serve as baselines against which the parallel GPU implementations are compared. Results from CUDA's Thrust library are also included to compare execution times across all implementations.

## Performance Analysis
The project implements both CPU and GPU timing functions in a performance-timer class to conveniently measure time cost: `std::chrono` provides high-precision CPU timing, and CUDA events measure GPU performance.
I collected data across 8 executions with different array sizes. The program generates a new array of random values on each run, with a customisable array size. I varied the sizes in powers of two, from 2^6 all the way up to 2^28. Each algorithm is also run on non-power-of-two arrays, generated by truncating the power-of-two arrays.
![Performance Analysis](img/18.PNG)


## Scan Runtime Analysis
The performance of the four scan functions is graphed below.
![Scan Runtime Analysis](img/ScanAlgorithmAnalysis.png)

- CPU Scan vs. the others: from the graph above, for small arrays the CPU implementation is much faster than the GPU scans, but the gap narrows as the array size grows. The serial CPU loop pays no kernel-launch or device-memory overhead, while the GPU implementations only amortise that overhead at large sizes. At some array size the GPU implementations would overtake the CPU, but unfortunately my GPU ran out of memory before I could reach that point.
- Naive vs. Work-Efficient: across all tested sizes the naive implementation consistently outperformed the work-efficient one. This is because:
  - even though the number of active threads is limited during the upsweep and downsweep, the inactive threads in a warp still have to wait for the active ones to finish before they become available again;
  - those idle threads cannot be used by the GPU for other work at the same depth of an upsweep or downsweep, which reduces effective parallelism.
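The idle-thread effect above is easiest to see in a serial sketch of the work-efficient (Blelloch) scan: at each depth the amount of useful work halves, but a GPU still schedules whole warps for the launch. A hypothetical host-side illustration (assumes a power-of-two, non-empty array), not the project's kernel code:

```cpp
#include <vector>

// In-place exclusive (Blelloch) scan over a power-of-two array.
// At depth d the upsweep performs only n / 2^(d+1) additions; on the
// GPU the remaining lanes sit idle, which is the effect discussed above.
void blellochScan(std::vector<int>& data) {
    int n = (int)data.size();
    for (int stride = 1; stride < n; stride *= 2)          // upsweep
        for (int i = stride * 2 - 1; i < n; i += stride * 2)
            data[i] += data[i - stride];
    data[n - 1] = 0;                                       // clear the root
    for (int stride = n / 2; stride >= 1; stride /= 2)     // downsweep
        for (int i = stride * 2 - 1; i < n; i += stride * 2) {
            int t = data[i - stride];
            data[i - stride] = data[i];
            data[i] += t;
        }
}
```

Counting the inner-loop iterations per depth (n/2, n/4, …, 1) shows why parallel efficiency drops at the deepest levels even though total work is O(n).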

## Compaction Runtime Analysis
The performance of the three compaction algorithms is graphed below.
![Compaction Runtime Analysis](img/Compaction.png)

We see a trend similar to the Scan results above, as each compaction algorithm builds on the corresponding Scan implementation.

## Radix Sort (Extra Credit)
I have implemented a parallel LSB radix sort, which sorts by examining the bits of each number's binary representation, starting at the least significant bit and working up to the most significant bit.
To check the correctness of my implementation, I compared its results against CUDA's `thrust::sort`. I tested array sizes from 2^6 up to 2^26, and it matched in every scenario, validating the implementation. A screenshot of the result for an array of size 2^18 is shown below.

![Radix Sort](img/RadixSort.PNG)
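One split pass of the radix sort can be sketched on the host; it mirrors the PopulateBits / scan / ComputeTArray / PerformScatter kernel sequence. A simplified serial sketch with hypothetical names (`radixPass`, `radixSort`), not the CUDA code itself:

```cpp
#include <vector>

// One stable split on bit `b`: elements whose bit is 0 keep their
// relative order at the front; elements whose bit is 1 follow, in order.
std::vector<int> radixPass(const std::vector<int>& in, int b) {
    int n = (int)in.size();
    std::vector<int> notBit(n), falseIdx(n, 0), out(n);
    for (int i = 0; i < n; ++i)                    // "PopulateNotBits"
        notBit[i] = ((in[i] >> b) & 1) ? 0 : 1;
    for (int i = 1; i < n; ++i)                    // exclusive scan
        falseIdx[i] = falseIdx[i - 1] + notBit[i - 1];
    int totalFalses = falseIdx[n - 1] + notBit[n - 1];
    for (int i = 0; i < n; ++i) {                  // scatter
        int dst = notBit[i] ? falseIdx[i]
                            : i - falseIdx[i] + totalFalses;  // the "t" array
        out[dst] = in[i];
    }
    return out;
}

// Sort non-negative ints whose values fit in `bits` bits.
std::vector<int> radixSort(std::vector<int> v, int bits) {
    for (int b = 0; b < bits; ++b) v = radixPass(v, b);
    return v;
}
```

Because each pass is a stable split, after processing bits 0..k-1 the array is fully ordered by its low k bits; the GPU version performs the same per-pass steps with kernels plus the work-efficient scan.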

## Blooper
Apparently, on the GPU side, `pow(2, 12)` was returning a truncated value (4095 instead of 4096), which is super absurd: `pow` works in floating point, so the result can land just below the true power and round down on conversion to `int`. I fixed it by using the bitwise operation `1 << 12`, which gave the expected result 4096.
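A safe pattern, assuming the culprit was `pow()`'s floating-point rounding: compute integer powers of two with a shift (hypothetical helper name):

```cpp
// pow() returns a double; in device code it can come back as
// 4095.999..., which truncates to 4095 on assignment to an int.
// A shift stays in integer arithmetic and is exact.
inline int pow2(int e) { return 1 << e; }
```

`pow2(12)` is exactly 4096, with no float round-trip to go wrong.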
Binary file added img/18.PNG
Binary file added img/22.PNG
Binary file added img/24.PNG
Binary file added img/26.PNG
Binary file added img/28.PNG
Binary file added img/2power_10.PNG
Binary file added img/2power_12.PNG
Binary file added img/2power_6.PNG
Binary file added img/Compaction.png
Binary file added img/RadixSort.PNG
Binary file added img/ScanAlgorithmAnalysis.png
51 changes: 48 additions & 3 deletions src/main.cpp
@@ -12,12 +12,15 @@
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"
#include <stream_compaction/RadixSort.h>

const int SIZE = 1 << 8; // feel free to change the size of array
//const int SIZE = 1 << 4; // feel free to change the size of array
const int SIZE = 1 << 28; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];
int *d = new int[SIZE];

int main(int argc, char* argv[]) {
// Scan tests
@@ -51,7 +54,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
//printArray(SIZE, c, false);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -64,7 +67,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
//printArray(SIZE, c, false);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
@@ -95,6 +98,46 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("** Radix Sort **\n");
printf("*****************************\n");

zeroArray(SIZE, d);
printDesc("Radix Sort(power-of-two): Thrust");
StreamCompaction::RadixSort::PerformThrustSort(SIZE, d, a);
printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, a, false);
//printArray(SIZE, d, false);


zeroArray(SIZE, c);
printDesc("Radix Sort(power-of-two): My Implementation");
StreamCompaction::RadixSort::PerformGPUSort(SIZE, c, a);
printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, a, false);
//printArray(SIZE, c, false);
printCmpResult(SIZE, d, c);


zeroArray(NPOT, d);
printDesc("Radix Sort(non-power-of-two): Thrust");
StreamCompaction::RadixSort::PerformThrustSort(NPOT, d, a); // NPOT, to match the test's label
printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, a, false);
//printArray(SIZE, d, false);


zeroArray(NPOT, c);
printDesc("Radix Sort(non-power-of-two): My Implementation");
StreamCompaction::RadixSort::PerformGPUSort(NPOT, c, a); // NPOT, to match the test's label
printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, a, false);
//printArray(SIZE, c, false);
printCmpResult(NPOT, d, c);



printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
@@ -147,6 +190,8 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);



system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"RadixSort.h"
)

set(sources
@@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
"RadixSort.cu"
)

list(SORT headers)
173 changes: 173 additions & 0 deletions stream_compaction/RadixSort.cu
@@ -0,0 +1,173 @@
#include "RadixSort.h"
#include <cuda.h>
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
namespace StreamCompaction {
namespace RadixSort {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) // We can use defines provided in this project

int* dev_buf;
int* bufBit;
int* falseBuf;
int* trueBuf;
int* bufNotBits;
int* bufScatter;
int* bufAnswer;
#define blockSize 512


void AllocateMemory(int n)
{
cudaMalloc((void**)&dev_buf, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc dev_buf failed!");
cudaMalloc((void**)&bufBit, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc bufBit failed!");
cudaMalloc((void**)&falseBuf, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc falseBuf failed!");
cudaMalloc((void**)&trueBuf, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc trueBuf failed!");
cudaMalloc((void**)&bufNotBits, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc bufNotBits failed!");
cudaMalloc((void**)&bufScatter, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc bufScatter failed!");
cudaMalloc((void**)&bufAnswer, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc bufAnswer failed!");
cudaDeviceSynchronize();
}

void FreeMemory() {
cudaFree(dev_buf);
cudaFree(bufBit);
cudaFree(falseBuf);
cudaFree(trueBuf);
cudaFree(bufNotBits);
cudaFree(bufScatter);
cudaFree(bufAnswer);
}


__global__ void PopulateBits(int bitOrder, int* bufInputData, int* bufBit, int N)
{
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index > N - 1)
{
return;
}
int mask = 1 << bitOrder;
int masked_num = bufInputData[index] & mask;
int thebit = masked_num >> bitOrder;
bufBit[index] = thebit;
}

__global__ void PopulateNotBits(int *bitNotBits, const int* bufBits, int N)
{
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index > N - 1)
{
return;
}
if (bufBits[index] == 0)
{
bitNotBits[index] = 1;
return;
}
bitNotBits[index] = 0;
}

__global__ void CopyAnswerToInputBuf(int* BufAnswer, int* ScatterBuffer, int* InputBuf, int N)
{
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index > N - 1)
{
return;
}
BufAnswer[ScatterBuffer[index]] = InputBuf[index];
}


__global__ void ComputeTArray(int* outputBuf, int *falseBuf, int totalFalses, int N)
{
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index > N - 1)
{
return;
}
outputBuf[index] = index - falseBuf[index] + totalFalses;
}

__global__ void PerformScatter(int* outputBuf, int* inputBuf, int* bitBuf, int*falseBuf, int *trueBuf, int N)
{
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index > N - 1)
{
return;
}
if (bitBuf[index])
{
outputBuf[index] = trueBuf[index];
return;
}
outputBuf[index] = falseBuf[index];

}


void PerformThrustSort(int n, int* odata, const int* idata)
{
thrust::host_vector<int> hstIn(idata, idata + n);
timer().startGpuTimer();
thrust::sort(hstIn.begin(), hstIn.end());
timer().endGpuTimer();
// Note: sorting a host_vector runs Thrust's host backend; copy into a
// thrust::device_vector instead to benchmark the GPU sort path. The copy
// back to odata is kept outside the timed region.
thrust::copy(hstIn.begin(), hstIn.end(), odata);
}



void PerformGPUSort(int n, int* odata, const int* idata)
{
AllocateMemory(n);
cudaMemcpy(dev_buf, idata, sizeof(int) * n, cudaMemcpyHostToDevice);

dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
timer().startGpuTimer();
for (int i = 0; i < 6; i++) // 6 bit-passes cover the test values (< 2^6); widen for larger keys
{
PopulateBits << < fullBlocksPerGrid, blockSize >> > (i, dev_buf, bufBit, n);
cudaDeviceSynchronize();
PopulateNotBits << < fullBlocksPerGrid, blockSize >> > (bufNotBits, bufBit, n);
cudaDeviceSynchronize();

// Round-trip the "not bit" flags through the host-side work-efficient scan.
int* inputNotBits = new int[n];
cudaMemcpy(inputNotBits, bufNotBits, n * sizeof(int), cudaMemcpyDeviceToHost);
Efficient::scan(n, odata, inputNotBits);
cudaMemcpy(falseBuf, odata, n * sizeof(int), cudaMemcpyHostToDevice);

int TotalFalses = inputNotBits[n - 1] + odata[n - 1];
delete[] inputNotBits; // was leaked once per bit-pass
ComputeTArray << < fullBlocksPerGrid, blockSize >> > (trueBuf, falseBuf, TotalFalses, n);
cudaDeviceSynchronize();
PerformScatter << < fullBlocksPerGrid, blockSize >> > (bufScatter, dev_buf, bufBit, falseBuf, trueBuf, n);
cudaDeviceSynchronize();
CopyAnswerToInputBuf << < fullBlocksPerGrid, blockSize >> > (bufAnswer, bufScatter, dev_buf, n);
cudaDeviceSynchronize();
std::swap(dev_buf, bufAnswer);
cudaDeviceSynchronize();
}
timer().endGpuTimer();
cudaMemcpy(odata, dev_buf, sizeof(int) * n, cudaMemcpyDeviceToHost);
FreeMemory(); // buffers are re-allocated on every call, so release them here
}

}
}
13 changes: 13 additions & 0 deletions stream_compaction/RadixSort.h
@@ -0,0 +1,13 @@
#pragma once
#include "common.h"
#include "efficient.h"

namespace StreamCompaction {
namespace RadixSort {
StreamCompaction::Common::PerformanceTimer& timer();


void PerformThrustSort(int n, int* odata, const int* idata);
void PerformGPUSort(int n, int* odata, const int* idata);
}
}
22 changes: 22 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,17 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index > n - 1)
{
return;
}
if (idata[index] == 0)
{
bools[index] = 0;
return;
}
bools[index] = 1;
}

/**
@@ -33,6 +44,17 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index > n - 1)
{
return;
}

if (bools[index] == 0)
{
return;
}
odata[indices[index]] = idata[index];
}

}