Project 2: Shubham Sharma #19

Open · wants to merge 19 commits into base: main
48 changes: 42 additions & 6 deletions README.md
@@ -3,12 +3,48 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Shubham Sharma
* [LinkedIn](https://www.linkedin.com/in/codeshubham), [personal website](https://shubhvr.com/).
* Tested on: Windows 10, i7-9750H @ 2.26GHz, 16GB, GTX 1660ti 6GB (Personal Computer).
* GPU Compute Capability: 7.5

### (TODO: Your README)
## Stream Compaction
This project implements:
- a CPU version of Scan,
- a CPU version of Compact without the Scan algorithm,
- a CPU version of Compact with Scan,
- a naive GPU version of Scan,
- a work-efficient GPU version of Scan, and
- a work-efficient GPU version of Compact that reuses the work-efficient Scan.

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
The three CPU implementations are fully serial; no multi-threading is used. The simple CPU scan and compaction serve as baselines against which the parallel GPU implementations are compared. Results from CUDA's Thrust library are also included to compare execution times across all implementations.

## Performance Analysis
The project implements both CPU and GPU timing functions in a performance-timer class to conveniently measure time cost: `std::chrono` provides high-precision CPU timing, and CUDA events measure GPU performance.
I collected data across 8 executions with different array sizes. The program generates a new array of random values on each run, with a customisable array size. I varied the sizes in powers of two, from 2^6 all the way up to 2^28. Each algorithm is also run on non-power-of-two arrays, generated by truncating the power-of-two arrays.
![Performance Analysis](img/18.PNG)


## Scan Runtime Analysis
The performance of the four scan functions is graphed below.
![Scan Runtime Analysis](img/ScanAlgorithmAnalysis.png)

- CPU Scan vs. the others: from the graph above, for small arrays the CPU implementation is much faster than the GPU scans, but the gap narrows as the array size grows. The serial CPU loop pays no kernel-launch or device-memory overhead, while the GPU implementations only amortise that overhead at large sizes. At some array size the GPU implementations would overtake the CPU, but unfortunately my GPU ran out of memory before I could reach that point.
- Naive vs. Work-Efficient: across all tested sizes the naive implementation consistently outperformed the work-efficient one. This is because:
  - even though the number of active threads is limited during the upsweep and downsweep, the inactive threads in a warp still have to wait for the active ones to finish before they become available again;
  - those idle threads cannot be used by the GPU for other work at the same depth of an upsweep or downsweep, which reduces effective parallelism.
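The idle-thread effect above is easiest to see in a serial sketch of the work-efficient (Blelloch) scan: at each depth the amount of useful work halves, but a GPU still schedules whole warps for the launch. A hypothetical host-side illustration (assumes a power-of-two, non-empty array), not the project's kernel code:

```cpp
#include <vector>

// In-place exclusive (Blelloch) scan over a power-of-two array.
// At depth d the upsweep performs only n / 2^(d+1) additions; on the
// GPU the remaining lanes sit idle, which is the effect discussed above.
void blellochScan(std::vector<int>& data) {
    int n = (int)data.size();
    for (int stride = 1; stride < n; stride *= 2)          // upsweep
        for (int i = stride * 2 - 1; i < n; i += stride * 2)
            data[i] += data[i - stride];
    data[n - 1] = 0;                                       // clear the root
    for (int stride = n / 2; stride >= 1; stride /= 2)     // downsweep
        for (int i = stride * 2 - 1; i < n; i += stride * 2) {
            int t = data[i - stride];
            data[i - stride] = data[i];
            data[i] += t;
        }
}
```

Counting the inner-loop iterations per depth (n/2, n/4, …, 1) shows why parallel efficiency drops at the deepest levels even though total work is O(n).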

## Compaction Runtime Analysis
The performance of the three compaction algorithms is graphed below.
![Compaction Runtime Analysis](img/Compaction.png)

We see a trend similar to the Scan results above, as each compaction algorithm builds on the corresponding Scan implementation.

## Radix Sort (Extra Credit)
I have implemented a parallel LSB radix sort, which sorts by examining the bits of each number's binary representation, starting at the least significant bit and working up to the most significant bit.
To check the correctness of my implementation, I compared its results against CUDA's `thrust::sort`. I tested array sizes from 2^6 up to 2^26, and it matched in every scenario, validating the implementation. A screenshot of the result for an array of size 2^18 is shown below.

![Radix Sort](img/RadixSort.PNG)
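One split pass of the radix sort can be sketched on the host; it mirrors the PopulateBits / scan / ComputeTArray / PerformScatter kernel sequence. A simplified serial sketch with hypothetical names (`radixPass`, `radixSort`), not the CUDA code itself:

```cpp
#include <vector>

// One stable split on bit `b`: elements whose bit is 0 keep their
// relative order at the front; elements whose bit is 1 follow, in order.
std::vector<int> radixPass(const std::vector<int>& in, int b) {
    int n = (int)in.size();
    std::vector<int> notBit(n), falseIdx(n, 0), out(n);
    for (int i = 0; i < n; ++i)                    // "PopulateNotBits"
        notBit[i] = ((in[i] >> b) & 1) ? 0 : 1;
    for (int i = 1; i < n; ++i)                    // exclusive scan
        falseIdx[i] = falseIdx[i - 1] + notBit[i - 1];
    int totalFalses = falseIdx[n - 1] + notBit[n - 1];
    for (int i = 0; i < n; ++i) {                  // scatter
        int dst = notBit[i] ? falseIdx[i]
                            : i - falseIdx[i] + totalFalses;  // the "t" array
        out[dst] = in[i];
    }
    return out;
}

// Sort non-negative ints whose values fit in `bits` bits.
std::vector<int> radixSort(std::vector<int> v, int bits) {
    for (int b = 0; b < bits; ++b) v = radixPass(v, b);
    return v;
}
```

Because each pass is a stable split, after processing bits 0..k-1 the array is fully ordered by its low k bits; the GPU version performs the same per-pass steps with kernels plus the work-efficient scan.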

## Blooper
Apparently, on the GPU side, `pow(2, 12)` was returning a truncated value (4095 instead of 4096), which is super absurd: `pow` works in floating point, so the result can land just below the true power and round down on conversion to `int`. I fixed it by using the bitwise operation `1 << 12`, which gave the expected result 4096.
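A safe pattern, assuming the culprit was `pow()`'s floating-point rounding: compute integer powers of two with a shift (hypothetical helper name):

```cpp
// pow() returns a double; in device code it can come back as
// 4095.999..., which truncates to 4095 on assignment to an int.
// A shift stays in integer arithmetic and is exact.
inline int pow2(int e) { return 1 << e; }
```

`pow2(12)` is exactly 4096, with no float round-trip to go wrong.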
Binary file added img/18.PNG
Binary file added img/22.PNG
Binary file added img/24.PNG
Binary file added img/26.PNG
Binary file added img/28.PNG
Binary file added img/2power_10.PNG
Binary file added img/2power_12.PNG
Binary file added img/2power_6.PNG
Binary file added img/Compaction.png
Binary file added img/RadixSort.PNG
Binary file added img/ScanAlgorithmAnalysis.png
51 changes: 48 additions & 3 deletions src/main.cpp
@@ -12,12 +12,15 @@
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"
#include <stream_compaction/RadixSort.h>

const int SIZE = 1 << 8; // feel free to change the size of array
//const int SIZE = 1 << 4; // feel free to change the size of array
const int SIZE = 1 << 28; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];
int *d = new int[SIZE];

int main(int argc, char* argv[]) {
// Scan tests
@@ -51,7 +54,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
//printArray(SIZE, c, false);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -64,7 +67,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
//printArray(SIZE, c, false);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
@@ -95,6 +98,46 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("** Radix Sort **\n");
printf("*****************************\n");

zeroArray(SIZE, d);
printDesc("Radix Sort(power-of-two): Thrust");
StreamCompaction::RadixSort::PerformThrustSort(SIZE, d, a);
printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, a, false);
//printArray(SIZE, d, false);


zeroArray(SIZE, c);
printDesc("Radix Sort(power-of-two): My Implementation");
StreamCompaction::RadixSort::PerformGPUSort(SIZE, c, a);
printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, a, false);
//printArray(SIZE, c, false);
printCmpResult(SIZE, d, c);


zeroArray(NPOT, d);
printDesc("Radix Sort(non-power-of-two): Thrust");
StreamCompaction::RadixSort::PerformThrustSort(NPOT, d, a); // NPOT, to match the test's label
printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, a, false);
//printArray(SIZE, d, false);


zeroArray(NPOT, c);
printDesc("Radix Sort(non-power-of-two): My Implementation");
StreamCompaction::RadixSort::PerformGPUSort(NPOT, c, a); // NPOT, to match the test's label
printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, a, false);
//printArray(SIZE, c, false);
printCmpResult(NPOT, d, c);



printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
@@ -147,6 +190,8 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);



system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"RadixSort.h"
)

set(sources
@@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
"RadixSort.cu"
)

list(SORT headers)
173 changes: 173 additions & 0 deletions stream_compaction/RadixSort.cu
@@ -0,0 +1,173 @@
#include "RadixSort.h"
#include <cuda.h>
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
namespace StreamCompaction {
namespace RadixSort {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) // We can use defines provided in this project

int* dev_buf;
int* bufBit;
int* falseBuf;
int* trueBuf;
int* bufNotBits;
int* bufScatter;
int* bufAnswer;
#define blockSize 512


void AllocateMemory(int n)
{
cudaMalloc((void**)&dev_buf, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc dev_buf failed!");
cudaMalloc((void**)&bufBit, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc bufBit failed!");
cudaMalloc((void**)&falseBuf, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc falseBuf failed!");
cudaMalloc((void**)&trueBuf, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc trueBuf failed!");
cudaMalloc((void**)&bufNotBits, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc bufNotBits failed!");
cudaMalloc((void**)&bufScatter, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc bufScatter failed!");
cudaMalloc((void**)&bufAnswer, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc bufAnswer failed!");
cudaDeviceSynchronize();
}

void FreeMemory() {
cudaFree(dev_buf);
cudaFree(bufBit);
cudaFree(falseBuf);
cudaFree(trueBuf);
cudaFree(bufNotBits);
cudaFree(bufScatter);
cudaFree(bufAnswer);
}


__global__ void PopulateBits(int bitOrder, int* bufInputData, int* bufBit, int N)
{
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index > N - 1)
{
return;
}
int mask = 1 << bitOrder;
int masked_num = bufInputData[index] & mask;
int thebit = masked_num >> bitOrder;
bufBit[index] = thebit;
}

__global__ void PopulateNotBits(int *bitNotBits, const int* bufBits, int N)
{
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index > N - 1)
{
return;
}
if (bufBits[index] == 0)
{
bitNotBits[index] = 1;
return;
}
bitNotBits[index] = 0;
}

__global__ void CopyAnswerToInputBuf(int* BufAnswer, int* ScatterBuffer, int* InputBuf, int N)
{
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index > N - 1)
{
return;
}
BufAnswer[ScatterBuffer[index]] = InputBuf[index];
}


__global__ void ComputeTArray(int* outputBuf, int *falseBuf, int totalFalses, int N)
{
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index > N - 1)
{
return;
}
outputBuf[index] = index - falseBuf[index] + totalFalses;
}

__global__ void PerformScatter(int* outputBuf, int* inputBuf, int* bitBuf, int*falseBuf, int *trueBuf, int N)
{
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index > N - 1)
{
return;
}
if (bitBuf[index])
{
outputBuf[index] = trueBuf[index];
return;
}
outputBuf[index] = falseBuf[index];

}


void PerformThrustSort(int n, int* odata, const int* idata)
{
thrust::host_vector<int> hstIn(idata, idata + n);
timer().startGpuTimer();
thrust::sort(hstIn.begin(), hstIn.end());
timer().endGpuTimer();
// Note: sorting a host_vector runs Thrust's host backend; copy into a
// thrust::device_vector instead to benchmark the GPU sort path. The copy
// back to odata is kept outside the timed region.
thrust::copy(hstIn.begin(), hstIn.end(), odata);
}



void PerformGPUSort(int n, int* odata, const int* idata)
{
AllocateMemory(n);
cudaMemcpy(dev_buf, idata, sizeof(int) * n, cudaMemcpyHostToDevice);

dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
timer().startGpuTimer();
for (int i = 0; i < 6; i++) // 6 bit-passes cover the test values (< 2^6); widen for larger keys
{
PopulateBits << < fullBlocksPerGrid, blockSize >> > (i, dev_buf, bufBit, n);
cudaDeviceSynchronize();
PopulateNotBits << < fullBlocksPerGrid, blockSize >> > (bufNotBits, bufBit, n);
cudaDeviceSynchronize();

// Round-trip the "not bit" flags through the host-side work-efficient scan.
int* inputNotBits = new int[n];
cudaMemcpy(inputNotBits, bufNotBits, n * sizeof(int), cudaMemcpyDeviceToHost);
Efficient::scan(n, odata, inputNotBits);
cudaMemcpy(falseBuf, odata, n * sizeof(int), cudaMemcpyHostToDevice);

int TotalFalses = inputNotBits[n - 1] + odata[n - 1];
delete[] inputNotBits; // was leaked once per bit-pass
ComputeTArray << < fullBlocksPerGrid, blockSize >> > (trueBuf, falseBuf, TotalFalses, n);
cudaDeviceSynchronize();
PerformScatter << < fullBlocksPerGrid, blockSize >> > (bufScatter, dev_buf, bufBit, falseBuf, trueBuf, n);
cudaDeviceSynchronize();
CopyAnswerToInputBuf << < fullBlocksPerGrid, blockSize >> > (bufAnswer, bufScatter, dev_buf, n);
cudaDeviceSynchronize();
std::swap(dev_buf, bufAnswer);
cudaDeviceSynchronize();
}
timer().endGpuTimer();
cudaMemcpy(odata, dev_buf, sizeof(int) * n, cudaMemcpyDeviceToHost);
FreeMemory(); // buffers are re-allocated on every call, so release them here
}

}
}
13 changes: 13 additions & 0 deletions stream_compaction/RadixSort.h
@@ -0,0 +1,13 @@
#pragma once
#include "common.h"
#include "efficient.h"

namespace StreamCompaction {
namespace RadixSort {
StreamCompaction::Common::PerformanceTimer& timer();


void PerformThrustSort(int n, int* odata, const int* idata);
void PerformGPUSort(int n, int* odata, const int* idata);
}
}
22 changes: 22 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,17 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index > n - 1)
{
return;
}
if (idata[index] == 0)
{
bools[index] = 0;
return;
}
bools[index] = 1;
}

/**
@@ -33,6 +44,17 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index > n - 1)
{
return;
}

if (bools[index] == 0)
{
return;
}
odata[indices[index]] = idata[index];
}

}