Project 2: Aditya Hota #6

Open, wants to merge 6 commits into main
101 changes: 95 additions & 6 deletions README.md
@@ -3,12 +3,101 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Aditya Hota
* [LinkedIn](https://www.linkedin.com/in/aditya-hota)
* Tested on: Windows 11, i7-8750H @ 2.20 GHz, 20 GB RAM, GTX 1050 Ti with Max-Q Design 6 GB (personal laptop)

# Overview
This project involved implementing two algorithms, scan and stream compaction, in parallel on the GPU. This allows us to process a large set of data across many GPU threads at once, rather than relying on the sequential nature of the CPU.

## Features
* A naive parallel scan algorithm implemented in CUDA
* A work-efficient parallel scan algorithm implemented in CUDA
* A stream compaction algorithm which leverages the scan algorithm to remove null data elements from an array
* CPU implementations of the scan and stream compaction algorithms to provide a baseline for comparison

# Sample Output
The program generates two arrays: one with a power-of-two length and one slightly smaller. The scan algorithm is then run on both arrays. Lastly, the arrays are compacted by removing any zeroes; the scan results give the indices at which the non-zero values should be written in the final output array.
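To make the index bookkeeping concrete, below is a minimal host-side sketch of scan-then-scatter compaction on a tiny, made-up array. It mirrors the description above rather than the project's CUDA path; the array contents and names are for illustration only.

```
#include <cstdio>

int main() {
    const int n = 8;
    int idata[n] = { 3, 0, 5, 0, 0, 2, 7, 0 };
    int bools[n];    // 1 where the element should be kept
    int indices[n];  // exclusive scan of bools = output position of each kept element
    int odata[n];

    // Map to booleans: non-zero -> 1, zero -> 0.
    for (int i = 0; i < n; i++) {
        bools[i] = (idata[i] != 0) ? 1 : 0;
    }

    // Exclusive prefix sum (scan) over the booleans.
    indices[0] = 0;
    for (int i = 1; i < n; i++) {
        indices[i] = indices[i - 1] + bools[i - 1];
    }

    // Scatter: each kept element goes to the slot named by its scan result.
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (bools[i] == 1) {
            odata[indices[i]] = idata[i];
            count++;
        }
    }

    // Prints "3 5 2 7 (count = 4)".
    for (int i = 0; i < count; i++) {
        printf("%d ", odata[i]);
    }
    printf("(count = %d)\n", count);
    return 0;
}
```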

Below is the sample output for a size 1024 array.

```
****************
** SCAN TESTS **
****************
[ 35 34 18 10 18 0 9 37 11 41 35 17 1 ... 2 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.011ms (std::chrono Measured)
[ 0 35 69 87 97 115 115 124 161 172 213 248 265 ... 100120 100122 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0109ms (std::chrono Measured)
[ 0 35 69 87 97 115 115 124 161 172 213 248 265 ... 100031 100079 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.562432ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.48176ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.32256ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.32768ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.663552ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.059392ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 3 3 2 2 2 0 1 3 3 1 1 0 ... 1 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.0173ms (std::chrono Measured)
[ 2 3 3 2 2 2 1 3 3 1 1 2 2 ... 2 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.021ms (std::chrono Measured)
[ 2 3 3 2 2 2 1 3 3 1 1 2 2 ... 2 2 ]
passed
==== cpu compact with scan ====
elapsed time: 0.0583ms (std::chrono Measured)
[ 2 3 3 2 2 2 1 3 3 1 1 2 2 ... 2 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.646144ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.377856ms (CUDA Measured)
passed
```

# Performance Analysis
Several versions of the scan algorithm were implemented: CPU, naive, work-efficient, and Thrust. The CPU implementation sequentially performs the scan, and a scatter step then removes any zero elements. The naive GPU implementation adds pairs of partial sums at increasing offsets over log n passes until every prefix sum is complete. The work-efficient algorithm uses an up-sweep (reduction) followed by a down-sweep, which performs asymptotically less work and should theoretically result in improved performance.
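The naive scan source is not part of the diff shown further down, so here is a hedged sketch of one pass of a naive (Hillis-Steele style) scan kernel; the kernel name, signature, and the ping-pong buffering described in the comments are assumptions rather than this repository's exact code.

```
// One pass of a naive (Hillis-Steele) scan: each element adds the partial
// sum `offset` positions to its left. The kernel name and signature are
// assumptions, not necessarily what naive.cu in this project uses.
__global__ void kernNaiveScanPass(int n, int offset, int *odata, const int *idata) {
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (index >= n)
    {
        return;
    }

    if (index >= offset)
    {
        odata[index] = idata[index - offset] + idata[index];
    }
    else
    {
        odata[index] = idata[index];
    }
}

// Host side (sketch): launch ilog2ceil(n) passes with offset = 1, 2, 4, ...,
// swapping (ping-ponging) the input and output buffers between passes. This
// yields an inclusive scan; shifting the result right by one element and
// inserting a leading zero converts it to the exclusive scan reported above.
```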

The best performance was achieved using a block size of 128 on my GPU; this gave me the smallest runtime for both the naive and work-efficient implementations of the scan algorithms. I tried powers of 2 ranging from 64 to 1024.

From the graphs shown below, we can see that we did not attain the improvement from the work-efficient algorithm that we had hoped for. In fact, the work-efficient scan performs only marginally better than the naive algorithm and worse than the CPU implementation. The Thrust implementation performs the best, but this is expected because it applies many optimizations, such as keeping otherwise-idle threads busy and using shared memory. Please note that the graphs are log-log plots, and that `npo2` in the legend refers to the non-power-of-two test case.

<br>
<img src="img/ScanResults.png" width="750">

<br>
<img src="img/StreamResults.png" width="750">

The performance hit for the naive and work-efficient algorithms is largely due to global memory accesses in the up-sweep and down-sweep steps. Every step reads and writes the global data arrays passed into the kernels, which is very inefficient because global memory is the slowest memory to access. An optimization would be to use shared memory (similar to the matrix-multiply example) along with data prefetching. Since we are performing only addition operations, the performance hit is not due to compute instructions.
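As a sketch of that suggestion, the kernel below scans a single block-sized tile entirely in shared memory, following the textbook Blelloch pattern (GPU Gems 3) rather than code from this repository. It assumes the tile length `n` is a power of two, that each thread handles two elements, and it omits bank-conflict padding as well as the extra pass needed to stitch per-tile results together for larger arrays.

```
// Work-efficient scan of one tile held in shared memory (Blelloch pattern).
// Illustrative only: not part of this repository, and limited to a single
// tile of power-of-two length n processed by n / 2 threads.
__global__ void kernScanTileShared(int n, int *odata, const int *idata) {
    extern __shared__ int temp[];  // one tile staged in shared memory
    int tid = threadIdx.x;
    int offset = 1;

    // Each thread loads two elements from global memory once.
    temp[2 * tid] = idata[2 * tid];
    temp[2 * tid + 1] = idata[2 * tid + 1];

    // Up sweep (reduction) entirely in shared memory.
    for (int d = n >> 1; d > 0; d >>= 1)
    {
        __syncthreads();
        if (tid < d)
        {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

    // Zero the root, then down sweep to form the exclusive scan.
    if (tid == 0)
    {
        temp[n - 1] = 0;
    }
    for (int d = 1; d < n; d *= 2)
    {
        offset >>= 1;
        __syncthreads();
        if (tid < d)
        {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    // Each thread writes its two scanned elements back once.
    odata[2 * tid] = temp[2 * tid];
    odata[2 * tid + 1] = temp[2 * tid + 1];
}
```

A single tile would be launched as `kernScanTileShared<<<1, n / 2, n * sizeof(int)>>>(n, dev_out, dev_in)`; a full array would additionally need a scan over the per-tile sums.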

We can see the impacts of parallelism beginning to manifest. For small arrays, the CPU scan is a much better option than the GPU; however, as the array size increases, the large number of threads on the GPU allows more computation to be performed at once. On the graph, this shows up as the CPU runtimes (in blue and orange) increasing at a much faster rate than the GPU runtimes. The Thrust implementation highlights this the most, because its runtime grows much more slowly as a function of array size than the CPU's. In theory, my GPU should ultimately reach better performance than the CPU, but I was not able to reach that point because the maximum array size supported was `2^27`. I believe part of the reason for the weaker GPU performance is that I have the lowest-tier Pascal GPU. One test I would like to run in the future is the scan algorithm on a GTX 1070 eGPU that I can access.

## Nsight Analysis
To get a better idea of why the Thrust library runs more quickly than the work-efficient implementation, we can trace the execution in Nsight to see which components take the longest to run. Below is the timeline comparing the work-efficient GPU scan (non-power-of-two) to the Thrust scan (non-power-of-two). The work-efficient scan runs from 9 ms to about 10.62 ms, and the Thrust scan runs from 10.67 ms to 12.15 ms.

<br>
<img src="img/CudaUtilization.png" width="750">


The bottom row shows the CUDA usage: green is copying data from host to device, red is copying data from device to host, and blue is kernel execution. As we can see, the green and red segments are about the same length, since the data needs to travel to the GPU and back in both cases. However, our work-efficient implementation executes kernels for a longer time. This leads me to believe that the computation itself takes longer in our algorithm, although I am not sure whether this also includes global memory reads. Either way, the GPU is busy for a longer period of time. The Thrust library makes optimizations that reduce memory access and compute time, which lets it achieve much better performance.
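One way to sanity-check this breakdown outside Nsight would be to bracket the copies and the kernel region with CUDA events. The sketch below is illustrative only: the kernel is a stand-in rather than the project's scan, and error checking is omitted.

```
#include <cstdio>

// Stand-in kernel so the sketch is self-contained; a real measurement
// would wrap the scan kernels instead.
__global__ void kernStandIn(int n, int *data) {
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (index < n)
    {
        data[index] += 1;
    }
}

static float elapsedMs(cudaEvent_t a, cudaEvent_t b) {
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, a, b);
    return ms;
}

int main() {
    const int n = 1 << 20;
    int *host = new int[n]();
    int *dev = nullptr;
    cudaMalloc((void**)&dev, n * sizeof(int));

    cudaEvent_t e0, e1, e2, e3;
    cudaEventCreate(&e0); cudaEventCreate(&e1);
    cudaEventCreate(&e2); cudaEventCreate(&e3);

    cudaEventRecord(e0);
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);   // "green" segment
    cudaEventRecord(e1);
    kernStandIn<<<(n + 127) / 128, 128>>>(n, dev);                    // "blue" segment
    cudaEventRecord(e2);
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);   // "red" segment
    cudaEventRecord(e3);
    cudaEventSynchronize(e3);

    printf("H2D %.3f ms | kernel %.3f ms | D2H %.3f ms\n",
           elapsedMs(e0, e1), elapsedMs(e1, e2), elapsedMs(e2, e3));

    cudaEventDestroy(e0); cudaEventDestroy(e1);
    cudaEventDestroy(e2); cudaEventDestroy(e3);
    cudaFree(dev);
    delete[] host;
    return 0;
}
```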
Binary file added img/CudaUtilization.png
Binary file added img/ScanResults.png
Binary file added img/StreamResults.png
20 changes: 17 additions & 3 deletions stream_compaction/common.cu
@@ -23,7 +23,13 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
{
return;
}

bools[index] = (idata[index] == 0) ? 0 : 1;
}

/**
@@ -32,8 +38,16 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
{
return;
}

if (bools[index] == 1)
{
odata[indices[index]] = idata[index];
}
}
}
}
75 changes: 70 additions & 5 deletions stream_compaction/cpu.cu
@@ -19,7 +19,18 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// DONE-Part 1
if (n <= 0)
{
timer().endCpuTimer();
return;
}

odata[0] = 0;
for (int i = 1; i < n; i++)
{
odata[i] = odata[i - 1] + idata[i - 1];
}
timer().endCpuTimer();
}

@@ -30,9 +41,26 @@
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// DONE-Part 1
if (n <= 0)
{
timer().endCpuTimer();
return 0;
}

int count = 0;
for (int i = 0; i < n; i++)
{
int data = idata[i];
if (data != 0)
{
odata[count] = data;
count++;
}
}

timer().endCpuTimer();
return count;
}

/**
@@ -42,9 +70,46 @@
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// DONE-Part 1
if (n <= 0)
{
timer().endCpuTimer();
return 0;
}

// Transform input into binary array
int* valid_indices = new int[n];
for (int i = 0; i < n; i++)
{
valid_indices[i] = (idata[i] == 0) ? 0 : 1;
}

// Run scan algorithm on valid index data
// Please note that I could not call the scan function as it starts
// the timer again and causes the program to crash.
int *output_indices = new int[n];
output_indices[0] = 0;
for (int i = 1; i < n; i++)
{
output_indices[i] = output_indices[i - 1] + valid_indices[i - 1];
}

// Write valid data to output based on indices computed in scan
int count = 0;
for (int i = 0; i < n; i++)
{
if (valid_indices[i] == 1)
{
odata[output_indices[i]] = idata[i];
count++;
}
}

delete[] valid_indices;
delete[] output_indices;

timer().endCpuTimer();
return count;
}
}
}
130 changes: 126 additions & 4 deletions stream_compaction/efficient.cu
@@ -3,6 +3,8 @@
#include "common.h"
#include "efficient.h"

#define BLOCK_SIZE 128

namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +14,77 @@
return timer;
}

/**
* Performs an in-place up sweep on the data.
*/
__global__ void kernUpSweep(int N, int d, int *data) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= N)
{
return;
}

if (index % (1 << (d + 1)) == 0)
{
data[index + (1 << (d + 1)) - 1] += data[index + (1 << d) - 1];
}
}

/**
* Performs an in-place down sweep on the data.
*/
__global__ void kernDownSweep(int N, int d, int *data) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= N)
{
return;
}

if (index % (1 << (d + 1)) == 0)
{
int t = data[index + (1 << d) - 1];
data[index + (1 << d) - 1] = data[index + (1 << (d + 1)) - 1];
data[index + (1 << (d + 1)) - 1] += t;
}
}

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int* odata, const int* idata) {
// Done - Part 3.1
int tree_depth = ilog2ceil(n);
int data_length = 1 << tree_depth;
int data_bytes = data_length * sizeof(int);

int* dev_data;
cudaMalloc((void**)&dev_data, data_bytes);
cudaMemset(dev_data, 0, data_bytes);
cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);

timer().startGpuTimer();
dim3 fullBlocksPerGrid((data_length + BLOCK_SIZE - 1) / BLOCK_SIZE);

// Perform up sweep
for (int d = 0; d <= tree_depth - 1; d++)
{
kernUpSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(data_length, d, dev_data);
}

// Zero out the root
cudaMemset(dev_data + data_length - 1, 0, sizeof(int));

// Perform down sweep
for (int d = tree_depth - 1; d >= 0; d--)
{
kernDownSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(data_length, d, dev_data);
}

timer().endGpuTimer();

// Copy scanned array to host
cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_data);
}

/**
@@ -31,10 +97,66 @@
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
// Done-Part 3.2
int tree_depth = ilog2ceil(n);
int data_length = 1 << tree_depth;
int data_bytes = data_length * sizeof(int);
int num_elts = 0;

int *dev_orig_data;
cudaMalloc((void**)&dev_orig_data, data_bytes);
cudaMemset(dev_orig_data, 0, data_bytes);
cudaMemcpy(dev_orig_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);

int *dev_output;
cudaMalloc((void**)&dev_output, n * sizeof(int));

int *dev_valid_indices;
cudaMalloc((void**)&dev_valid_indices, data_bytes);

int *dev_scanned_indices;
cudaMalloc((void**)&dev_scanned_indices, data_bytes);

timer().startGpuTimer();
dim3 fullBlocksPerGrid((data_length + BLOCK_SIZE - 1) / BLOCK_SIZE);

// Transform input into binary array
StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, BLOCK_SIZE>>>(data_length, dev_valid_indices, dev_orig_data);
cudaMemcpy(dev_scanned_indices, dev_valid_indices, data_bytes, cudaMemcpyDeviceToDevice);

// Perform up sweep
for (int d = 0; d <= tree_depth - 1; d++)
{
kernUpSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(data_length, d, dev_scanned_indices);
}

// Zero out the root
cudaMemset(dev_scanned_indices + data_length - 1, 0, sizeof(int));

// Perform down sweep
for (int d = tree_depth - 1; d >= 0; d--)
{
kernDownSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(data_length, d, dev_scanned_indices);
}

// Write valid data to output based on indices computed in scan (scatter)
StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, BLOCK_SIZE>>>(data_length, dev_output, dev_orig_data, dev_valid_indices, dev_scanned_indices);

timer().endGpuTimer();

// Copy filtered data to host
cudaMemcpy(odata, dev_output, n * sizeof(int), cudaMemcpyDeviceToHost);

// Copy number of elements
cudaMemcpy(&num_elts, dev_scanned_indices + data_length - 1, sizeof(int), cudaMemcpyDeviceToHost);

// Free all buffers
cudaFree(dev_orig_data);
cudaFree(dev_valid_indices);
cudaFree(dev_scanned_indices);
cudaFree(dev_output);

return num_elts;
}
}
}