Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Project 2: Bowen Deng #20

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
225 changes: 219 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,225 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Bowen Deng
* [LinkedIn](www.linkedin.com/in/bowen-deng-7dbw13), [twitter](https://twitter.com/7Dbw13)
* Tested on: Windows 10, AMD Ryzen 9 5900HX with Radeon Graphics @ 3.30GHz 16GB, GeForce RTX 3070 Laptop GPU 8GB (Personal Computer)

### (TODO: Your README)
## Abstract

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
GPU stream compaction and scan (prefix sum) algorithm implemented in CUDA.
> Examples

* scan:
- goal: produce a prefix sum array of a given array (we only care about exclusive scan here)
- input
- [1 5 0 1 2 0 3]
- output
- [0 1 6 6 7 9 9]
* compact:
- goal: closely and neatly packed the elements != 0
- input
- [1 5 0 1 2 0 3]
- output
- [1 5 1 2 3]

Different kinds of scan algorithms implemented in this project:

1. CPU based scan
2. Naive GPU scan
3. Work-Efficient GPU scan
4. GPU scan using Thrust
5. Upgraded GPU efficient scan (*extra credit*)
6. GPU scan using shared memory (*extra credit*)

Different kinds of compaction implemented in this project:

1. CPU based compaction
2. CPU compaction using scan
3. GPU compaction using work-efficient GPU scan

## CMake Note

The `CMakeLists.txt` files are modified. New files are added in the following way

```
.
└── stream_compaction
├── radix_sort.cu
└── radix_sort.h
```

## Output

Run the test program on array with size `65536`
```
****************
** SCAN TESTS **
****************
[ 29 17 2 3 30 15 27 26 30 1 2 23 4 ... 43 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.0295ms (std::chrono Measured)
[ 0 29 46 48 51 81 96 123 149 179 180 182 205 ... 1610735 1610778 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0266ms (std::chrono Measured)
[ 0 29 46 48 51 81 96 123 149 179 180 182 205 ... 1610664 1610685 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.093184ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.093184ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.089088ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.090112ms (CUDA Measured)
passed
==== work-efficient-upgraded scan, power-of-two ====
elapsed time: 0.090112ms (CUDA Measured)
passed
==== work-efficient-upgraded scan, non-power-of-two ====
elapsed time: 0.089088ms (CUDA Measured)
passed
==== work-efficient-shared scan, power-of-two ====
elapsed time: 0.038912ms (CUDA Measured)
passed
==== work-efficient-shared scan, non-power-of-two ====
elapsed time: 0.038912ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.041984ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.045056ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 1 0 3 0 3 3 0 0 3 0 1 2 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.1044ms (std::chrono Measured)
[ 1 1 3 3 3 3 1 2 2 3 3 1 3 ... 1 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.1054ms (std::chrono Measured)
[ 1 1 3 3 3 3 1 2 2 3 3 1 3 ... 3 3 ]
passed
==== cpu compact with scan ====
elapsed time: 0.2209ms (std::chrono Measured)
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.136256ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.137216ms (CUDA Measured)
passed
```

## Performance Analysis

### Measurement Metric

`std::chrono` and CUDA event are applied to measure the time cost of algorithms on CPU and GPU respectively. The unit used for timing is `ms`.

### Block Size Optimization

To optimize the block sizes for each kind of scan algorithm, I run all GPU algorithms with different block sizes on an array with size `65536`. The results are shown as below

![](img/block.png)

According to this, the block size of the upgraded GPU efficient scan is chosen to be `128`, while all others are set to be `256`.

### Scan Performance with Different Array Size

![](img/scan.png)

From the result we can see:

- The CPU based scan performs surprisingly well for a small array, but as the size grows, its performance drops dramatically due to high computation complexity.
- The naive GPU scan is even worse than the CPU method, and its performance also drops for a large size since the computation is also complex.
- The work-efficient GPU scan has a performance similar to naive method at first, but it outperforms both naive GPU and CPU scans when the size is large enough. It can be considered that memory I/Os limits the performance for small arrays.
- The Thrust GPU scan is robust. Its performance curve shows almost a linear increment.
- The upgraded GPU efficient scan and GPU scan using shared memory will be talked about with details in extra credit part.

### Compaction Performance with Different Array Size

![](img/compact.png)

Notice that when scan is applied, the performance of CPU compaction drops sharply. This is because scan brings additional computation to the compaction algorithm, which is necessary for parallel computing in GPU method.

## Extra Credit

### Why is My GPU Approach So Slow

Obviously, the work-efficient GPU scan is not so efficient that it is even worse than the CPU based scan and similar to naive GPU scan, when the size of array is not large enough. The reason is that as the loop level of upper/down sweep goes deeper, more and more threads becomes inactive. Let's take a look into this part for example
```
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}

// Only for multiple of 2^(d+1)
if ((index & ((1 << (d + 1)) - 1)) == 0) {
idata[index + (1 << (d + 1)) - 1] += idata[index + (1 << d) - 1];
}
```
The code above performs reduction on `idata`. Note the `if` condition here. Every loop `d` is increased by `1`, and so the number of threads, whose `index` meets the condition, will be halved. These threads do nothing actually.

To resolve this, I multiply `index` of each thread by `2^(d+1)`. Now all threads with new `index` less than `n` satisfy the previous condition. A new version is like
```
unsigned long int index = threadIdx.x + (blockIdx.x * blockDim.x);

// Index hack
// Make use of all threads
index *= (1 << (d + 1));

if (index >= n) {
return;
}

// 'index' is now multiple of 2^(d+1)
idata[index + (1 << (d + 1)) - 1] += idata[index + (1 << d) - 1];
```
Thus each thread is utilized. At the same time, since the total number of active threads stays unchanged, the number of blocks launched should be halved every loop
```
// Reduction for log(n) times
for (int d = 0; d < ilog2ceil(n); d++) {
// Halve the number of blocks launched in each turn
act_n /= 2;
dim3 fullBlocksPerGrid((act_n + BLOCK_SIZE_UPGRADED - 1) / BLOCK_SIZE_UPGRADED);
kern_reduction << <fullBlocksPerGrid, BLOCK_SIZE_UPGRADED >> > (n, d, idata);
checkCUDAError("kern_reduction failed!");
}
```
As a result, a performance gain can be observed in the performance figure when the size of array is large enough.

### Radix Sort

A radix sort is implemented using the upgraded GPU efficient scan in `radix_sort.cu`. It can be called by
```
Sort::radix_sort(int n, int num_bits, int *odata, const int *idata);
```
where `n` is the length of the array, and `num_bits` refers to number of bits in keys. This function sorts `idata` from smaller to larger and stores the result to `odata`.

The result of test added in `main.cpp` is (`5`-bit keys, array size `16`)
```
***************
** SORT TEST **
***************
[ 18 14 7 11 16 5 3 10 1 25 21 11 13 3 19 16 ]
==== radix sort, power-of-two ====
[ 1 3 3 5 7 10 11 11 13 14 16 16 18 19 21 25 ]
==== radix sort, non-power-of-two ====
[ 1 3 5 7 10 11 11 13 14 16 18 21 25 ]
```

### GPU Scan Using Shared Memory

The shared memory-based scan is implemented according to [GPU Gem Ch 39](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html). However, due to the memory limitation, the original method can only handle arrays with size smaller than twice the maximal number of threads in one block (`2048` for my computer).

To resolve this, first I break the input array into blocks, and then run the original scan on each block. The sum of each block is stored into a new array. The increments between blocks can be computed by scanning this new array. Finally adding these increments to each element in the corresponding block results in a fully-scanned array.

From the performance figure, this method shows significantly improvement on the performance, and it even beats Thrust scan in my experiment settings.
Binary file added img/block.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/compact.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/scan.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
81 changes: 74 additions & 7 deletions src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -11,9 +11,11 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/radix_sort.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 16; // feel free to change the size of array
//const int SIZE = 1 << 8; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
Expand All @@ -36,13 +38,13 @@ int main(int argc, char* argv[]) {
// At first all cases passed because b && c are all zeroes.
zeroArray(SIZE, b);
printDesc("cpu scan, power-of-two");
StreamCompaction::CPU::scan(SIZE, b, a);
StreamCompaction::CPU::scan(SIZE, b, a, true);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("cpu scan, non-power-of-two");
StreamCompaction::CPU::scan(NPOT, c, a);
StreamCompaction::CPU::scan(NPOT, c, a, true);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);
Expand All @@ -64,23 +66,65 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

//zeroArray(SIZE, c);
//printDesc("naive-shared scan, power-of-two");
//StreamCompaction::Naive_Shared::scan(SIZE, c, a, true);
//printElapsedTime(StreamCompaction::Naive_Shared::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
////printArray(SIZE, c, true);
//printCmpResult(SIZE, b, c);

//zeroArray(SIZE, c);
//printDesc("naive-shared scan, non-power-of-two");
//StreamCompaction::Naive_Shared::scan(NPOT, c, a, true);
//printElapsedTime(StreamCompaction::Naive_Shared::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
////printArray(NPOT, c, true);
//printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
StreamCompaction::Efficient::scan(SIZE, c, a, true);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
StreamCompaction::Efficient::scan(NPOT, c, a, true);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient-upgraded scan, power-of-two");
StreamCompaction::Efficient_Upgraded::scan(SIZE, c, a, true);
printElapsedTime(StreamCompaction::Efficient_Upgraded::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient-upgraded scan, non-power-of-two");
StreamCompaction::Efficient_Upgraded::scan(NPOT, c, a, true);
printElapsedTime(StreamCompaction::Efficient_Upgraded::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient-shared scan, power-of-two");
StreamCompaction::Efficient_Shared::scan(SIZE, c, a, true);
printElapsedTime(StreamCompaction::Efficient_Shared::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient-shared scan, non-power-of-two");
StreamCompaction::Efficient_Shared::scan(NPOT, c, a, true);
printElapsedTime(StreamCompaction::Efficient_Shared::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
Expand Down Expand Up @@ -130,7 +174,7 @@ int main(int argc, char* argv[]) {
printDesc("cpu compact with scan");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(count, c, true);
//printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
Expand All @@ -147,6 +191,29 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

printf("\n");
printf("***************\n");
printf("** SORT TEST **\n");
printf("***************\n");

// Sort test

int bits = 5;
int sort_size = 1 << 4;

genArray(sort_size, a, (1 << bits) - 1);
printArray(sort_size, a, true);

zeroArray(sort_size, b);
printDesc("radix sort, power-of-two");
Sort::radix_sort(sort_size, bits, b, a);
printArray(sort_size, b, true);

zeroArray(sort_size, c);
printDesc("radix sort, non-power-of-two");
Sort::radix_sort(sort_size - 3, bits, c, a);
printArray(sort_size - 3, c, true);

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
Expand Down
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"radix_sort.h"
)

set(sources
Expand All @@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
"radix_sort.cu"
)

list(SORT headers)
Expand Down
Loading