diff --git a/README.md b/README.md
index 0e38ddb..8ad68f1 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,147 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Stephen Lee
+  * [LinkedIn](https://www.linkedin.com/in/stephen-lee-bb5a40163/)
+* Tested on: Windows 10, i7-9750H @2.60GHz, RTX 2060 6GB (personal laptop)
 
-### (TODO: Your README)
+# Project Overview
+The goal of this project was to analyze the performance of some useful algorithms. These include `scan`, which computes a prefix sum on an input array; more about prefix sums, and about the algorithms presented in this project, can be read [here](https://en.wikipedia.org/wiki/Prefix_sum). We iterated on this scan implementation, creating a serial CPU algorithm, a naive parallel GPU algorithm, and a work-efficient parallel GPU algorithm. We then applied `scan` to compute other useful array operations such as `compaction`, which is described in more detail [here](https://stackoverflow.com/questions/34059753/cuda-stream-compaction-algorithm).
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
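+
+As a quick illustration (using a made-up eight-element input, not output from the test program), an exclusive scan stores at each position the sum of all elements before it:
+
+```
+input:          [ 3  1  7  0  4  1  6  3 ]
+exclusive scan: [ 0  3  4 11 11 15 16 22 ]
+```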
+
+### Features
+* CPU `scan` and `compaction`
+* Naive GPU implementation of `scan`
+* Work-efficient GPU implementation of `scan` and `compaction`
+* Wrapper for testing Thrust's `scan` implementation
+
+# Performance Analysis
+Performance was measured using the provided GPU and CPU timing classes, which time each algorithm's runtime in milliseconds (ms). Performance analysis began by determining the optimal block size to use for all subsequent tests. A few abbreviations are used to make the graphs more readable:
+
+* WE = work-efficient
+* POT = input size that is a power of 2
+* NPOT = input size that is not a power of 2
+
+### Finding an optimal block size
+Block size refers to the number of threads per block in the launch parameters of the GPU parallel algorithms, so only the GPU implementations in this project (naive and work-efficient) were tested here. A fixed input size of ~64,000,000 elements was used as a reasonably large input that should benefit from the GPU. Performance was measured for block sizes corresponding to powers of 2 (to map well onto the underlying hardware), in the range of 4 to 1024 threads per block. The results are summarized in the graph below:
+
+![](img/blocks.PNG)
+
+From this analysis, it was found that a block size of 128 threads per block made the most sense for all of our GPU algorithms on this test setup.
+
+### Comparing scan algorithms
+Once an optimal block size was determined, it was used to collect data on how the naive and work-efficient GPU implementations measured up to both the CPU version and the Thrust library, which has its own GPU implementation. The results for `scan` are summarized in the graph below:
+
+![](img/scan.PNG)
+
+It was found that input sizes below 1,000,000 elements tended to favor the CPU implementation over any of the GPU implementations. This is sensible, since the benefits of GPU programming only really appear once input sizes get large enough to leverage many parallel computations. As such, I have only plotted data for input sizes above 1,000,000, since smaller inputs were not particularly interesting. The naive implementation of `scan` tended to perform better than the work-efficient implementation up until 32,000,000 elements, at which point the work-efficient method took over. In both cases, Thrust's implementation is far superior, taking much less time than either of my implementations.
+
+This difference in performance can likely be attributed to a memory I/O bottleneck: both the naive and work-efficient implementations do many reads and writes to global memory, and there is little other computation to hide that latency. As for the differences between the naive and the work-efficient implementation, these are likely due to the asymptotic behavior of the two, where the naive implementation performs fewer serial steps for smaller input sizes than the work-efficient version.
+
+### Comparing compaction algorithms
+After analyzing the performance of `scan`, the `compaction` algorithm was used as an application of what we had just implemented. Here we compare the performance of a purely serial CPU approach, a CPU approach that uses `scan` as an intermediate computation, and a GPU approach that uses our work-efficient implementation. As with `scan`, the GPU did not really pull ahead of the CPU until the input reached 1,000,000 elements, so the graph below starts at this point.
+
+![](img/compaction.PNG)
+
+Here we see similar trends: the difference between the CPU versions and the work-efficient GPU version increased with input size (i.e., the gap between the lines grew with larger inputs). Since this algorithm is largely an extension of `scan` with a few extra computations, it makes sense that we see a very similar trend here.
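+
+To make the connection concrete, here is how compaction falls out of `scan` on a made-up five-element input (chosen purely for illustration): map each element to a boolean, exclusive-scan the booleans to get scatter indices, then scatter the kept elements.
+
+```
+input:   [ 3 0 1 0 2 ]
+bools:   [ 1 0 1 0 1 ]
+scan:    [ 0 1 1 2 2 ]   (exclusive scan of bools)
+scatter: out[scan[i]] = input[i] wherever bools[i] == 1  ->  [ 3 1 2 ]
+```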
+
+Here's a sample output with 131072 elements to be processed:
+```
+****************
+** SCAN TESTS **
+****************
+    [ 8 20 33 38 32 34 23 5 38 7 22 39 15 ... 24 0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 0.1906ms    (std::chrono Measured)
+    [ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206154 3206178 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0.1897ms    (std::chrono Measured)
+    [ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206075 3206103 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.346464ms    (CUDA Measured)
+    [ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206154 3206178 ]
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.249376ms    (CUDA Measured)
+    [ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 0 0 ]
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0.535904ms    (CUDA Measured)
+    [ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206154 3206178 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0.49104ms    (CUDA Measured)
+    [ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206075 3206103 ]
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.4896ms    (CUDA Measured)
+    [ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206154 3206178 ]
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.416352ms    (CUDA Measured)
+    [ 0 8 28 61 99 131 165 188 193 231 238 260 299 ... 3206075 3206103 ]
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [ 2 2 1 2 2 0 1 3 0 1 0 3 1 ... 0 0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 0.3009ms    (std::chrono Measured)
+    [ 2 2 1 2 2 1 3 1 3 1 2 3 3 ... 2 3 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 0.2984ms    (std::chrono Measured)
+    [ 2 2 1 2 2 1 3 1 3 1 2 3 3 ... 2 3 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 0.7242ms    (std::chrono Measured)
+    [ 2 2 1 2 2 1 3 1 3 1 2 3 3 ... 2 3 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 0.915008ms    (CUDA Measured)
+    [ 2 2 1 2 2 1 3 1 3 1 2 3 3 ... 2 3 ]
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 0.89984ms    (CUDA Measured)
+    [ 2 2 1 2 2 1 3 1 3 1 2 3 3 ... 2 3 ]
+    passed
+```
+
+# Bloopers
+The greatest issue that I ran into while working on this project was related to the down-sweep in my work-efficient scan. The bounds on the loop controlling the down-sweep kernel invocations were initially wrong:
+
+```
+cudaMemset(&dev_data[pow2 - 1], 0, sizeof(int));
+for (int d = maxDepth; d >= 0; d--) { // this should be d = maxDepth - 1 instead
+    kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(pow2, d, dev_data);
+}
+```
+This caused illegal memory accesses in the kernel, which I found surprising since I didn't find out about the error until the kernels had finished and I went to free the memory. (Kernel launches are asynchronous, so the illegal access is only reported by a later CUDA API call; here, the `cudaFree`.) I initially hadn't checked for errors after freeing, so the problem would spill over into the next call's malloc. Even weirder, when I ran the code in the Nsight debugger, it would set an automatic breakpoint in the kernel and would be able to properly calculate the result. I'm not really sure why this happened, but there's definitely something that I'm missing about this.
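+
+One way to surface this class of error at the failing launch instead of at the eventual `cudaFree` (a debugging sketch using this project's `checkCUDAError` helper; it belongs outside the timed region, since the synchronization stalls the pipeline):
+
+```
+kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(pow2, d, dev_data);
+checkCUDAError("kernDownSweep launch failed!");    // catches bad launch configurations
+cudaDeviceSynchronize();                           // wait for the kernel itself to finish
+checkCUDAError("kernDownSweep execution failed!"); // catches illegal accesses, etc.
+```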
+
+Here's the output:
+```
+****************
+** SCAN TESTS **
+****************
+    [ 34 24 8 17 15 39 31 25 19 9 26 19 26 ... 8 0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 0.7537ms    (std::chrono Measured)
+    [ 0 34 58 66 83 98 137 168 193 212 221 247 266 ... 12833463 12833471 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0.7565ms    (std::chrono Measured)
+    [ 0 34 58 66 83 98 137 168 193 212 221 247 266 ... 12833421 12833452 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.510144ms    (CUDA Measured)
+    [ 0 34 58 66 83 98 137 168 193 212 221 247 266 ... 12833463 12833471 ]
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.507712ms    (CUDA Measured)
+    [ 0 34 58 66 83 98 137 168 193 212 221 247 266 ... 0 0 ]
+    passed
+==== work-efficient scan, power-of-two ====
+CUDA error: cudaFree failed on dev_data!: an illegal memory access was encountered
+```
diff --git a/data.xlsx b/data.xlsx
new file mode 100644
index 0000000..7071752
Binary files /dev/null and b/data.xlsx differ
diff --git a/img/blocks.PNG b/img/blocks.PNG
new file mode 100644
index 0000000..ab9f3cf
Binary files /dev/null and b/img/blocks.PNG differ
diff --git a/img/compaction.PNG b/img/compaction.PNG
new file mode 100644
index 0000000..056e6d2
Binary files /dev/null and b/img/compaction.PNG differ
diff --git a/img/scan.PNG b/img/scan.PNG
new file mode 100644
index 0000000..688940f
Binary files /dev/null and b/img/scan.PNG differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..8029832 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 17; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -51,7 +51,7 @@ int main(int argc, char* argv[]) {
     printDesc("naive scan, power-of-two");
     StreamCompaction::Naive::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -64,35 +64,35 @@ int main(int argc, char* argv[]) {
     printDesc("naive scan, non-power-of-two");
     StreamCompaction::Naive::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, power-of-two");
     StreamCompaction::Efficient::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, non-power-of-two");
     StreamCompaction::Efficient::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, power-of-two");
     StreamCompaction::Thrust::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, non-power-of-two");
     StreamCompaction::Thrust::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     printf("\n");
@@ -137,14 +137,14 @@ int main(int argc, char* argv[]) {
     printDesc("work-efficient compact, power-of-two");
     count = StreamCompaction::Efficient::compact(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
     zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); system("pause"); // stop Win32 console from closing on exit diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..c6320b9 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -1,4 +1,5 @@ #include "common.h" +#include "device_launch_parameters.h" void checkCUDAErrorFn(const char *msg, const char *file, int line) { cudaError_t err = cudaGetLastError(); @@ -23,7 +24,10 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = threadIdx.x + (blockDim.x * blockIdx.x); + if (index >= n) return; + + bools[index] = idata[index] != 0 ? 1 : 0; } /** @@ -32,8 +36,10 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO - } + int index = threadIdx.x + (blockDim.x * blockIdx.x); + if (index >= n) return; + if (bools[index] == 1) odata[indices[index]] = idata[index]; + } } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index d2c1fed..3624d0a 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -12,6 +12,7 @@ #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) +#define blockSize 128 /** * Check for CUDA errors; print and exit if there was a problem. @@ -108,6 +109,16 @@ namespace StreamCompaction { return prev_elapsed_time_gpu_milliseconds; } + bool getCpuTimerStarted() + { + return cpu_timer_started; + } + + bool getGpuTimerStarted() + { + return gpu_timer_started; + } + // remove copy and move functions PerformanceTimer(const PerformanceTimer&) = delete; PerformanceTimer(PerformanceTimer&&) = delete; diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..fc3b3dc 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,4 +1,5 @@ #include +#include #include "cpu.h" #include "common.h" @@ -18,9 +19,18 @@ namespace StreamCompaction { * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. 
          */
         void scan(int n, int *odata, const int *idata) {
-            timer().startCpuTimer();
-            // TODO
-            timer().endCpuTimer();
+            // only start/stop the timer if it isn't already running
+            // (compactWithScan calls scan inside its own timed region)
+            bool localStartCall = false;
+            if (!timer().getCpuTimerStarted()) {
+                timer().startCpuTimer();
+                localStartCall = true;
+            }
+
+            odata[0] = 0; // identity for exclusive scan
+            for (int i = 1; i < n; i++) {
+                odata[i] = odata[i - 1] + idata[i - 1];
+            }
+
+            if (localStartCall) timer().endCpuTimer();
         }
 
         /**
@@ -30,9 +40,15 @@ namespace StreamCompaction {
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            int numElts = 0;
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0) {
+                    odata[numElts] = idata[i];
+                    numElts++;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return numElts;
         }
 
         /**
@@ -42,9 +58,20 @@ namespace StreamCompaction {
          */
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            int *boolArray = new int[n];
+            int *scanResult = new int[n];
+            int numElts = 0;
+
+            for (int i = 0; i < n; i++) boolArray[i] = (idata[i] != 0) ? 1 : 0;
+            scan(n, scanResult, boolArray);
+            for (int i = 0; i < n; i++) {
+                if (boolArray[i] == 1) {
+                    odata[scanResult[i]] = idata[i];
+                    numElts++;
+                }
+            }
+
+            delete[] boolArray; // avoid leaking the temporaries
+            delete[] scanResult;
             timer().endCpuTimer();
-            return -1;
+            return numElts;
         }
     }
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..7abed9c 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -1,6 +1,7 @@
 #include <cuda.h>
 #include <cuda_runtime.h>
 #include "common.h"
+#include "device_launch_parameters.h"
 #include "efficient.h"
 
 namespace StreamCompaction {
@@ -12,13 +13,68 @@ namespace StreamCompaction {
             return timer;
         }
 
+        __global__ void kernUpSweep(int pow2, int depth, int *dev_data) {
+            int index = threadIdx.x + (blockDim.x * blockIdx.x);
+            if (index >= pow2) return;
+
+            int currSpacing = 1 << (depth + 1);
+            if (index % currSpacing == 0) {
+                dev_data[index + currSpacing - 1] += dev_data[index + (currSpacing >> 1) - 1];
+            }
+        }
+
+        __global__ void kernDownSweep(int pow2, int depth, int *dev_data) {
+            int index = threadIdx.x + (blockDim.x * blockIdx.x);
+            if (index >= pow2) return;
+
+            int currSpacing = 1 << (depth + 1);
+            int halfSpacing = 1 << depth;
+            if (index % currSpacing == 0) {
+                int temp = dev_data[index + halfSpacing - 1];
+                dev_data[index + halfSpacing - 1] = dev_data[index + currSpacing - 1];
+                dev_data[index + currSpacing - 1] += temp;
+            }
+        }
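+
+        // Note: every pass launches one thread per element of the padded
+        // array, but only threads with index % currSpacing == 0 do any work,
+        // so most of each warp idles at deeper levels. Remapping indices so
+        // that only the active tree nodes are launched is a possible
+        // follow-up optimization (not implemented here).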
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+            int maxDepth = ilog2ceil(n);
+            int pow2 = 1 << maxDepth; // pad up to the next power of 2
+
+            int *dev_data;
+            cudaMalloc((void**)&dev_data, pow2 * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc failed on dev_data!");
+
+            // the padded tail [n, pow2) is left uninitialized; it only
+            // influences results past index n, which are never copied back
+            cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            // only start/stop the timer if compact() isn't already timing
+            bool startedLocal = false;
+            if (!timer().getGpuTimerStarted()) {
+                timer().startGpuTimer();
+                startedLocal = true;
+            }
+
+            dim3 fullBlocksPerGrid((pow2 + blockSize - 1) / blockSize);
+            for (int d = 0; d <= maxDepth - 1; d++) {
+                kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(pow2, d, dev_data);
+            }
+
+            cudaMemset(&dev_data[pow2 - 1], 0, sizeof(int)); // zero the root before the down-sweep
+            for (int d = maxDepth - 1; d >= 0; d--) {
+                kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(pow2, d, dev_data);
+            }
+
+            if (startedLocal) {
+                timer().endGpuTimer();
+            }
+
+            cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_data);
+            checkCUDAErrorFn("cudaFree failed on dev_data!");
         }
 
         /**
@@ -31,10 +87,57 @@ namespace StreamCompaction {
          * @returns The number of elements remaining after compaction.
          */
         int compact(int n, int *odata, const int *idata) {
+            int* host_bools = new int[n];
+            int out = 0;
+
+            int* dev_odata;
+            int* dev_idata;
+            int* dev_bools;
+            int* dev_indices;
+
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc failed on dev_odata!");
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc failed on dev_idata!");
+            cudaMalloc((void**)&dev_bools, n * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc failed on dev_bools!");
+            cudaMalloc((void**)&dev_indices, n * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc failed on dev_indices!");
+
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
             timer().startGpuTimer();
-            // TODO
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+            Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bools, dev_idata);
+
+            // scan() is handed device pointers here; its timer guard keeps it
+            // from restarting the GPU timer that is already running
+            scan(n, dev_indices, dev_bools);
+
+            Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);
             timer().endGpuTimer();
-            return -1;
+
+            cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+            cudaMemcpy(host_bools, dev_bools, n * sizeof(int), cudaMemcpyDeviceToHost);
+            for (int i = 0; i < n; i++) {
+                if (host_bools[i] == 1) {
+                    out++;
+                }
+            }
+
+            delete[] host_bools; // avoid leaking the host-side copy
+            cudaFree(dev_odata);
+            checkCUDAErrorFn("cudaFree failed on dev_odata!");
+            cudaFree(dev_idata);
+            checkCUDAErrorFn("cudaFree failed on dev_idata!");
+            cudaFree(dev_bools);
+            checkCUDAErrorFn("cudaFree failed on dev_bools!");
+            cudaFree(dev_indices);
+            checkCUDAErrorFn("cudaFree failed on dev_indices!");
+
+            return out;
         }
     }
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..ebc2ec4 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -1,6 +1,7 @@
 #include <cuda.h>
 #include <cuda_runtime.h>
 #include "common.h"
+#include "device_launch_parameters.h"
 #include "naive.h"
 
 namespace StreamCompaction {
@@ -11,15 +12,51 @@ namespace StreamCompaction {
             static PerformanceTimer timer;
             return timer;
         }
-        // TODO: __global__
+
+        __global__ void kernScanNaive(int n, int depth, const int *dev_src, int *dev_dest) {
+            int index = threadIdx.x + (blockDim.x * blockIdx.x);
+            if (index >= n) return;
+
+            int depthPow = 1 << (depth - 1);
+            if (index >= depthPow) { // this element has a partner at this depth: add it in
+                dev_dest[index] = dev_src[index - depthPow] + dev_src[index];
+            }
+            else { // no partner at this depth: carry the value forward
+                dev_dest[index] = dev_src[index];
+            }
+        }
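+
+        // kernScanNaive reads dev_src and writes dev_dest, so scan() below
+        // ping-pongs two buffers between passes: updating in place would let
+        // one thread overwrite a value that another thread still needs.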
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+            int* dev_A;
+            int* dev_B;
+            cudaMalloc((void**)&dev_A, n * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc failed on dev_A!");
+            cudaMalloc((void**)&dev_B, n * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc failed on dev_B!");
+
+            cudaMemcpy(dev_A, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            cudaMemcpy(dev_B, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
             timer().startGpuTimer();
-            // TODO
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+            int maxDepth = ilog2ceil(n);
+            for (int d = 1; d <= maxDepth; d++) {
+                kernScanNaive<<<fullBlocksPerGrid, blockSize>>>(n, d, dev_A, dev_B);
+                std::swap(dev_A, dev_B); // ping-pong: the latest result ends up in dev_A
+            }
             timer().endGpuTimer();
+
+            cudaMemcpy(odata + 1, dev_A, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost); // copy over with a one-slot shift to make the inclusive result exclusive
+            odata[0] = 0; // set the identity
+
+            cudaFree(dev_A);
+            checkCUDAErrorFn("cudaFree failed on dev_A!");
+            cudaFree(dev_B);
+            checkCUDAErrorFn("cudaFree failed on dev_B!");
        }
    }
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..c07c32c 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,30 @@ namespace StreamCompaction {
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+            // stage the arrays through Thrust host/device vectors
+            thrust::host_vector<int> host_thrust_odata(odata, odata + n);
+            thrust::device_vector<int> dev_thrust_odata = host_thrust_odata;
+            thrust::host_vector<int> host_thrust_idata(idata, idata + n);
+            thrust::device_vector<int> dev_thrust_idata = host_thrust_idata;
+
             timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dev_thrust_idata.begin(), dev_thrust_idata.end(), dev_thrust_odata.begin());
             timer().endGpuTimer();
+
+            host_thrust_odata = dev_thrust_odata; // bring the result back to the host
+            thrust::copy(host_thrust_odata.begin(), host_thrust_odata.end(), odata);
         }
     }
 }