diff --git a/README.md b/README.md
index 4535eea..f1a21c0 100644
--- a/README.md
+++ b/README.md
@@ -3,220 +3,70 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
-
-### (TODO: Your README)
-
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
-
-Instructions (delete me)
-========================
-
-This is due Sunday, September 13 at midnight.
-
-**Summary:** In this project, you'll implement GPU stream compaction in CUDA,
-from scratch. This algorithm is widely used, and will be important for
-accelerating your path tracer project.
-
-Your stream compaction implementations in this project will simply remove `0`s
-from an array of `int`s. In the path tracer, you will remove terminated paths
-from an array of rays.
-
-In addition to being useful for your path tracer, this project is meant to
-reorient your algorithmic thinking to the way of the GPU. On GPUs, many
-algorithms can benefit from massive parallelism and, in particular, data
-parallelism: executing the same code many times simultaneously with different
-data.
-
-You'll implement a few different versions of the *Scan* (*Prefix Sum*)
-algorithm. First, you'll implement a CPU version of the algorithm to reinforce
-your understanding. Then, you'll write a few GPU implementations: "naive" and
-"work-efficient." Finally, you'll use some of these to implement GPU stream
-compaction.
-
-**Algorithm overview & details:** There are two primary references for details
-on the implementation of scan and stream compaction.
-
-* The [slides on Parallel Algorithms](https://github.com/CIS565-Fall-2015/cis565-fall-2015.github.io/raw/master/lectures/2-Parallel-Algorithms.pptx)
-  for Scan, Stream Compaction, and Work-Efficient Parallel Scan.
-* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html).
-
-Your GPU stream compaction implementation will live inside of the
-`stream_compaction` subproject. This way, you will be able to easily copy it
-over for use in your GPU path tracer.
-
-
-## Part 0: The Usual
-
-This project (and all other CUDA projects in this course) requires an NVIDIA
-graphics card with CUDA capability. Any card with Compute Capability 2.0
-(`sm_20`) or greater will work. Check your GPU on this
-[compatibility table](https://developer.nvidia.com/cuda-gpus).
-If you do not have a personal machine with these specs, you may use those
-computers in the Moore 100B/C which have supported GPUs.
-
-**HOWEVER**: If you need to use the lab computer for your development, you will
-not presently be able to do GPU performance profiling. This will be very
-important for debugging performance bottlenecks in your program.
-
-### Useful existing code
-
-* `stream_compaction/common.h`
-  * `checkCUDAError` macro: checks for CUDA errors and exits if there were any.
-  * `ilog2ceil(x)`: computes the ceiling of log2(x), as an integer.
-* `main.cpp`
-  * Some testing code for your implementations.
-
-**Note 1:** The tests will simply compare against your CPU implementation
-Do it first!
-
-**Note 2:** The tests default to an array of size 256.
-Test with something larger (10,000?), too!
-
-
-## Part 1: CPU Scan & Stream Compaction
-
-This stream compaction method will remove `0`s from an array of `int`s.
-
-Do this first, and double check the output! It will be used as the expected
-value for the other tests.
-
-In `stream_compaction/cpu.cu`, implement:
-
-* `StreamCompaction::CPU::scan`: compute an exclusive prefix sum.
-* `StreamCompaction::CPU::compactWithoutScan`: stream compaction without using
-  the `scan` function.
-* `StreamCompaction::CPU::compactWithScan`: stream compaction using the `scan`
-  function. Map the input array to an array of 0s and 1s, scan it, and use
-  scatter to produce the output. You will need a **CPU** scatter implementation
-  for this (see slides or GPU Gems chapter for an explanation).
-
-These implementations should only be a few lines long.
-
-
-## Part 2: Naive GPU Scan Algorithm
-
-In `stream_compaction/naive.cu`, implement `StreamCompaction::Naive::scan`
-
-This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet
-taught shared memory, and you **shouldn't use it yet**. Example 39-1 uses
-shared memory, but is limited to operating on very small arrays! Instead, write
-this using global memory only. As a result of this, you will have to do
-`ilog2ceil(n)` separate kernel invocations.
-
-Beware of errors in Example 39-1 in the book; both the pseudocode and the CUDA
-code in the online version of Chapter 39 are known to have a few small errors
-(in superscripting, missing braces, bad indentation, etc.)
-
-Since the parallel scan algorithm operates on a binary tree structure, it works
-best with arrays with power-of-two length. Make sure your implementation works
-on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory
-- your intermediate array sizes will need to be rounded to the next power of
-two.
-
-
-## Part 3: Work-Efficient GPU Scan & Stream Compaction
-
-### 3.1. Scan
-
-In `stream_compaction/efficient.cu`, implement
-`StreamCompaction::Efficient::scan`
-
-All of the text in Part 2 applies.
-
-* This uses the "Work-Efficient" algorithm from GPU Gems 3, Section 39.2.2.
-* Beware of errors in Example 39-2.
-* Test non-power-of-two sized arrays.
-
-### 3.2. Stream Compaction
-
-This stream compaction method will remove `0`s from an array of `int`s.
-
-In `stream_compaction/efficient.cu`, implement
-`StreamCompaction::Efficient::compact`
-
-For compaction, you will also need to implement the scatter algorithm presented
-in the slides and the GPU Gems chapter.
-
-In `stream_compaction/common.cu`, implement these for use in `compact`:
-
-* `StreamCompaction::Common::kernMapToBoolean`
-* `StreamCompaction::Common::kernScatter`
-
-
-## Part 4: Using Thrust's Implementation
-
-In `stream_compaction/thrust.cu`, implement:
-
-* `StreamCompaction::Thrust::scan`
-
-This should be a very short function which wraps a call to the Thrust library
-function `thrust::exclusive_scan(first, last, result)`.
-
-To measure timing, be sure to exclude memory operations by passing
-`exclusive_scan` a `thrust::device_vector` (which is already allocated on the
-GPU).  You can create a `thrust::device_vector` by creating a
-`thrust::host_vector` from the given pointer, then casting it.
-
-
-## Part 5: Radix Sort (Extra Credit) (+10)
-
-Add an additional module to the `stream_compaction` subproject. Implement radix
-sort using one of your scan implementations. Add tests to check its correctness.
-
-
-## Write-up
-
-1. Update all of the TODOs at the top of this README.
-2. Add a description of this project including a list of its features.
-3. Add your performance analysis (see below).
-
-All extra credit features must be documented in your README, explaining its
-value (with performance comparison, if applicable!) and showing an example how
-it works. For radix sort, show how it is called and an example of its output.
-
-Always profile with Release mode builds and run without debugging.
-
-### Questions
-
-* Roughly optimize the block sizes of each of your implementations for minimal
-  run time on your GPU.
-  * (You shouldn't compare unoptimized implementations to each other!)
-
-* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
-  Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
-  (with array size on the independent axis).
-  * You should use CUDA events for timing. Be sure **not** to include any
-    explicit memory operations in your performance measurements, for
-    comparability.
-  * To guess at what might be happening inside the Thrust implementation, take
-    a look at the Nsight timeline for its execution.
-
-* Write a brief explanation of the phenomena you see here.
-  * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
-    it different for each implementation?
-
-* Paste the output of the test program into a triple-backtick block in your
-  README.
-  * If you add your own tests (e.g. for radix sort or to test additional corner
-    cases), be sure to mention it explicitly.
-
-These questions should help guide you in performance analysis on future
-assignments, as well.
-
-## Submit
-
-If you have modified any of the `CMakeLists.txt` files at all (aside from the
-list of `SOURCE_FILES`), you must test that your project can build in Moore
-100B/C. Beware of any build issues discussed on the Google Group.
-
-1. Open a GitHub pull request so that we can see that you have finished.
-   The title should be "Submission: YOUR NAME".
-2. Send an email to the TA (gmail: kainino1+cis565@) with:
-   * **Subject**: in the form of `[CIS565] Project 2: PENNKEY`
-   * Direct link to your pull request on GitHub
-   * In the form of a grade (0-100+) with comments, evaluate your own
-     performance on the project.
-   * Feedback on the project itself, if any.
+* Ratchpak (Dome) Pongmongkol
+* Tested on: OSX Yosemite 10.10.5, i7 @ 2.4GHz 16GB, GT 650M 1024MB (rMBP Early 2013)
+
+* For block sizing, I implemented a function "findOptimizedSize" in common.h. 
+The strategy is to spread out the thread to several blocks as much as possible (gridDim < 16)
+
+# Analysis
+
+For N = 256, the execution time of each methods are as follows
+Thrust < Naive < Work-Efficient
+Which is quite unexpected at first, as the 'work-efficient' one is supposed to be faster than
+the 'naive' one. 
+
+My speculation is that the 'Work-Efficient' one will only shine when N is larger than the 
+maximum concurrent thread the graphics card can handle (which means, for the naive one, 
+we would need to divide N threads into N/maxThread batches for every step. Meanwhile, for 'Work-Efficient', 
+the number of threads for most step will be comparatively, and substantially, lower than its counterpart.
+
+Also, the current implementation of 'work-efficient' requires a lot of global
+memory access, which substantially build up the access delay. Given that its calculation shows the
+locality property, the speed of the calculation should be substantially lowered if the calculations
+happen on the shared memory instead.
+
+## Example Output
+
+```
+****************
+** SCAN TESTS **
+****************
+    [  38  19  38  37   5  47  15  35   0  12   3   0  42 ...   26   0 ]
+==== cpu scan, power-of-two ====
+    [   0  38  57  95 132 137 184 199 234 234 246 249 249 ... 6203 6229 ]
+==== cpu scan, non-power-of-two ====
+    [   0  38  57  95 132 137 184 199 234 234 246 249 249 ... 6146 6190 ]
+    passed
+==== naive scan, power-of-two ====
+    passed
+==== naive scan, non-power-of-two ====
+    passed
+==== work-efficient scan, power-of-two ====
+    passed
+==== work-efficient scan, non-power-of-two ====
+    passed
+==== thrust scan, power-of-two ====
+    passed
+==== thrust scan, non-power-of-two ====
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   2   3   2   1   3   1   1   1   2   0   1   0   2 ...   0   0 ]
+==== cpu compact without scan, power-of-two ====
+    [   2   3   2   1   3   1   1   1   2   1   2   1   1 ...   2   1 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+    [   2   3   2   1   3   1   1   1   2   1   2   1   1 ...   3   2 ]
+    passed
+==== cpu compact with scan ====
+    [   2   3   2   1   3   1   1   1   2   1   2   1   1 ...   2   1 ]
+    passed
+==== work-efficient compact, power-of-two ====
+    passed
+==== work-efficient compact, non-power-of-two ====
+    passed
+
+```
\ No newline at end of file
diff --git a/src/main.cpp b/src/main.cpp
index 675da35..0ead633 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -6,6 +6,7 @@
  * @copyright University of Pennsylvania
  */
 
+#include <iostream>
 #include <cstdio>
 #include <stream_compaction/cpu.h>
 #include <stream_compaction/naive.h>
@@ -14,7 +15,7 @@
 #include "testing_helpers.hpp"
 
 int main(int argc, char* argv[]) {
-    const int SIZE = 1 << 8;
+    const int SIZE = 1 << 16;
     const int NPOT = SIZE - 3;
     int a[SIZE], b[SIZE], c[SIZE];
 
@@ -120,4 +121,7 @@ int main(int argc, char* argv[]) {
     count = StreamCompaction::Efficient::compact(NPOT, c, a);
     //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
+
+	int exit;
+	std::cin >> exit;
 }
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index fe872d4..9431045 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -1,4 +1,5 @@
 #include "common.h"
+#include "device_launch_parameters.h"
 
 void checkCUDAErrorFn(const char *msg, const char *file, int line) {
     cudaError_t err = cudaGetLastError();
@@ -24,6 +25,13 @@ namespace Common {
  */
 __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
     // TODO
+	int idx = (blockDim.x * blockIdx.x) + threadIdx.x;
+	if (idx > n) return;
+
+	if (idata[idx] == 0)
+		bools[idx] = 0;
+	else
+		bools[idx] = 1;
 }
 
 /**
@@ -33,7 +41,12 @@ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
 __global__ void kernScatter(int n, int *odata,
         const int *idata, const int *bools, const int *indices) {
     // TODO
-}
+	int idx = (blockDim.x * blockIdx.x) + threadIdx.x;
+	if (idx > n) return;
 
+	if (bools[idx] == 1)
+		odata[indices[idx]] = idata[idx];
 }
+
 }
+}
\ No newline at end of file
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index 4f52663..a0b76d6 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -3,6 +3,7 @@
 #include <cstdio>
 #include <cstring>
 #include <cmath>
+#include <cuda_runtime.h>
 
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
@@ -33,3 +34,32 @@ namespace Common {
             const int *idata, const int *bools, const int *indices);
 }
 }
+
+inline void findOptimizedSize(const int n, int& gridDim, int& blockDim){
+	//assuming that cc >= 3.0, max gridDim = 16.
+
+	cudaDeviceProp prop;
+	cudaGetDeviceProperties(&prop, 0);
+
+	if (n > prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor) {
+		blockDim = prop.maxThreadsPerMultiProcessor / 16;
+		gridDim = ceil(n / (float)blockDim);
+		return;
+}	
+
+	int diff;
+	gridDim = 16 * prop.multiProcessorCount;
+	blockDim = ceil(n / (float)gridDim);
+	if (blockDim < 32) {
+		blockDim = 32;
+		gridDim = ceil(n / (float)blockDim);
+	}
+	
+	if (blockDim > prop.maxThreadsPerBlock)
+	{
+		diff = (blockDim - prop.maxThreadsPerBlock) * gridDim;
+		int additionalGrid = diff / prop.maxThreadsPerBlock;
+		gridDim += additionalGrid;
+		blockDim = prop.maxThreadsPerBlock;
+	} 
+}
\ No newline at end of file
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index e600c29..ffdb197 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -1,15 +1,24 @@
 #include <cstdio>
 #include "cpu.h"
+#include <stdlib.h> 
+#include <chrono> 
+#include <iostream>
 
 namespace StreamCompaction {
 namespace CPU {
-
 /**
  * CPU scan (prefix sum).
  */
 void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+	auto begin = std::chrono::high_resolution_clock::now();
+	odata[0] = 0;
+	for (int i = 1; i < n; i++)
+		odata[i] = odata[i - 1] + idata[i - 1];
+
+	auto end = std::chrono::high_resolution_clock::now();
+	auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
+
+	std::cout << "CPU - time : " << duration << " ms" << std::endl;
 }
 
 /**
@@ -19,7 +28,15 @@ void scan(int n, int *odata, const int *idata) {
  */
 int compactWithoutScan(int n, int *odata, const int *idata) {
     // TODO
-    return -1;
+	int elemNo = 0;
+
+	for (int i = 0; i < n; i++)
+		if (idata[i] != 0){
+			odata[elemNo] = idata[i];
+			elemNo++;
+		}
+
+	return elemNo;
 }
 
 /**
@@ -29,7 +46,27 @@ int compactWithoutScan(int n, int *odata, const int *idata) {
  */
 int compactWithScan(int n, int *odata, const int *idata) {
     // TODO
-    return -1;
+	int *tmp = (int*) malloc(sizeof(int) * n);
+
+	//compute temp array
+	for (int i = 0; i < n; i++) {
+		if (idata[i] == 0)
+			tmp[i] = 0;
+		else
+			tmp[i] = 1;
+	}
+
+	//scan
+	int *s = (int*)malloc(sizeof(int) * n);
+	scan(n, s, tmp);
+
+	//scatter
+	for (int i = 0; i < n; i++)
+		if (tmp[i] == 1){
+			odata[s[i]] = idata[i];
+		}
+
+	return s[n-1];
 }
 
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index b2f739b..1b1cf82 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -2,18 +2,152 @@
 #include <cuda_runtime.h>
 #include "common.h"
 #include "efficient.h"
+#include "device_launch_parameters.h"
+
+#include <iostream>
 
 namespace StreamCompaction {
 namespace Efficient {
+	
+__device__  __host__ int pow(int n, int p){
+	int out = 1;
+	for (int i = 0; i < p; i++)
+		out *= n;
+	return out;
+};
+
+__device__  __host__ void swap(int** x, int** y){
+	int *tmp = *x;
+	*x = *y;
+	*y = tmp;
+};
 
-// TODO: __global__
+__global__ void copyModifiedUpSweep(int* out, int* arr, int n, int p){
+	int k = ((blockDim.x * blockIdx.x) + threadIdx.x) * p;
+	if (k >= n) return;
+
+	out[k + p - 1] = arr[k + p - 1];
+}
+
+__global__ void upSweep(int* out, int* arr, int n, int d, int p){
+	int k = ((blockDim.x * blockIdx.x) + threadIdx.x) * p;
+	if (k >= n) return;
+
+	out[k + p - 1] = arr[k + p - 1] + arr[k + pow(2, d) - 1];
+}
+
+__global__ void setValArray(int* arr, int index, int value){
+	arr[index] = value;
+}
+
+__global__ void copyModifiedDownSweep(int* out, int* arr, int n, int d, int p){
+	int k = ((blockDim.x * blockIdx.x) + threadIdx.x) * p;
+	if (k >= n) return;
+
+	int p_d = pow(2, d);
+
+	out[k + p - 1] = arr[k + p - 1];
+	out[k + p_d - 1] = arr[k + p_d - 1];
+}
+
+__global__ void downSweep(int* out, int* arr, int n, int d, int p){
+	int k = ((blockDim.x * blockIdx.x) + threadIdx.x) * p;
+	if (k >= n) return;
+
+	int p_d = pow(2, d);
+
+	out[k + p - 1] = arr[k + p - 1] + arr[k + p_d - 1];
+	out[k + p_d - 1] = arr[k + p - 1];
+}
+
+__global__ void setExclusive(int* out, int* in, int n){
+	int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+	if (k >= n) return;
+
+	out[k] = k > 0 ? in[k - 1] : 0;
+}
+
+
+__global__ void shiftBit(int* out, int* in){
+	int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+	out[k] = k > 0 ? in[k - 1] : 0;
+}
 
 /**
  * Performs prefix-sum (aka scan) on idata, storing the result into odata.
  */
 void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+	int log2ceil = ilog2ceil(n);
+	int log2 = ilog2(n);
+	int n_p2 = pow(2, log2ceil);
+
+	int size = sizeof(int) * n;
+	int size_n_p2 = sizeof(int) * n_p2;
+
+	int *d_idata, *d_odata;
+	cudaMalloc(&d_idata, size_n_p2);
+	cudaMalloc(&d_odata, size_n_p2);
+	cudaMemcpy(d_idata, idata, size, cudaMemcpyHostToDevice);
+	cudaMemset(&d_idata[n], 0, size_n_p2 - size);
+
+	int blockDim, gridDim;
+	findOptimizedSize(n_p2, gridDim, blockDim);
+	//std::cout << "blockDim : " << blockDim << std::endl;
+
+	//////////////////////////
+	float time = 0;
+	cudaEvent_t beginEvent_acc;
+	cudaEvent_t endEvent_acc;
+
+	cudaEventCreate(&beginEvent_acc);
+	cudaEventCreate(&endEvent_acc);
+	cudaEventRecord(beginEvent_acc, 0);
+
+	for (int d = 0; d <= log2ceil - 1; d++){
+		int p = pow(2, d + 1);
+		findOptimizedSize(n_p2 / p, gridDim, blockDim);
+		upSweep << < gridDim, blockDim >> >(d_odata, d_idata, n_p2, d, p);
+		copyModifiedUpSweep << < gridDim, blockDim >> >(d_idata, d_odata, n_p2, p);
+	}
+	
+	setValArray << < 1, 1 >> >(d_idata, n_p2 - 1, 0);
+
+	for (int d = log2ceil - 1; d >= 0; d--){
+		int p = pow(2, d + 1);
+		findOptimizedSize(n_p2 / p, gridDim, blockDim);
+		downSweep << < gridDim, blockDim >> >(d_odata, d_idata, n_p2, d, p);
+		copyModifiedDownSweep << < gridDim, blockDim >> >(d_idata, d_odata, n_p2, d, p);
+	}
+
+	cudaEventRecord(endEvent_acc, 0);
+	cudaEventSynchronize(endEvent_acc);
+	cudaEventElapsedTime(&time, beginEvent_acc, endEvent_acc);
+
+	//setExclusive << < blockDim, gridDim >> >(d_odata, d_idata, n);
+
+	std::cout << "Work Efficient - time : " << time << " ms" << std::endl;
+
+	//////////////////////////
+	//test
+	cudaMemcpy(odata, d_idata, size, cudaMemcpyDeviceToHost);
+	/*
+	std::cout << "idata: ";
+	int total = 0;
+	for (int i = 0; i < n; i++) {
+		std::cout << idata[i] << " ";
+		total += idata[i];
+	}
+	std::cout << "Total : " << total << std::endl;
+	std::cout << std::endl;
+	std::cout << std::endl;
+	std::cout << "odata: ";
+	for (int i = 0; i < n; i++) {
+		std::cout << odata[i] << " ";
+	}
+	std::cout << std::endl;
+	*/
+	cudaFree(d_odata);
+	cudaFree(d_idata);
 }
 
 /**
@@ -25,10 +159,48 @@ void scan(int n, int *odata, const int *idata) {
  * @param idata  The array of elements to compact.
  * @returns      The number of elements remaining after compaction.
  */
+
 int compact(int n, int *odata, const int *idata) {
     // TODO
-    return -1;
-}
+	int size = sizeof(int) * n;
 
+	//////////////////////
+
+	int *d_idata;
+	cudaMalloc(&d_idata, size);
+	cudaMemcpy(d_idata, idata, size, cudaMemcpyHostToDevice);
+
+	int* h_indices = (int*)malloc(sizeof(int) * n);
+	int* d_indices;
+	cudaMalloc(&d_indices, size);
+
+	int* h_bools = (int*)malloc(sizeof(int) * n);
+	int* d_bools;
+	cudaMalloc(&d_bools, size);
+
+	int *d_odata;
+	cudaMalloc(&d_odata, size);
+
+	//////////////////////////
+
+	int blockDim, gridDim;
+	findOptimizedSize(n, gridDim, blockDim);
+
+	StreamCompaction::Common::kernMapToBoolean << < gridDim, blockDim >> >(n, d_bools, d_idata);
+	cudaMemcpy(h_bools, d_bools, size, cudaMemcpyDeviceToHost);
+	StreamCompaction::Efficient::scan(n, h_indices, h_bools);
+
+	cudaMemcpy(d_indices, h_indices, size, cudaMemcpyHostToDevice);
+	StreamCompaction::Common::kernScatter << < gridDim, blockDim >> >(n, d_odata, d_idata, d_bools, d_indices);
+
+	cudaMemcpy(odata, d_odata, size, cudaMemcpyDeviceToHost);
+
+	cudaFree(d_idata);
+	cudaFree(d_odata);
+	cudaFree(d_indices);
+	cudaFree(d_bools);
+
+	return h_indices[n - 1] + h_bools[n - 1];
+}
 }
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 3d86b60..e36fcd1 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -2,18 +2,88 @@
 #include <cuda_runtime.h>
 #include "common.h"
 #include "naive.h"
+#include "device_launch_parameters.h"
+
+#include <iostream>
 
 namespace StreamCompaction {
 namespace Naive {
 
-// TODO: __global__
+__device__  __host__ int pow(int n, int p){
+	int out = 1;
+	for (int i = 0; i < p; i++)
+		out *= n;
+	return out;
+};
+
+__device__  __host__ void swap(int** x, int** y){
+	int *tmp = *x;
+	*x = *y;
+	*y = tmp;
+};
+
+__global__ void gpuScan(int* out, int* in, int n, int p){
+	int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+	if (k >= n) return;
+
+	if (k >= p)
+		out[k] = in[k - p] + in[k];
+	else
+		out[k] = in[k];
+}
 
+__global__ void shiftBit(int* out, int* in){
+	int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+	out[k] = k > 0 ? in[k - 1] : 0;
+}
 /**
- * Performs prefix-sum (aka scan) on idata, storing the result into odata.
- */
+* Performs prefix-sum (aka scan) on idata, storing the result into odata.
+*/
 void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+
+	int log2ceil = ilog2ceil(n);
+	int log2 = ilog2(n);
+	int n_p2 = pow(2, log2ceil);
+
+	int size = sizeof(int) * n;
+	int size_n_p2 = sizeof(int) * n_p2;
+
+	int *d_idata, *d_odata;
+	cudaMalloc(&d_idata, size_n_p2);
+	cudaMalloc(&d_odata, size_n_p2);
+	cudaMemcpy(d_idata, idata, size, cudaMemcpyHostToDevice);
+	cudaMemset(d_idata + size, 0, size_n_p2 - size);
+
+	int blockDim, gridDim;
+	findOptimizedSize(n_p2, gridDim, blockDim);
+	//std::cout << "blockDim : " << blockDim << std::endl;
+
+	//////////////////////////
+	float time = 0;
+	cudaEvent_t beginEvent_acc;
+	cudaEvent_t endEvent_acc;
+
+	cudaEventCreate(&beginEvent_acc);
+	cudaEventCreate(&endEvent_acc);
+	cudaEventRecord(beginEvent_acc, 0);
+
+	for (int d = 1; d <= log2ceil; d++) {
+		int p = pow(2, d - 1);
+		gpuScan << < gridDim, blockDim >> >(d_odata, d_idata, n_p2, p);
+		swap(&d_odata, &d_idata);
+	}
+	shiftBit<< < gridDim, blockDim >> >(d_odata, d_idata);
+
+	cudaEventRecord(endEvent_acc, 0);
+	cudaEventSynchronize(endEvent_acc);
+	cudaEventElapsedTime(&time, beginEvent_acc, endEvent_acc);
+
+	std::cout << "Naive - time : " << time << " ms" << std::endl;
+	//////////////////////////
+	cudaMemcpy(odata, d_odata, size, cudaMemcpyDeviceToHost);
+
+	cudaFree(d_idata);
+	cudaFree(d_odata);
 }
 
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index d8dbb32..efec95a 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -8,15 +8,28 @@
 
 namespace StreamCompaction {
 namespace Thrust {
-
 /**
  * Performs prefix-sum (aka scan) on idata, storing the result into odata.
  */
-void scan(int n, int *odata, const int *idata) {
-    // TODO use `thrust::exclusive_scan`
-    // example: for device_vectors dv_in and dv_out:
-    // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
-}
+	void scan(int n, int *odata, const int *idata) {
+		// TODO use `thrust::exclusive_scan`
+		// example: for device_vectors dv_in and dv_out:
+		// thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+		float time = 0;
+		cudaEvent_t beginEvent_acc;
+		cudaEvent_t endEvent_acc;
+
+		cudaEventCreate(&beginEvent_acc);
+		cudaEventCreate(&endEvent_acc);
+		cudaEventRecord(beginEvent_acc, 0);
+
+		thrust::exclusive_scan(idata, idata + n, odata);
+
+		cudaEventRecord(endEvent_acc, 0);
+		cudaEventSynchronize(endEvent_acc);
+		cudaEventElapsedTime(&time, beginEvent_acc, endEvent_acc);
 
+		std::cout << "thrust - time : " << time << " ms" << std::endl;
+	}
 }
 }