diff --git a/README.md b/README.md
index 110697c..86bb287 100644
--- a/README.md
+++ b/README.md
@@ -3,11 +3,59 @@ CUDA Path Tracer
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 3**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Tabatha Hickman
+  * LinkedIn:https://www.linkedin.com/in/tabatha-hickman-335987140/
+* Tested on: Windows 10 Pro, i7-5600U CPU @ 2.60GHz 16GB, GeForce 840M (personal computer)
 
-### (TODO: Your README)
+## Path Tracer
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+### Main Features:
 
+The following is a rough outline of the rendering process:
+
+* A ray is generated for every pixel in the output image. Its origin is at the camera's position and its direction points toward the position in our scene which will appear at that pixel in the final image.
+* Next, intersection tests are performed to determine if and what object each ray first hits in the scene. Once we know each ray has hit anything, we next need to know what the color of the surface it hits is. This is not as simple as just taking the color of the object. A path tracer determines how to shade a scene by estimating what light from other objects in the scene would have bounced onto this object and what they might contribute to the color at that point.
+* So, for as many times as we want our path estimates to bounce around the scene (aka the depth), we determine intersections of each path with the scene then give each path a new origin at the location of this intersection and a new direction that is determined based on the type of material of the object intersected with and keep track of the colors contributed by those materials along the way.
+* This procedure, from shooting the rays into the scene from the camera to getting final color values for each pixel, is then repeated for some number of iterations. This is done because for each iteration, we can only follow one possible path per pixel when there are infinitely many possible ones. Thus, the more iterations we do and combine the results, the more accurate of a final image we get.
+
+For the first part of this assignment, I worked on simple physically based shading which would accurately accumulate color and redirect rays on each path and added some possible optimizations to the process which utilize the GPU. There were of course GPU kernels for processes like computing intersections and shading for each pixel, but there were some extra optimizations between path bounces and iterations that could make things even more efficient.
+
+**Shading:** Diffuse materials contribute the color of the material to the path and new ray direction is computed by sampling a cosine-weighted hemisphere centered around the intersection point and the intersection normal to the surface. Purely reflective materials contribute the specular color of the material to the path and the new ray direction is computed by reflecting the ray about the intersection surface normal. Here is a Cornell box scene with diffuse walls, ceiling and floor and a reflective sphere in the middle.
+
+![](img/cornellReflect.png)
+
+**Stream Compaction:** We can use stream compaction to remove any rays/paths which have terminated because they intersected with nothing. The ray can no longer bounce around the scene at this point so there's no reason to continue doing empty calculations for further depths. I used thrust::partition to accomplish this. We can see in the chart below that at lower trace depths (less than 20), using stream compaction actually causes render time to increase, probably because the overhead of the procedure is not worth the gain. However, at larger trace depths stream compaction is very useful, which makes sense because the more bounces we have to follow on each path, the more removing as many paths as possible benefits computation time.
+
+![](img/streamCompactChart.JPG)
+
+**Sorting Rays by Material Type:** We know that different types of materials take different amounts of time to compute shading and ray bounces. All the processes in one warp will take as long to compute as the slowest process in the warp, so the thought is that if we make the paths with the same materials contiguous in memory, we can reduce the amount of idle threads in the warps. However, this sorting takes some overhead and to really make this as useful as possible, it would take even more complex sorting. I didn't even put my data for this testing in this readme because the performance with sorting was atrociously worse (almost ten times worse). I sorted the paths by the materialID associated with the last intersection. This is the most simple method but it doesn't account for the fact that warps are in increments of 32 threads so even if they're sorted the likelihood that warps will still have a mix of materials is high. I could sort based on whether the material was refractive, reflective, some combination or none at all, since material IDs are not necessarily sorted in such an order, but I think this would be even more overhead for very little gain.
+
+**Caching the First Bounce:** Another possible optimization has to do with the fact that for every iteration, we should always get the same first set of intersections because we are shooting rays to the same locations in the scene. This means on the first iteration we can cache all the intersections and use this on future iterations, avoiding some amount of computation. In the chart below I measured how the render time changes as the dimension of the output image increases and used a constant trace depth of 8. There is a very small but always positive gain to using the first bounce cache that increases slightly as we increase the image size. This makes sense, as we only save one out of 8 bounce compuations on each iteration and there is some overhead to copying memory into and out of the cache.
+
+![](img/cacheChart.JPG)
+
+### Additional Features
+
+**Refractive Materials:** I added support for refractive materials, both with and without Fresnel effects. For refractive materials, I refract the ray using Snell's Law and use the material's albedo color as a contribution to the path color. If total internal reflection occurs at this point, I instead reflect and use the specular color. For object's which are both reflective and refractive, we can use Fresnel dielectrics. Basically we compute a probability that reflection or refraction happens at this point and use a random number to decide which effect to apply on this bounce. Here is a Cornell box with a refractive sphere using Fresnel dielectrics. Below that is a comparison between a refractive sphere without and with Fresnel, respectively.
+
+![](img/fresnelRefract.png)
+
+![](img/refractComp.png)![](img/fresnelComp.png)
+
+**Antialiasing:** To accomplish this, I simply added a random value between -0.5 and 0.5 to the x and y coordinates of each pixel when the initial rays were being generated. This jitters the ray slightly and over many iterations, it gives us smoother borders between shapes. This doesn't have any performance hit, but it does make it impossible to use the first bounce caching optimization described earlier. Here are two images with and without antialiasing, respectively.
+
+![](img/antiAliasingClose.png)![](img/noAntiAliasingClose.png)
+
+**Direct Lighting:** Direct lighting is a different method of path tracing that is often used in conjunction with what we have done so far to give the best, most accurate appearing renders. In direct lighting, instead of bouncing the paths around the scene some number of times, on the first bounce, we direct the ray toward one of the lights in the scene and shade the pixel based on whether or not it is occluded from that light. This method is much faster because we only require one bounce per pixel. Similarly to regular path tracing, direct lighting benefits a lot from being able to parallelize computations per pixel. However, it cannot support a lot of features like global illumination, reflection, and refraction. Reflective and refractive surfaces appear black because they don't get a chance to interact with other surfaces and gain more information. Here is a cornell box with two different colored lights (red on the left and blue on the right) rendered with direct lighting. Below that are two images of a simple cornell box with a diffuse sphere, the first rendered with direct lighting and the second without.
+
+![](img/twoLights.png)
+
+![](img/dirLight.png)![](img/noDirLight.png)
+
+**Procedural Shapes and Textures:** I created two complex procedural shapes using sine distance functions. One is a warped torus and the other is a hollow shape made with constructive solid geometry operations on a sphere, a box, and three cylinders. For the intersection tests of these shapes I used sphere marching. Given a point on the ray, I calculate the closest distance to any point on the target shape and if that distance is smaller than some epsilon, I'm close enough that I can say this point has intersected with the shape, but otherwise, I move that distance further along the ray. If this process is iterated a few times, you either find that the distances are getting smaller and eventually you reach an intersection, or you find the distances getting larger, meaning the ray did not hit the shape. Rendering these shapes seems to take a performance hit in comparison to the primitive box and sphere (415 ms per iteration for these vs 100 for prims), which makes sense because intersection tests for the primitives don't require looping or any iteration. Also, to calculate the normals of these shapes I need to call the SDF 6 times more, which would definitely affect the performance. Having this on the GPU is definitely a performance benefit, since calculations for each path can be parallelized. Though I'm not sure exactly how, it seems like SDFs should be capabe of more optimization. There's definitely redundant calculation to be evaluating the functions over and over again for each pixel and each iteration of marching, etc. Maybe there could be some discretization of the shape using a grid or something that allows us to eliminate some paths and now the general area others will hit.
+
+![](img/twist.png)![](img/hollowShape.png)
+
+I can also texture these shapes procedurally, using the results of the SDFs to decide if I should use one material or another. In the twist below I used the abs dot product of the surface normal and the z axis to decide if the material at each point should be diffuse white or diffuse blue. In the hollow shape, I used the result of the final subtraction step of my CSG to decide whether or not the point was on the interior or exterior of the shape and color it reflective pink or diffuse white respectively.
+
+![](img/twistCol.png)![](img/hollowShapeCol.png) 
\ No newline at end of file
diff --git a/img/antialiasing.png b/img/antialiasing.png
new file mode 100644
index 0000000..fd43540
Binary files /dev/null and b/img/antialiasing.png differ
diff --git a/img/antialiasingClose.png b/img/antialiasingClose.png
new file mode 100644
index 0000000..fefa0b5
Binary files /dev/null and b/img/antialiasingClose.png differ
diff --git a/img/cacheChart.JPG b/img/cacheChart.JPG
new file mode 100644
index 0000000..9d8da5b
Binary files /dev/null and b/img/cacheChart.JPG differ
diff --git a/img/cornellReflect.png b/img/cornellReflect.png
new file mode 100644
index 0000000..9408cc8
Binary files /dev/null and b/img/cornellReflect.png differ
diff --git a/img/dirLight.png b/img/dirLight.png
new file mode 100644
index 0000000..a7d1843
Binary files /dev/null and b/img/dirLight.png differ
diff --git a/img/fresnelComp.png b/img/fresnelComp.png
new file mode 100644
index 0000000..dcdd19d
Binary files /dev/null and b/img/fresnelComp.png differ
diff --git a/img/fresnelRefract.png b/img/fresnelRefract.png
new file mode 100644
index 0000000..15c799a
Binary files /dev/null and b/img/fresnelRefract.png differ
diff --git a/img/hollowShape.png b/img/hollowShape.png
new file mode 100644
index 0000000..4ff1806
Binary files /dev/null and b/img/hollowShape.png differ
diff --git a/img/hollowShapeCol.png b/img/hollowShapeCol.png
new file mode 100644
index 0000000..f1f1edc
Binary files /dev/null and b/img/hollowShapeCol.png differ
diff --git a/img/noAntialiasing.png b/img/noAntialiasing.png
new file mode 100644
index 0000000..6f3fcfa
Binary files /dev/null and b/img/noAntialiasing.png differ
diff --git a/img/noAntialiasingClose.png b/img/noAntialiasingClose.png
new file mode 100644
index 0000000..ba179f4
Binary files /dev/null and b/img/noAntialiasingClose.png differ
diff --git a/img/noDirLight.png b/img/noDirLight.png
new file mode 100644
index 0000000..1f5ef8c
Binary files /dev/null and b/img/noDirLight.png differ
diff --git a/img/refractComp.png b/img/refractComp.png
new file mode 100644
index 0000000..06a61c4
Binary files /dev/null and b/img/refractComp.png differ
diff --git a/img/streamCompactChart.JPG b/img/streamCompactChart.JPG
new file mode 100644
index 0000000..2f24164
Binary files /dev/null and b/img/streamCompactChart.JPG differ
diff --git a/img/twist.png b/img/twist.png
new file mode 100644
index 0000000..11548f7
Binary files /dev/null and b/img/twist.png differ
diff --git a/img/twistCol.png b/img/twistCol.png
new file mode 100644
index 0000000..cf5fa48
Binary files /dev/null and b/img/twistCol.png differ
diff --git a/img/twoLights.png b/img/twoLights.png
new file mode 100644
index 0000000..38d992a
Binary files /dev/null and b/img/twoLights.png differ
diff --git a/scenes/cornell.txt b/scenes/cornell.txt
index 83ff820..3141889 100644
--- a/scenes/cornell.txt
+++ b/scenes/cornell.txt
@@ -112,6 +112,6 @@ SCALE       .01 10 10
 OBJECT 6
 sphere
 material 4
-TRANS       -1 4 -1
+TRANS       -2 4 -2
 ROTAT       0 0 0
 SCALE       3 3 3
diff --git a/scenes/cornellRefract.txt b/scenes/cornellRefract.txt
new file mode 100644
index 0000000..ca5c6aa
--- /dev/null
+++ b/scenes/cornellRefract.txt
@@ -0,0 +1,117 @@
+// Emissive material (light)
+MATERIAL 0
+RGB         1 1 1
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   10
+
+// Diffuse white
+MATERIAL 1
+RGB         .98 .98 .98
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Diffuse red
+MATERIAL 2
+RGB         .85 .35 .35
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Diffuse green
+MATERIAL 3
+RGB         .35 .85 .35
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Yellow Glass
+MATERIAL 4
+RGB         .9 .9 .8
+SPECEX      0
+SPECRGB     .98 .98 .98
+REFL        1
+REFR        1
+REFRIOR     1.33
+EMITTANCE   0
+
+// Camera
+CAMERA
+RES         800 800
+FOVY        45
+ITERATIONS  5000
+DEPTH       10
+FILE        cornell
+EYE         0.0 5 10.5
+LOOKAT      0 5 0
+UP          0 1 0
+
+
+// Ceiling light
+OBJECT 0
+cube
+material 0
+TRANS       0 10 0
+ROTAT       0 0 0
+SCALE       3 .3 3
+
+// Floor
+OBJECT 1
+cube
+material 1
+TRANS       0 0 0
+ROTAT       0 0 0
+SCALE       10 .01 10
+
+// Ceiling
+OBJECT 2
+cube
+material 1
+TRANS       0 10 0
+ROTAT       0 0 90
+SCALE       .01 10 10
+
+// Back wall
+OBJECT 3
+cube
+material 1
+TRANS       0 5 -5
+ROTAT       0 90 0
+SCALE       .01 10 10
+
+// Left wall
+OBJECT 4
+cube
+material 2
+TRANS       -5 5 0
+ROTAT       0 0 0
+SCALE       .01 10 10
+
+// Right wall
+OBJECT 5
+cube
+material 3
+TRANS       5 5 0
+ROTAT       0 0 0
+SCALE       .01 10 10
+
+// Sphere
+OBJECT 6
+sphere
+material 4
+TRANS       1 5 0
+ROTAT       0 0 0
+SCALE       4 4 4
diff --git a/scenes/cornellTwoLights.txt b/scenes/cornellTwoLights.txt
new file mode 100644
index 0000000..ad91d3a
--- /dev/null
+++ b/scenes/cornellTwoLights.txt
@@ -0,0 +1,125 @@
+// Emissive material (light1)
+MATERIAL 0
+RGB         .8 .8 1
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   2
+
+// Emissive material (light2)
+MATERIAL 1
+RGB         1 .8 .8
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   2
+
+// Diffuse white
+MATERIAL 2
+RGB         .98 .98 .98
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Diffuse red
+MATERIAL 3
+RGB         .85 .35 .35
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Diffuse green
+MATERIAL 4
+RGB         .35 .85 .35
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Camera
+CAMERA
+RES         800 800
+FOVY        45
+ITERATIONS  5000
+DEPTH       8
+FILE        cornell
+EYE         0.0 5 10.5
+LOOKAT      0 5 0
+UP          0 1 0
+
+
+// Ceiling light 1
+OBJECT 0
+cube
+material 0
+TRANS       3 10 0
+ROTAT       0 0 0
+SCALE       3 .3 3
+
+// Ceiling light 2
+OBJECT 1
+cube
+material 1
+TRANS       -3 10 0
+ROTAT       0 0 0
+SCALE       3 .3 3
+
+// Floor
+OBJECT 2
+cube
+material 2
+TRANS       0 0 0
+ROTAT       0 0 0
+SCALE       10 .01 10
+
+// Ceiling
+OBJECT 3
+cube
+material 2
+TRANS       0 10 0
+ROTAT       0 0 90
+SCALE       .01 10 10
+
+// Back wall
+OBJECT 4
+cube
+material 2
+TRANS       0 5 -5
+ROTAT       0 90 0
+SCALE       .01 10 10
+
+// Left wall
+OBJECT 5
+cube
+material 3
+TRANS       -5 5 0
+ROTAT       0 0 0
+SCALE       .01 10 10
+
+// Right wall
+OBJECT 6
+cube
+material 4
+TRANS       5 5 0
+ROTAT       0 0 0
+SCALE       .01 10 10
+
+// Sphere
+OBJECT 7
+sphere
+material 2
+TRANS       0 4 0
+ROTAT       0 0 0
+SCALE       3 3 3
diff --git a/scenes/hollowShape.txt b/scenes/hollowShape.txt
new file mode 100644
index 0000000..d0d74da
--- /dev/null
+++ b/scenes/hollowShape.txt
@@ -0,0 +1,127 @@
+// Emissive material (light)
+MATERIAL 0
+RGB         1 1 1
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   5
+
+// Diffuse white
+MATERIAL 1
+RGB         .98 .98 .98
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Diffuse red
+MATERIAL 2
+RGB         .85 .35 .35
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Diffuse green
+MATERIAL 3
+RGB         .35 .85 .35
+SPECEX      0
+SPECRGB     0 0 0
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Specular white
+MATERIAL 4
+RGB         .98 .98 .98
+SPECEX      0
+SPECRGB     .98 .98 .98
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Diffuse blue
+MATERIAL 5
+RGB         .1 .1 .9
+SPECEX      0
+SPECRGB     .9 .1 .9
+REFL        0
+REFR        0
+REFRIOR     0
+EMITTANCE   0
+
+// Camera
+CAMERA
+RES         800 800
+FOVY        45
+ITERATIONS  5000
+DEPTH       8
+FILE        cornell
+EYE         0.0 5 10.5
+LOOKAT      0 5 0
+UP          0 1 0
+
+
+// Ceiling light
+OBJECT 0
+cube
+material 0
+TRANS       0 10 0
+ROTAT       0 0 0
+SCALE       3 .3 3
+
+// Floor
+OBJECT 1
+cube
+material 1
+TRANS       0 0 0
+ROTAT       0 0 0
+SCALE       10 .01 10
+
+// Ceiling
+OBJECT 2
+cube
+material 1
+TRANS       0 10 0
+ROTAT       0 0 90
+SCALE       .01 10 10
+
+// Back wall
+OBJECT 3
+cube
+material 1
+TRANS       0 5 -5
+ROTAT       0 90 0
+SCALE       .01 10 10
+
+// Left wall
+OBJECT 4
+cube
+material 2
+TRANS       -5 5 0
+ROTAT       0 0 0
+SCALE       .01 10 10
+
+// Right wall
+OBJECT 5
+cube
+material 3
+TRANS       5 5 0
+ROTAT       0 0 0
+SCALE       .01 10 10
+
+// Sphere
+OBJECT 6
+twist
+material 4
+TRANS       0 4 0
+ROTAT       0 45 0
+SCALE       5 5 5
diff --git a/src/interactions.h b/src/interactions.h
index 5ce3628..a0614d0 100644
--- a/src/interactions.h
+++ b/src/interactions.h
@@ -2,6 +2,7 @@
 
 #include "intersections.h"
 
+
 // CHECKITOUT
 /**
  * Computes a cosine-weighted random direction in a hemisphere.
@@ -41,6 +42,103 @@ glm::vec3 calculateRandomDirectionInHemisphere(
         + sin(around) * over * perpendicularDirection2;
 }
 
+__host__ __device__
+glm::vec3 refract(const glm::vec3 &rayDir, const glm::vec3 &normal, const float iOfR,
+    glm::vec3 diffCol, glm::vec3 specCol, glm::vec3 &col)
+{    
+    float eta;
+    glm::vec3 norm;
+
+    // if the dot between the last intersection and the normal of this intersection is negative, 
+    // we should be entering a transmissive surface, otherwise we are exiting it
+    if (glm::dot(-rayDir, normal) <= 0)
+    {
+        // leaving object
+        eta = iOfR;
+        norm = -normal;
+    }
+    else
+    {
+        // entering object
+        eta = 1.0f / iOfR;
+        norm = normal;
+    }
+
+    // Compute cos theta using Snell's law
+    float cosThetaI = glm::dot(norm, -rayDir);
+    float sin2ThetaI = glm::max(float(0), float(1 - cosThetaI * cosThetaI));
+    float sin2ThetaT = eta * eta * sin2ThetaI;
+    
+
+    // Handle total internal reflection for transmission
+    if (sin2ThetaT >= 1)
+    {
+        col *= specCol;
+        return glm::reflect(rayDir, norm);
+    }
+
+    col *= diffCol;
+    float cosThetaT = glm::sqrt(1 - sin2ThetaT);
+    return eta * rayDir + (eta * cosThetaI - cosThetaT) * glm::vec3(norm);
+    //return glm::refract(rayDir, norm, eta);
+}
+
+__host__ __device__
+glm::vec3 fresnel(const glm::vec3 &rayDir, const glm::vec3 &normal, const float iOfR, 
+    glm::vec3 diffCol, glm::vec3 specCol, glm::vec3 &col,
+    thrust::default_random_engine &rng)
+{   
+    float ei, et;
+
+    // if the dot between the last intersection and the normal of this intersection is negative, 
+    // we should be entering a transmissive surface, otherwise we are exiting it
+    if (glm::dot(rayDir, normal) <= 0)
+    {
+        // leaving object
+        ei = iOfR;
+        et = 1.0f;
+    }
+    else
+    {
+        // entering object
+        ei = iOfR;
+        et = 1.0f;
+    }
+    
+    // Compute cos theta using Snell's law
+    float cosThetaI = glm::clamp(-1.f, 1.f, glm::dot(rayDir, normal));
+    float sinThetaI = glm::sqrt(glm::max(float(0), float(1 - cosThetaI * cosThetaI)));
+    float sinThetaT = ei/et * sinThetaI;
+
+    // Handle total internal reflection for transmission
+    if (sinThetaT >= 1)
+    {
+        col *= specCol;
+        return glm::reflect(rayDir, normal);
+    }
+
+    float cosThetaT = glm::sqrt(glm::max(float(0), float(1 - sinThetaT * sinThetaT)));
+    cosThetaI = glm::abs(cosThetaI);
+
+    //find rparallel and rperpendicular
+    float rPar = ((et*cosThetaI) - (ei*cosThetaT)) / ((et*cosThetaI) + (ei*cosThetaT));
+    float rPerp = ((ei*cosThetaI) - (et*cosThetaT)) / ((ei*cosThetaI) + (et*cosThetaT));
+
+    float reflProb = (rPar*rPar + rPerp*rPerp) / 2.0f;
+    thrust::uniform_real_distribution<float> u01(0, 1);
+
+    if (u01(rng) < reflProb)
+    {
+        col *= specCol;
+        return glm::reflect(rayDir, normal);
+    }
+    else
+    {
+        return refract(rayDir, normal, iOfR, diffCol, specCol, col);
+    }
+}
+
+
 /**
  * Scatter a ray with some probabilities according to the material properties.
  * For example, a diffuse surface scatters in a cosine-weighted hemisphere.
@@ -68,7 +166,7 @@ glm::vec3 calculateRandomDirectionInHemisphere(
  */
 __host__ __device__
 void scatterRay(
-		PathSegment & pathSegment,
+		PathSegment &pathSegment,
         glm::vec3 intersect,
         glm::vec3 normal,
         const Material &m,
@@ -76,4 +174,60 @@ void scatterRay(
     // TODO: implement this.
     // A basic implementation of pure-diffuse shading will just call the
     // calculateRandomDirectionInHemisphere defined above.
+    glm::vec3 rayDir = glm::normalize(pathSegment.ray.direction);
+    normal = glm::normalize(normal);
+
+    if (m.hasReflective && m.hasRefractive)
+    {
+        // Fresnel dielectric
+        pathSegment.ray.direction = fresnel(rayDir, normal, m.indexOfRefraction,
+            m.color, m.specular.color, pathSegment.color, rng);
+    }
+    else if (m.hasRefractive)
+    {
+        pathSegment.ray.direction = refract(rayDir, normal, m.indexOfRefraction,
+            m.color, m.specular.color, pathSegment.color);
+    }
+    else if (m.hasReflective)
+    {
+        pathSegment.ray.direction = glm::reflect(rayDir, normal);
+        pathSegment.color *= m.specular.color;
+    }
+    else
+    {
+        // Material must be simple diffuse
+        pathSegment.ray.direction = calculateRandomDirectionInHemisphere(normal, rng);
+        pathSegment.color *= m.color;
+    }
+
+    pathSegment.ray.origin = intersect + (.0005f * glm::normalize(pathSegment.ray.direction));
+}
+
+ __device__
+Geom* directRayToLight(
+    PathSegment &pathSegment,
+    glm::vec3 intersect,
+    Geom *lights,
+    int numLights,
+    thrust::default_random_engine &rng) 
+{
+    // choose a random light
+    thrust::uniform_real_distribution<float> lightsDist(0, numLights);
+    int L = floor(lightsDist(rng));
+    Geom *chosenLight = lights + L;
+
+    // sample a point on the area light
+    thrust::uniform_real_distribution<float> u01(0, 1);
+    glm::vec3 p(u01(rng) - 0.5f, 0.0f, u01(rng) - 0.5f);
+    // transform the point from object to world space
+    p = glm::vec3(chosenLight->transform * glm::vec4(p, 1.0f));
+    // set new ray direction
+    pathSegment.ray.direction = glm::normalize(p - intersect);
+
+    // counterbalance the illumination from multiple lights by multiplying the
+    // contribution of the light by the probability it was chosen
+    pathSegment.color /= numLights;
+
+    pathSegment.ray.origin = intersect + (.0005f * glm::normalize(pathSegment.ray.direction));
+    return chosenLight;
 }
diff --git a/src/intersections.h b/src/intersections.h
index 6f23872..5d04dc9 100644
--- a/src/intersections.h
+++ b/src/intersections.h
@@ -142,3 +142,190 @@ __host__ __device__ float sphereIntersectionTest(Geom sphere, Ray r,
 
     return glm::length(r.origin - intersectionPoint);
 }
+
+// SDFs
+// Resources:
+// https://www.iquilezles.org/www/articles/distfunctions/distfunctions.htm
+// http://jamie-wong.com/2016/07/15/ray-marching-signed-distance-functions/
+
+__host__ __device__ glm::mat4 rotateX(float theta) {
+    float c = cos(theta);
+    float s = sin(theta);
+
+    return glm::mat4(
+        glm::vec4(1, 0, 0, 0),
+        glm::vec4(0, c, -s, 0),
+        glm::vec4(0, s, c, 0),
+        glm::vec4(0, 0, 0, 1)
+    );
+}
+
+__host__ __device__ glm::mat4 rotateY(float theta) {
+    float c = cos(theta);
+    float s = sin(theta);
+
+    return glm::mat4(
+        glm::vec4(c, 0, s, 0),
+        glm::vec4(0, 1, 0, 0),
+        glm::vec4(-s, 0, c, 0),
+        glm::vec4(0, 0, 0, 1)
+    );
+}
+
+__host__ __device__ glm::mat4 rotateZ(float theta) {
+    float c = cos(theta);
+    float s = sin(theta);
+
+    return glm::mat4(
+        glm::vec4(c, -s, 0, 0),
+        glm::vec4(s, c, 0, 0),
+        glm::vec4(0, 0, 1, 0),
+        glm::vec4(0, 0, 0, 1)
+    );
+}
+
+__host__ __device__ float sphereSDF(glm::vec3 p) {
+    return glm::length(p) - 0.65;
+}
+
+__host__ __device__ float boxSDF(glm::vec3 p) {
+    glm::vec3 d = glm::abs(p) - glm::vec3(0.5f);
+    return glm::length(glm::max(d, glm::vec3(0.0f)))
+        + min(max(d.x, max(d.y, d.z)), 0.0f);
+}
+
+__host__ __device__ float cylinderSDF(glm::vec3 p) {
+    
+    glm::vec2 d = abs(glm::vec2(glm::length(glm::vec2(p.x, p.z)), p.y)) - glm::vec2(0.35f, 0.75f);
+    return min(max(d.x, d.y), 0.0f) + glm::length(glm::max(d, glm::vec2(0.0f)));
+}
+
+__host__ __device__ float hollowShapeSDF(glm::vec3 p, bool &mat)
+{
+    // union of three cylinders in each axis
+    glm::vec3 pxCyl = glm::vec3((glm::inverse(rotateZ(PI/2.0f)) * glm::vec4(p, 1.0f)));
+    glm::vec3 pzCyl = glm::vec3((glm::inverse(rotateX(PI / 2.0f)) * glm::vec4(p, 1.0f)));
+    float cylPiece = min(cylinderSDF(p), min(cylinderSDF(pxCyl), cylinderSDF(pzCyl)));
+
+    // intersection of cub and sphere
+    float chunkPiece = max(boxSDF(p), sphereSDF(p));
+
+    // setting mat so that we can color the inside and outside of the shape two different materials
+    if (chunkPiece > -cylPiece)
+    {
+        mat = true;
+        return chunkPiece;
+    }
+    else
+    {
+        mat = false;
+        return -cylPiece;
+    }
+    //return max(chunkPiece, -cylPiece);
+}
+
+__host__ __device__ float hollowShapeIntersectionTest(Geom geo, Ray r,
+    glm::vec3 &intersectionPoint, glm::vec3 &normal, bool &outside, bool &mat) {
+       
+    glm::vec3 ro = multiplyMV(geo.inverseTransform, glm::vec4(r.origin, 1.0f));
+    glm::vec3 rd = multiplyMV(geo.inverseTransform, glm::vec4(r.direction, 0.0f));
+    
+    bool junk;
+    // compute t value
+    float t = 0;
+    float dist;
+    outside = hollowShapeSDF(ro, junk) > 0.f;
+    for (int i = 0; i < 100; ++i)
+    {
+        glm::vec3 currPos = rd * t + ro;
+        // find distance to the shape from this point on the ray
+        float dist = outside ? hollowShapeSDF(currPos, mat) : -hollowShapeSDF(currPos, mat);
+        // if we're close enough to the shape we found a good t
+        if (glm::abs(dist) < EPSILON) {
+            break;
+        }
+        // otherwise, we keep marching 
+        t += dist;
+        if (t >= 200) {
+            // we assume this ray didn't hit anything
+            t = -1;
+            break;
+        }
+    }
+
+    glm::vec3 interpt = ro + t * rd;
+    intersectionPoint = multiplyMV(geo.transform, glm::vec4(interpt, 1.0f));
+
+    normal = glm::normalize(glm::vec3(
+        hollowShapeSDF(glm::vec3(interpt.x + EPSILON, interpt.y, interpt.z), junk) -
+        hollowShapeSDF(glm::vec3(interpt.x - EPSILON, interpt.y, interpt.z), junk),
+        hollowShapeSDF(glm::vec3(interpt.x, interpt.y + EPSILON, interpt.z), junk) -
+        hollowShapeSDF(glm::vec3(interpt.x, interpt.y - EPSILON, interpt.z), junk),
+        hollowShapeSDF(glm::vec3(interpt.x, interpt.y, interpt.z + EPSILON), junk) -
+        hollowShapeSDF(glm::vec3(interpt.x, interpt.y, interpt.z - EPSILON), junk)
+    ));
+    if (!outside) normal = -normal;
+    
+    return t;
+}
+
+__host__ __device__ float torusSDF(glm::vec3 p) {
+    glm::vec2 q = glm::vec2(glm::length(glm::vec2(p.x, p.z)) - 0.75, p.y);
+    return glm::length(q) - 0.15;
+}
+
+__host__ __device__ float twist(glm::vec3 p)
+{
+    const float k = 5.0; // or some other amount
+    float c = cos(k*p.y);
+    float s = sin(k*p.y);
+    glm::mat2 m = glm::mat2(c, -s, s, c);
+    glm::vec3 q = glm::vec3(m*glm::vec2(p.x, p.z), p.y);
+    return torusSDF(q);
+}
+
+__host__ __device__ float twistIntersectionTest(Geom geo, Ray r,
+    glm::vec3 &intersectionPoint, glm::vec3 &normal, bool &outside, bool &mat) {
+
+    glm::vec3 ro = multiplyMV(geo.inverseTransform, glm::vec4(r.origin, 1.0f));
+    glm::vec3 rd = multiplyMV(geo.inverseTransform, glm::vec4(r.direction, 0.0f));
+
+    // compute t value
+    float t = 0;
+    float dist;
+    outside = twist(ro) > 0.f;
+    for (int i = 0; i < 100; ++i)
+    {
+        glm::vec3 currPos = rd * t + ro;
+        // find distance to the shape from this point on the ray
+        float dist = outside ? twist(currPos) : -twist(currPos);
+        // if we're close enough to the shape we found a good t
+        if (glm::abs(dist) < EPSILON) {
+            break;
+        }
+        // otherwise, we keep marching 
+        t += dist;
+        if (t >= 200) {
+            // we assume this ray didn't hit anything
+            t = -1;
+            break;
+        }
+    }
+
+    glm::vec3 interpt = ro + t * rd;
+    intersectionPoint = multiplyMV(geo.transform, glm::vec4(interpt, 1.0f));
+
+    normal = glm::normalize(glm::vec3(
+        twist(glm::vec3(interpt.x + EPSILON, interpt.y, interpt.z)) -
+        twist(glm::vec3(interpt.x - EPSILON, interpt.y, interpt.z)),
+        twist(glm::vec3(interpt.x, interpt.y + EPSILON, interpt.z)) -
+        twist(glm::vec3(interpt.x, interpt.y - EPSILON, interpt.z)),
+        twist(glm::vec3(interpt.x, interpt.y, interpt.z + EPSILON)) -
+        twist(glm::vec3(interpt.x, interpt.y, interpt.z - EPSILON))
+    ));
+    if (abs(glm::dot(normal, glm::vec3(0.0f, 0.0f, 1.0f))) < 0.2) mat = false;
+    if (!outside) normal = -normal;
+
+    return t;
+}
+
diff --git a/src/main.cpp b/src/main.cpp
index fe8e85e..3dcdaf6 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -22,6 +22,7 @@ glm::vec3 ogLookAt; // for recentering the camera
 Scene *scene;
 RenderState *renderState;
 int iteration;
+float t;
 
 int width;
 int height;
@@ -101,6 +102,7 @@ void saveImage() {
 void runCuda() {
     if (camchanged) {
         iteration = 0;
+        t = 0.f;
         Camera &cam = renderState->camera;
         cameraPosition.x = zoom * sin(phi) * sin(theta);
         cameraPosition.y = zoom * cos(theta);
@@ -125,6 +127,7 @@ void runCuda() {
     if (iteration == 0) {
         pathtraceFree();
         pathtraceInit(scene);
+        t = 0.f;
     }
 
     if (iteration < renderState->iterations) {
@@ -134,7 +137,7 @@ void runCuda() {
 
         // execute the kernel
         int frame = 0;
-        pathtrace(pbo_dptr, frame, iteration);
+        t += pathtrace(pbo_dptr, frame, iteration);
 
         // unmap buffer object
         cudaGLUnmapBufferObject(pbo);
@@ -144,6 +147,11 @@ void runCuda() {
         cudaDeviceReset();
         exit(EXIT_SUCCESS);
     }
+
+    if (iteration % 10 == 1)
+    {
+        printf("Average up to Iteration %d: %f ms\n", iteration, t/float(iteration));
+    }
 }
 
 void keyCallback(GLFWwindow* window, int key, int scancode, int action, int mods) {
diff --git a/src/pathtrace.cu b/src/pathtrace.cu
index c1ec122..defd58a 100644
--- a/src/pathtrace.cu
+++ b/src/pathtrace.cu
@@ -3,7 +3,9 @@
 #include <cmath>
 #include <thrust/execution_policy.h>
 #include <thrust/random.h>
-#include <thrust/remove.h>
+#include <thrust/device_ptr.h>
+#include <thrust/sort.h>
+#include <thrust/partition.h>
 
 #include "sceneStructs.h"
 #include "scene.h"
@@ -18,6 +20,15 @@
 
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
+
+#define TIME_ITER 1
+
+#define STREAM_COMPACTION 1
+#define SORT_BY_MATERIAL 0
+#define CACHE_FIRST_BOUNCE 0
+
+#define DIRECT_LIGHTING 0
+
 void checkCUDAErrorFn(const char *msg, const char *file, int line) {
 #if ERRORCHECK
     cudaDeviceSynchronize();
@@ -38,6 +49,14 @@ void checkCUDAErrorFn(const char *msg, const char *file, int line) {
 #endif
 }
 
+#if TIME_ITER
+PerformanceTimer& timer()
+{
+    static PerformanceTimer timer;
+    return timer;
+}
+#endif
+
 __host__ __device__
 thrust::default_random_engine makeSeededRandomEngine(int iter, int index, int depth) {
     int h = utilhash((1 << 31) | (depth << 22) | iter) ^ utilhash(index);
@@ -75,6 +94,12 @@ static PathSegment * dev_paths = NULL;
 static ShadeableIntersection * dev_intersections = NULL;
 // TODO: static variables for device memory, any extra info you need, etc
 // ...
+#if CACHE_FIRST_BOUNCE
+static ShadeableIntersection * dev_first_bounce = NULL;
+#endif
+static Geom * dev_lights = NULL;
+static int num_lights = 0;
+static int num_materials = 0;
 
 void pathtraceInit(Scene *scene) {
     hst_scene = scene;
@@ -91,11 +116,31 @@ void pathtraceInit(Scene *scene) {
 
   	cudaMalloc(&dev_materials, scene->materials.size() * sizeof(Material));
   	cudaMemcpy(dev_materials, scene->materials.data(), scene->materials.size() * sizeof(Material), cudaMemcpyHostToDevice);
+    num_materials = scene->materials.size();
 
   	cudaMalloc(&dev_intersections, pixelcount * sizeof(ShadeableIntersection));
   	cudaMemset(dev_intersections, 0, pixelcount * sizeof(ShadeableIntersection));
 
     // TODO: initialize any extra device memeory you need
+    #if CACHE_FIRST_BOUNCE
+    cudaMalloc(&dev_first_bounce, pixelcount * sizeof(ShadeableIntersection));
+    cudaMemset(dev_first_bounce, 0, pixelcount * sizeof(ShadeableIntersection));
+    #endif
+
+    #if DIRECT_LIGHTING
+    std::vector<Geom> lights;
+    for (Geom g : scene->geoms)
+    {
+        if (scene->materials[g.materialid].emittance > 0.0f)
+        {
+            lights.push_back(g);
+        }
+    }
+    num_lights = lights.size();
+
+    cudaMalloc(&dev_lights, num_lights * sizeof(Geom));
+    cudaMemcpy(dev_lights, lights.data(), lights.size() * sizeof(Geom), cudaMemcpyHostToDevice);
+    #endif
 
     checkCUDAError("pathtraceInit");
 }
@@ -107,6 +152,13 @@ void pathtraceFree() {
   	cudaFree(dev_materials);
   	cudaFree(dev_intersections);
     // TODO: clean up any extra device memory you created
+    #if CACHE_FIRST_BOUNCE
+    cudaFree(dev_first_bounce);
+    #endif
+
+    #if DIRECT_LIGHTING
+    cudaFree(dev_lights);
+    #endif
 
     checkCUDAError("pathtraceFree");
 }
@@ -129,20 +181,29 @@ __global__ void generateRayFromCamera(Camera cam, int iter, int traceDepth, Path
 		PathSegment & segment = pathSegments[index];
 
 		segment.ray.origin = cam.position;
-    segment.color = glm::vec3(1.0f, 1.0f, 1.0f);
+        segment.color = glm::vec3(1.0f, 1.0f, 1.0f);
+
+        #if CACHE_FIRST_BOUNCE
+        segment.ray.direction = glm::normalize(cam.view
+            - cam.right * cam.pixelLength.x * ((float)x - (float)cam.resolution.x * 0.5f)
+            - cam.up * cam.pixelLength.y * ((float)y - (float)cam.resolution.y * 0.5f)
+        );
+        #else
+        // Antialiasing by jittering the ray
+        thrust::default_random_engine rng = makeSeededRandomEngine(iter, x + y * cam.resolution.x, 0);
+        thrust::uniform_real_distribution<float> jitter(-0.5, 0.5);
 
-		// TODO: implement antialiasing by jittering the ray
 		segment.ray.direction = glm::normalize(cam.view
-			- cam.right * cam.pixelLength.x * ((float)x - (float)cam.resolution.x * 0.5f)
-			- cam.up * cam.pixelLength.y * ((float)y - (float)cam.resolution.y * 0.5f)
+			- cam.right * cam.pixelLength.x * ((float)x - (float)cam.resolution.x * 0.5f + jitter(rng))
+			- cam.up * cam.pixelLength.y * ((float)y - (float)cam.resolution.y * 0.5f + jitter(rng))
 			);
+        #endif
 
 		segment.pixelIndex = index;
 		segment.remainingBounces = traceDepth;
 	}
 }
 
-// TODO:
 // computeIntersections handles generating ray intersections ONLY.
 // Generating new rays is handled in your shader(s).
 // Feel free to modify the code below.
@@ -153,6 +214,7 @@ __global__ void computeIntersections(
 	, Geom * geoms
 	, int geoms_size
 	, ShadeableIntersection * intersections
+    , int num_materials
 	)
 {
 	int path_index = blockIdx.x * blockDim.x + threadIdx.x;
@@ -172,7 +234,7 @@ __global__ void computeIntersections(
 		glm::vec3 tmp_normal;
 
 		// naive parse through global geoms
-
+        bool changeMat = false;
 		for (int i = 0; i < geoms_size; i++)
 		{
 			Geom & geom = geoms[i];
@@ -186,6 +248,18 @@ __global__ void computeIntersections(
 				t = sphereIntersectionTest(geom, pathSegment.ray, tmp_intersect, tmp_normal, outside);
 			}
 			// TODO: add more intersection tests here... triangle? metaball? CSG?
+            else if (geom.type == HOLLOW)
+            {
+                bool mat = true;
+                t = hollowShapeIntersectionTest(geom, pathSegment.ray, tmp_intersect, tmp_normal, outside, mat);
+                if (!mat) { changeMat = true; }
+            }
+            else if (geom.type == TWIST)
+            {
+                bool mat = true;
+                t = twistIntersectionTest(geom, pathSegment.ray, tmp_intersect, tmp_normal, outside, mat);
+                if (!mat) { changeMat = true; }
+            }
 
 			// Compute the minimum t from the intersection tests to determine what
 			// scene geometry object was hit first.
@@ -207,6 +281,12 @@ __global__ void computeIntersections(
 			//The ray hits something
 			intersections[path_index].t = t_min;
 			intersections[path_index].materialId = geoms[hit_geom_index].materialid;
+            if (changeMat && (geoms[hit_geom_index].type == HOLLOW || geoms[hit_geom_index].type == TWIST))
+            {
+                int mID = geoms[hit_geom_index].materialid + 1;
+                if (mID >= num_materials) { mID = 0; }
+                intersections[path_index].materialId = mID;
+            }
 			intersections[path_index].surfaceNormal = normal;
 		}
 	}
@@ -249,7 +329,6 @@ __global__ void shadeFakeMaterial (
       }
       // Otherwise, do some pseudo-lighting computation. This is actually more
       // like what you would expect from shading in a rasterizer like OpenGL.
-      // TODO: replace this! you should be able to start with basically a one-liner
       else {
         float lightTerm = glm::dot(intersection.surfaceNormal, glm::vec3(0.0f, 1.0f, 0.0f));
         pathSegments[idx].color *= (materialColor * lightTerm) * 0.3f + ((1.0f - intersection.t * 0.02f) * materialColor) * 0.7f;
@@ -265,6 +344,130 @@ __global__ void shadeFakeMaterial (
   }
 }
 
+__global__ void shadeScene(
+    int iter
+    , int num_paths
+    , ShadeableIntersection * shadeableIntersections
+    , PathSegment * pathSegments
+    , Material * materials
+)
+{
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < num_paths)
+    {
+        ShadeableIntersection intersection = shadeableIntersections[idx];
+        if (pathSegments[idx].remainingBounces <= 0) { return; }
+
+        if (intersection.t > 0.0f) { // if the intersection exists...
+            thrust::default_random_engine rng = makeSeededRandomEngine(iter, idx, 0);
+
+            Material material = materials[intersection.materialId];
+            glm::vec3 materialColor = material.color;
+
+            // If the material indicates that the object was a light, "light" the ray
+            if (material.emittance > 0.0f) {
+                pathSegments[idx].color *= (materialColor * material.emittance);
+                pathSegments[idx].remainingBounces = 0;
+            }
+            else {
+                glm::vec3 intersecPoint = getPointOnRay(pathSegments[idx].ray, intersection.t);
+                scatterRay(pathSegments[idx], intersecPoint, intersection.surfaceNormal, material, rng);
+                pathSegments[idx].remainingBounces--;
+            }
+        }
+        else {
+            // If there was no intersection, color the ray black.
+            // Lots of renderers use 4 channel color, RGBA, where A = alpha, often
+            // used for opacity, in which case they can indicate "no opacity".
+            // This can be useful for post-processing and image compositing.
+            pathSegments[idx].color = glm::vec3(0.0f);
+            pathSegments[idx].remainingBounces = 0;
+        }
+    }
+}
+
+__global__ void shadeSceneDirectLighting(
+    int iter
+    , int num_paths
+    , ShadeableIntersection * shadeableIntersections
+    , PathSegment * pathSegments
+    , Material * materials
+    , Geom * lights
+    , int num_lights
+)
+{
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < num_paths)
+    {
+        ShadeableIntersection intersection = shadeableIntersections[idx];
+        if (pathSegments[idx].remainingBounces <= 0) { return; }
+
+        if (intersection.t > 0.0f && num_lights > 0) { // if the intersection exists...
+            thrust::default_random_engine rng = makeSeededRandomEngine(iter, idx, 0);
+
+            Material material = materials[intersection.materialId];
+            glm::vec3 materialColor = material.color;
+
+            Geom* chosenLight = nullptr;
+
+            // If the material indicates that the object was a light, "light" the ray
+            if (material.emittance > 0.0f) {
+                float dir = glm::dot(intersection.surfaceNormal, glm::vec3(0, -1, 0));
+                if (chosenLight)
+                {
+                    if (intersection.materialId == chosenLight->materialid && dir > 0) {
+                        pathSegments[idx].color *= (materialColor * material.emittance);
+                    }
+                    else {
+                        pathSegments[idx].color *= glm::vec3(0.0f);
+                    }
+                }
+                else {
+                    pathSegments[idx].color *= (materialColor * material.emittance);
+                }
+                pathSegments[idx].remainingBounces = 0;
+            }
+            else if (pathSegments[idx].remainingBounces == 1)
+            {
+                // this should be the ray to the light so if we didnt hit the light
+                // it means the pixel is in shadow
+                pathSegments[idx].color *= glm::vec3(0.0f);
+                pathSegments[idx].remainingBounces = 0;
+            }
+            else {
+                glm::vec3 intersecPoint = getPointOnRay(pathSegments[idx].ray, intersection.t);
+
+                // compute f for this material
+                if (material.hasReflective || material.hasRefractive){
+                    // in direct lighting we dont get reflective and refractive properties
+                    pathSegments[idx].color *= glm::vec3(0.0f);
+                }
+                else{
+                    // get the color of the diffuse surface
+                    pathSegments[idx].color *= materialColor;
+                }
+
+                // Sets ray direction toward some position on some light in the scene
+                chosenLight = directRayToLight(pathSegments[idx], intersecPoint, lights, num_lights, rng);
+
+                // add in absDot
+                pathSegments[idx].color *= glm::abs(glm::dot(pathSegments[idx].ray.direction, intersection.surfaceNormal));
+
+                // next ray will be to light
+                pathSegments[idx].remainingBounces = 1;
+            }
+        }
+        else {
+            // If there was no intersection, color the ray black.
+            // Lots of renderers use 4 channel color, RGBA, where A = alpha, often
+            // used for opacity, in which case they can indicate "no opacity".
+            // This can be useful for post-processing and image compositing.
+            pathSegments[idx].color = glm::vec3(0.0f);
+            pathSegments[idx].remainingBounces = 0;
+        }
+    }
+}
+
 // Add the current iteration's output to the overall image
 __global__ void finalGather(int nPaths, glm::vec3 * image, PathSegment * iterationPaths)
 {
@@ -277,11 +480,30 @@ __global__ void finalGather(int nPaths, glm::vec3 * image, PathSegment * iterati
 	}
 }
 
+struct sortByMat
+{
+    __host__ __device__
+    bool operator()(const ShadeableIntersection &i1, 
+            const ShadeableIntersection &i2)
+    {
+        return i1.materialId < i2.materialId;
+    }
+};
+
+struct isTerm
+{
+    __host__ __device__
+    bool operator()(const PathSegment &ps)
+    {
+        return ps.remainingBounces > 0;
+    }
+};
+
 /**
  * Wrapper for the __global__ call that sets up the kernel calls and does a ton
  * of memory management
  */
-void pathtrace(uchar4 *pbo, int frame, int iter) {
+float pathtrace(uchar4 *pbo, int frame, int iter) {
     const int traceDepth = hst_scene->state.traceDepth;
     const Camera &cam = hst_scene->state.camera;
     const int pixelcount = cam.resolution.x * cam.resolution.y;
@@ -295,6 +517,11 @@ void pathtrace(uchar4 *pbo, int frame, int iter) {
 	// 1D block for path tracing
 	const int blockSize1d = 128;
 
+    #if SORT_BY_MATERIAL
+    thrust::device_ptr<PathSegment> dev_thrust_paths(dev_paths);
+    thrust::device_ptr<ShadeableIntersection> dev_thrust_intersections(dev_intersections);
+    #endif
+
     ///////////////////////////////////////////////////////////////////////////
 
     // Recap:
@@ -310,12 +537,12 @@ void pathtrace(uchar4 *pbo, int frame, int iter) {
     //     Currently, intersection distance is recorded as a parametric distance,
     //     t, or a "distance along the ray." t = -1.0 indicates no intersection.
     //     * Color is attenuated (multiplied) by reflections off of any object
-    //   * TODO: Stream compact away all of the terminated paths.
+    //   * Stream compact away all of the terminated paths.
     //     You may use either your implementation or `thrust::remove_if` or its
     //     cousins.
     //     * Note that you can't really use a 2D kernel launch any more - switch
     //       to 1D.
-    //   * TODO: Shade the rays that intersected something or didn't bottom out.
+    //   * Shade the rays that intersected something or didn't bottom out.
     //     That is, color the ray by performing a color computation according
     //     to the shader, then generate a new ray to continue the ray path.
     //     We recommend just updating the ray's PathSegment in place.
@@ -324,7 +551,9 @@ void pathtrace(uchar4 *pbo, int frame, int iter) {
     // * Finally, add this iteration's results to the image. This has been done
     //   for you.
 
-    // TODO: perform one iteration of path tracing
+    #if TIME_ITER
+    timer().startCpuTimer();
+    #endif
 
 	generateRayFromCamera <<<blocksPerGrid2d, blockSize2d >>>(cam, iter, traceDepth, dev_paths);
 	checkCUDAError("generate camera ray");
@@ -339,45 +568,100 @@ void pathtrace(uchar4 *pbo, int frame, int iter) {
   bool iterationComplete = false;
 	while (!iterationComplete) {
 
-	// clean shading chunks
-	cudaMemset(dev_intersections, 0, pixelcount * sizeof(ShadeableIntersection));
-
-	// tracing
-	dim3 numblocksPathSegmentTracing = (num_paths + blockSize1d - 1) / blockSize1d;
-	computeIntersections <<<numblocksPathSegmentTracing, blockSize1d>>> (
-		depth
-		, num_paths
-		, dev_paths
-		, dev_geoms
-		, hst_scene->geoms.size()
-		, dev_intersections
-		);
-	checkCUDAError("trace one bounce");
-	cudaDeviceSynchronize();
-	depth++;
-
-
-	// TODO:
-	// --- Shading Stage ---
-	// Shade path segments based on intersections and generate new rays by
-  // evaluating the BSDF.
-  // Start off with just a big kernel that handles all the different
-  // materials you have in the scenefile.
-  // TODO: compare between directly shading the path segments and shading
-  // path segments that have been reshuffled to be contiguous in memory.
-
-  shadeFakeMaterial<<<numblocksPathSegmentTracing, blockSize1d>>> (
-    iter,
-    num_paths,
-    dev_intersections,
-    dev_paths,
-    dev_materials
-  );
-  iterationComplete = true; // TODO: should be based off stream compaction results.
+	    // clean shading chunks
+	    cudaMemset(dev_intersections, 0, pixelcount * sizeof(ShadeableIntersection));
+
+        // tracing
+        dim3 numblocksPathSegmentTracing = (num_paths + blockSize1d - 1) / blockSize1d;
+        #if CACHE_FIRST_BOUNCE
+        if (iter == 1 || depth > 0)
+        {
+            computeIntersections << <numblocksPathSegmentTracing, blockSize1d >> > (
+                depth
+                , num_paths
+                , dev_paths
+                , dev_geoms
+                , hst_scene->geoms.size()
+                , dev_intersections,
+                num_materials);
+            checkCUDAError("trace one bounce");
+        }
+        else
+        {
+            cudaMemcpy(dev_intersections, dev_first_bounce, num_paths * sizeof(ShadeableIntersection), cudaMemcpyDeviceToDevice);
+        }
+        if (iter == 1 && depth == 0)
+        {
+            cudaMemcpy(dev_first_bounce, dev_intersections, num_paths * sizeof(ShadeableIntersection), cudaMemcpyDeviceToDevice);
+        }
+        #else
+	    computeIntersections <<<numblocksPathSegmentTracing, blockSize1d>>> (
+		    depth
+		    , num_paths
+		    , dev_paths
+		    , dev_geoms
+		    , hst_scene->geoms.size()
+		    , dev_intersections
+            , num_materials);
+	    checkCUDAError("trace one bounce");
+        #endif
+
+        cudaDeviceSynchronize();
+        depth++;
+
+        #if SORT_BY_MATERIAL 
+        // Sort paths and intersections based on their material id to make 
+        // computations with similar conditional branching contiguous in memory
+        thrust::sort_by_key(dev_thrust_intersections, 
+            dev_thrust_intersections + num_paths, 
+            dev_thrust_paths,
+            sortByMat());
+        #endif
+
+        // --- Shading Stage ---
+        // Shade path segments based on intersections and generate new rays by
+        // evaluating the BSDF.
+        // Start off with just a big kernel that handles all the different
+        // materials you have in the scenefile.
+        #if DIRECT_LIGHTING 
+        shadeSceneDirectLighting << <numblocksPathSegmentTracing, blockSize1d >> > (
+            iter,
+            num_paths,
+            dev_intersections,
+            dev_paths,
+            dev_materials,
+            dev_lights,
+            num_lights);
+        checkCUDAError("shadeScene");
+        #else
+        shadeScene<<<numblocksPathSegmentTracing, blockSize1d>>> (
+            iter,
+            num_paths,
+            dev_intersections,
+            dev_paths,
+            dev_materials);
+        checkCUDAError("shadeScene");
+        #endif
+
+        #if STREAM_COMPACTION 
+        // Stream compaction
+        PathSegment* new_end = thrust::partition(thrust::device, dev_paths, dev_paths + num_paths, isTerm());
+        num_paths = new_end - dev_paths;
+
+        iterationComplete = (depth == traceDepth || num_paths == 0);
+        #else   
+        iterationComplete = (depth == traceDepth);
+        #endif
 	}
 
-  // Assemble this iteration and apply it to the image
-  dim3 numBlocksPixels = (pixelcount + blockSize1d - 1) / blockSize1d;
+    #if TIME_ITER
+        timer().endCpuTimer();
+    #endif
+
+    num_paths = dev_path_end - dev_paths;
+
+    // Assemble this iteration and apply it to the image
+    dim3 numBlocksPixels = (pixelcount + blockSize1d - 1) / blockSize1d;
 	finalGather<<<numBlocksPixels, blockSize1d>>>(num_paths, dev_image, dev_paths);
 
     ///////////////////////////////////////////////////////////////////////////
@@ -390,4 +674,9 @@ void pathtrace(uchar4 *pbo, int frame, int iter) {
             pixelcount * sizeof(glm::vec3), cudaMemcpyDeviceToHost);
 
     checkCUDAError("pathtrace");
+    #if TIME_ITER
+        return timer().getCpuElapsedTimeForPreviousOperation();
+    #else
+        return -1.f
+    #endif
 }
diff --git a/src/pathtrace.h b/src/pathtrace.h
index 1241227..ef380c3 100644
--- a/src/pathtrace.h
+++ b/src/pathtrace.h
@@ -5,4 +5,4 @@
 
 void pathtraceInit(Scene *scene);
 void pathtraceFree();
-void pathtrace(uchar4 *pbo, int frame, int iteration);
+float pathtrace(uchar4 *pbo, int frame, int iteration); //returns time
diff --git a/src/scene.cpp b/src/scene.cpp
index cbae043..ed3f1d3 100644
--- a/src/scene.cpp
+++ b/src/scene.cpp
@@ -51,6 +51,12 @@ int Scene::loadGeom(string objectid) {
             } else if (strcmp(line.c_str(), "cube") == 0) {
                 cout << "Creating new cube..." << endl;
                 newGeom.type = CUBE;
+            } else if (strcmp(line.c_str(), "hollowShape") == 0) {
+                cout << "Creating new hollow shape..." << endl;
+                newGeom.type = HOLLOW;
+            } else if (strcmp(line.c_str(), "twist") == 0) {
+                cout << "Creating new twist shape..." << endl;
+                newGeom.type = TWIST;
             }
         }
 
diff --git a/src/sceneStructs.h b/src/sceneStructs.h
index b38b820..6f49eb1 100644
--- a/src/sceneStructs.h
+++ b/src/sceneStructs.h
@@ -10,6 +10,8 @@
 enum GeomType {
     SPHERE,
     CUBE,
+    HOLLOW,
+    TWIST,
 };
 
 struct Ray {
diff --git a/src/utilities.h b/src/utilities.h
index abb4f27..b33e550 100644
--- a/src/utilities.h
+++ b/src/utilities.h
@@ -9,6 +9,10 @@
 #include <string>
 #include <vector>
 
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include <chrono>
+
 #define PI                3.1415926535897932384626422832795028841971f
 #define TWO_PI            6.2831853071795864769252867665590057683943f
 #define SQRT_OF_ONE_THIRD 0.5773502691896257645091487805019574556476f
@@ -24,3 +28,95 @@ namespace utilityCore {
     extern std::string convertIntToString(int number);
     extern std::istream& safeGetline(std::istream& is, std::string& t); //Thanks to http://stackoverflow.com/a/6089413
 }
+
+/**
+* This class is used for timing the performance
+* Uncopyable and unmovable
+*
+* Adapted from WindyDarian(https://github.com/WindyDarian)
+*/
+class PerformanceTimer
+{
+public:
+    PerformanceTimer()
+    {
+        cudaEventCreate(&event_start);
+        cudaEventCreate(&event_end);
+    }
+
+    ~PerformanceTimer()
+    {
+        cudaEventDestroy(event_start);
+        cudaEventDestroy(event_end);
+    }
+
+    void startCpuTimer()
+    {
+        if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
+        cpu_timer_started = true;
+
+        time_start_cpu = std::chrono::high_resolution_clock::now();
+    }
+
+    void endCpuTimer()
+    {
+        time_end_cpu = std::chrono::high_resolution_clock::now();
+
+        if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }
+
+        std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
+        prev_elapsed_time_cpu_milliseconds =
+            static_cast<decltype(prev_elapsed_time_cpu_milliseconds)>(duro.count());
+
+        cpu_timer_started = false;
+    }
+
+    void startGpuTimer()
+    {
+        if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
+        gpu_timer_started = true;
+
+        cudaEventRecord(event_start);
+    }
+
+    void endGpuTimer()
+    {
+        cudaEventRecord(event_end);
+        cudaEventSynchronize(event_end);
+
+        if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }
+
+        cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
+        gpu_timer_started = false;
+    }
+
+    float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
+    {
+        return prev_elapsed_time_cpu_milliseconds;
+    }
+
+    float getGpuElapsedTimeForPreviousOperation() //noexcept
+    {
+        return prev_elapsed_time_gpu_milliseconds;
+    }
+
+    // remove copy and move functions
+    PerformanceTimer(const PerformanceTimer&) = delete;
+    PerformanceTimer(PerformanceTimer&&) = delete;
+    PerformanceTimer& operator=(const PerformanceTimer&) = delete;
+    PerformanceTimer& operator=(PerformanceTimer&&) = delete;
+
+private:
+    cudaEvent_t event_start = nullptr;
+    cudaEvent_t event_end = nullptr;
+
+    using time_point_t = std::chrono::high_resolution_clock::time_point;
+    time_point_t time_start_cpu;
+    time_point_t time_end_cpu;
+
+    bool cpu_timer_started = false;
+    bool gpu_timer_started = false;
+
+    float prev_elapsed_time_cpu_milliseconds = 0.f;
+    float prev_elapsed_time_gpu_milliseconds = 0.f;
+};
diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
index 4538f04..6444fc7 100644
--- a/stream_compaction/CMakeLists.txt
+++ b/stream_compaction/CMakeLists.txt
@@ -1,6 +1,17 @@
 set(SOURCE_FILES
+    "common.h"
+    "common.cu"
+    "cpu.h"
+    "cpu.cu"
+    "naive.h"
+    "naive.cu"
+    "efficient.h"
+    "efficient.cu"
+    "thrust.h"
+    "thrust.cu"
     )
 
 cuda_add_library(stream_compaction
     ${SOURCE_FILES}
+    OPTIONS -arch=sm_37
     )
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
new file mode 100644
index 0000000..2ed6d63
--- /dev/null
+++ b/stream_compaction/common.cu
@@ -0,0 +1,39 @@
+#include "common.h"
+
+void checkCUDAErrorFn(const char *msg, const char *file, int line) {
+    cudaError_t err = cudaGetLastError();
+    if (cudaSuccess == err) {
+        return;
+    }
+
+    fprintf(stderr, "CUDA error");
+    if (file) {
+        fprintf(stderr, " (%s:%d)", file, line);
+    }
+    fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
+    exit(EXIT_FAILURE);
+}
+
+
+namespace StreamCompaction {
+    namespace Common {
+
+        /**
+         * Maps an array to an array of 0s and 1s for stream compaction. Elements
+         * which map to 0 will be removed, and elements which map to 1 will be kept.
+         */
+        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
+            // TODO
+        }
+
+        /**
+         * Performs scatter on an array. That is, for each element in idata,
+         * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
+         */
+        __global__ void kernScatter(int n, int *odata,
+                const int *idata, const int *bools, const int *indices) {
+            // TODO
+        }
+
+    }
+}
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
new file mode 100644
index 0000000..996997e
--- /dev/null
+++ b/stream_compaction/common.h
@@ -0,0 +1,132 @@
+#pragma once
+
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+#include <cstdio>
+#include <cstring>
+#include <cmath>
+#include <algorithm>
+#include <chrono>
+#include <stdexcept>
+
+#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
+#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
+
+/**
+ * Check for CUDA errors; print and exit if there was a problem.
+ */
+void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);
+
+inline int ilog2(int x) {
+    int lg = 0;
+    while (x >>= 1) {
+        ++lg;
+    }
+    return lg;
+}
+
+inline int ilog2ceil(int x) {
+    return x == 1 ? 0 : ilog2(x - 1) + 1;
+}
+
+namespace StreamCompaction {
+    namespace Common {
+        __global__ void kernMapToBoolean(int n, int *bools, const int *idata);
+
+        __global__ void kernScatter(int n, int *odata,
+                const int *idata, const int *bools, const int *indices);
+
+	    /**
+	    * This class is used for timing the performance
+	    * Uncopyable and unmovable
+        *
+        * Adapted from WindyDarian(https://github.com/WindyDarian)
+	    */
+	    class PerformanceTimer
+	    {
+	    public:
+		    PerformanceTimer()
+		    {
+			    cudaEventCreate(&event_start);
+			    cudaEventCreate(&event_end);
+		    }
+
+		    ~PerformanceTimer()
+		    {
+			    cudaEventDestroy(event_start);
+			    cudaEventDestroy(event_end);
+		    }
+
+		    void startCpuTimer()
+		    {
+			    if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
+			    cpu_timer_started = true;
+
+			    time_start_cpu = std::chrono::high_resolution_clock::now();
+		    }
+
+		    void endCpuTimer()
+		    {
+			    time_end_cpu = std::chrono::high_resolution_clock::now();
+
+			    if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }
+
+			    std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
+			    prev_elapsed_time_cpu_milliseconds =
+				    static_cast<decltype(prev_elapsed_time_cpu_milliseconds)>(duro.count());
+
+			    cpu_timer_started = false;
+		    }
+
+		    void startGpuTimer()
+		    {
+			    if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
+			    gpu_timer_started = true;
+
+			    cudaEventRecord(event_start);
+		    }
+
+		    void endGpuTimer()
+		    {
+			    cudaEventRecord(event_end);
+			    cudaEventSynchronize(event_end);
+
+			    if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }
+
+			    cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
+			    gpu_timer_started = false;
+		    }
+
+		    float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
+		    {
+			    return prev_elapsed_time_cpu_milliseconds;
+		    }
+
+		    float getGpuElapsedTimeForPreviousOperation() //noexcept
+		    {
+			    return prev_elapsed_time_gpu_milliseconds;
+		    }
+
+		    // remove copy and move functions
+		    PerformanceTimer(const PerformanceTimer&) = delete;
+		    PerformanceTimer(PerformanceTimer&&) = delete;
+		    PerformanceTimer& operator=(const PerformanceTimer&) = delete;
+		    PerformanceTimer& operator=(PerformanceTimer&&) = delete;
+
+	    private:
+		    cudaEvent_t event_start = nullptr;
+		    cudaEvent_t event_end = nullptr;
+
+		    using time_point_t = std::chrono::high_resolution_clock::time_point;
+		    time_point_t time_start_cpu;
+		    time_point_t time_end_cpu;
+
+		    bool cpu_timer_started = false;
+		    bool gpu_timer_started = false;
+
+		    float prev_elapsed_time_cpu_milliseconds = 0.f;
+		    float prev_elapsed_time_gpu_milliseconds = 0.f;
+	    };
+    }
+}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
new file mode 100644
index 0000000..dac6329
--- /dev/null
+++ b/stream_compaction/cpu.cu
@@ -0,0 +1,91 @@
+#include <cstdio>
+#include "cpu.h"
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace CPU {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+	        static PerformanceTimer timer;
+	        return timer;
+        }
+
+        /**
+         * CPU scan (prefix sum).
+         * For performance analysis, this is supposed to be a simple for loop.
+         * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
+         */
+        void scan(int n, int *odata, const int *idata) {
+            bool standalone = true;
+            try { timer().startCpuTimer(); }
+            catch (std::exception) { standalone = false; }
+
+            int sum = 0;
+            for (int i = 0; i < n; i++)
+            {
+                odata[i] = sum;
+                sum += idata[i];
+            }
+
+	        if(standalone){ timer().endCpuTimer(); }
+        }
+
+        /**
+         * CPU stream compaction without using the scan function.
+         *
+         * @returns the number of elements remaining after compaction.
+         */
+        int compactWithoutScan(int n, int *odata, const int *idata) {
+	        timer().startCpuTimer();
+            
+            int idxInOut = 0;
+            for (int i = 0; i < n; i++)
+            {
+                if (idata[i] != 0)
+                {
+                    odata[idxInOut] = idata[i];
+                    idxInOut++;
+                }
+            }
+
+	        timer().endCpuTimer();
+            return idxInOut;
+        }
+
+        /**
+         * CPU stream compaction using scan and scatter, like the parallel version.
+         *
+         * @returns the number of elements remaining after compaction.
+         */
+        int compactWithScan(int n, int *odata, const int *idata) {
+	        timer().startCpuTimer();
+	        
+            int* temp = new int[n];
+            int* tempScan = new int[n];
+            for (int i = 0; i < n; i++)
+            {
+                if (idata[i] != 0)
+                    temp[i] = 1;
+                else
+                    temp[i] = 0;
+            }
+
+            scan(n, tempScan, temp);
+
+            int num = 0;
+            for (int i = 0; i < n; i++)
+            {
+                if (temp[i] == 1)
+                {
+                    odata[tempScan[i]] = idata[i];
+                    num++;
+                }
+            }
+
+	        timer().endCpuTimer();
+            return num;
+        }
+    }
+}
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
new file mode 100644
index 0000000..236ce11
--- /dev/null
+++ b/stream_compaction/cpu.h
@@ -0,0 +1,15 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace CPU {
+	    StreamCompaction::Common::PerformanceTimer& timer();
+
+        void scan(int n, int *odata, const int *idata);
+
+        int compactWithoutScan(int n, int *odata, const int *idata);
+
+        int compactWithScan(int n, int *odata, const int *idata);
+    }
+}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
new file mode 100644
index 0000000..8e090a8
--- /dev/null
+++ b/stream_compaction/efficient.cu
@@ -0,0 +1,183 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include "common.h"
+#include "efficient.h"
+
+namespace StreamCompaction {
+    namespace Efficient {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+        #define blockSize 128
+
+        __global__ void kernMapToBoolean(int N, int* arr, int* boolArr)
+        {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) return;
+
+            boolArr[index] = arr[index];
+            if (boolArr[index] != 0)
+            {
+                boolArr[index] = 1;
+            }
+        }
+
+        __global__ void kernUpSweep(int N, int d, int* arr)
+        {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) return;
+
+            int powDPlus = 1 << (d+1);
+            int powD = 1 << d;
+
+            if (index % powDPlus == 0)
+            {
+                arr[index + powDPlus - 1] += arr[index + powD - 1];
+            }
+            if (index == N - 1)
+            {
+                arr[index] = 0;
+            }
+        }
+
+        __global__ void kernDownSweep(int N, int d, int* arr)
+        {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) return;
+
+            int powDPlus = 1 << (d + 1);
+            int powD = 1 << d;
+
+            if (index % powDPlus == 0)
+            {
+                int temp = arr[index + powD - 1];
+                arr[index + powD - 1] = arr[index + powDPlus - 1];
+                arr[index + powDPlus - 1] += temp;
+            }
+        }
+
+        __global__ void kernInclusiveToExclusive(int N, int* arr)
+        {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) return;
+
+            arr[index] -= arr[0];
+        }
+
+        __global__ void kernScatter(int N, int* idata, int* boolArr, int* scanArr, int* odata)
+        {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) return;
+
+            if (boolArr[index] == 1)
+            {
+                int idx = scanArr[index];
+                odata[idx] = idata[index];
+            }
+        }
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+            int pow2Length = 1 << ilog2ceil(n);
+            int* idataPow2 = new int[pow2Length];
+            memcpy(idataPow2, idata, n * sizeof(int));
+            for (int i = n; i < pow2Length; i++)
+            {
+                idataPow2[i] = 0;
+            }
+
+            int* dev_arr;
+
+            cudaMalloc((void**)&dev_arr, pow2Length * sizeof(int));
+            cudaMemcpy(dev_arr, idataPow2, sizeof(int) * n, cudaMemcpyHostToDevice);
+            
+            timer().startGpuTimer();
+            
+            for (int i = 0; i < ilog2ceil(n); i++)
+            {
+                kernUpSweep << <pow2Length, blockSize >> > (pow2Length, i, dev_arr);
+            }
+            
+            for (int j = ilog2ceil(n)-1; j >= 0; j--)
+            {
+                kernDownSweep << <pow2Length, blockSize >> > (pow2Length, j, dev_arr);
+            }
+
+            kernInclusiveToExclusive << <pow2Length, blockSize >> > (pow2Length, dev_arr);
+
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_arr, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_arr);
+        }
+
+        /**
+         * Performs stream compaction on idata, storing the result into odata.
+         * All zeroes are discarded.
+         *
+         * @param n      The number of elements in idata.
+         * @param odata  The array into which to store elements.
+         * @param idata  The array of elements to compact.
+         * @returns      The number of elements remaining after compaction.
+         */
+        int compact(int n, int *odata, const int *idata) {
+            int* dev_boolArr;
+            int* dev_scanArr;
+            int* dev_idata;
+            int* dev_odata;
+
+            int* host_boolArr = new int[n];
+            int* host_scanArr = new int[n];
+
+            cudaMalloc((void**)&dev_boolArr, n * sizeof(int));
+            cudaMalloc((void**)&dev_scanArr, n * sizeof(int));
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+
+            cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            
+            bool standalone = true;
+            try { timer().startCpuTimer(); }
+            catch (std::exception) { standalone = false; }
+            
+            kernMapToBoolean << <n, blockSize >> > (n, dev_idata, dev_boolArr);
+            cudaMemcpy(host_boolArr, dev_boolArr, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            scan(n, host_scanArr, host_boolArr);
+            cudaMemcpy(dev_scanArr, host_scanArr, sizeof(int) * n, cudaMemcpyHostToDevice);
+
+            kernScatter << <n, blockSize >> > (n, dev_idata, dev_boolArr, dev_scanArr, dev_odata);
+
+            if (standalone) { timer().endCpuTimer(); }
+
+            cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            int num = 0;
+            for (int i = n-1; i >= 0; i--)
+            {
+                if (host_boolArr[i] != 0)
+                {
+                    num = host_scanArr[i] + 1;
+                    break;
+                }
+            }
+
+            cudaFree(dev_boolArr);
+            cudaFree(dev_scanArr);
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
+
+            free(host_boolArr);
+            free(host_scanArr);
+
+            return num;
+        }
+    }
+}
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
new file mode 100644
index 0000000..803cb4f
--- /dev/null
+++ b/stream_compaction/efficient.h
@@ -0,0 +1,13 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Efficient {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void scan(int n, int *odata, const int *idata);
+
+        int compact(int n, int *odata, const int *idata);
+    }
+}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
new file mode 100644
index 0000000..f42ecc3
--- /dev/null
+++ b/stream_compaction/naive.cu
@@ -0,0 +1,77 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include "common.h"
+#include "naive.h"
+
+namespace StreamCompaction {
+    namespace Naive {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+        #define blockSize 128
+
+        __global__ void kernNaiveScan(int N, int d, int* read, int* write)
+        {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) return;
+
+            int start = pow(float(2), float(d - 1));
+            if (index >= start)
+            {
+                write[index] = read[index - start] + read[index];
+            }
+            else
+            {
+                write[index] = read[index];
+            }
+        }
+
+        __global__ void kernInclusiveToExclusive(int N, int* read, int* write)
+        {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) return;
+
+            if (index == 0)
+            {
+                write[index] = 0;
+            }
+            else
+            {
+                write[index] = read[index-1];
+            }
+        }
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+            int* dev_read;
+            int* dev_write;
+            
+            cudaMalloc((void**)&dev_read, n * sizeof(int));
+            cudaMalloc((void**)&dev_write, n * sizeof(int));
+
+            cudaMemcpy(dev_read, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            
+            timer().startGpuTimer();
+            
+            for (int i = 1; i <= ilog2ceil(n); i++)
+            {
+                kernNaiveScan << <n, blockSize >> > (n, i, dev_read, dev_write);
+                std::swap(dev_read, dev_write);
+            }
+            kernInclusiveToExclusive << <n, blockSize >> > (n, dev_read, dev_write);
+   
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_write, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_read);
+            cudaFree(dev_write);
+        }
+    }
+}
diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h
new file mode 100644
index 0000000..37dcb06
--- /dev/null
+++ b/stream_compaction/naive.h
@@ -0,0 +1,11 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Naive {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void scan(int n, int *odata, const int *idata);
+    }
+}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
new file mode 100644
index 0000000..6183414
--- /dev/null
+++ b/stream_compaction/thrust.cu
@@ -0,0 +1,35 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include <thrust/device_vector.h>
+#include <thrust/host_vector.h>
+#include <thrust/scan.h>
+#include "common.h"
+#include "thrust.h"
+
+namespace StreamCompaction {
+    namespace Thrust {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+            thrust::host_vector<int> host_idata(n);
+            thrust::copy(idata, idata + n, host_idata.begin());
+
+            thrust::device_vector<int> dv_in = host_idata;
+
+            thrust::device_vector<int> dv_out(n);
+            
+            timer().startGpuTimer();
+            thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            timer().endGpuTimer();
+
+            thrust::copy(dv_out.begin(), dv_out.end(), odata);
+        }
+    }
+}
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
new file mode 100644
index 0000000..fe98206
--- /dev/null
+++ b/stream_compaction/thrust.h
@@ -0,0 +1,11 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Thrust {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void scan(int n, int *odata, const int *idata);
+    }
+}