diff --git a/README.md b/README.md
index d63a6a1..598fb88 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,68 @@
+![](images/project1.gif)
+
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* SOMANSHU AGARWAL
+  * [LinkedIn](https://www.linkedin.com/in/somanshu25)
+* Tested on: Windows 10, i7-6700 @ 3.4GHz 16GB, Quadro P 100 12 233MB (Moore 100B Lab)
+
+### OBJECTIVE
+
+The goal of this project is to visualize Boids flocking and to compare the performance of three implementations of the neighbour search: naive, scattered uniform grid, and coherent uniform grid. Each search finds the relevant neighbours of a boid and updates that boid's position and velocity according to three rules: cohesion, separation, and alignment (velocity matching). Brief implementation details of the three approaches and the performance analysis are given below.
+
+## 1. Naive Implementation
+
+In the naive implementation, each boid checks every other boid and then selects the relevant neighbours for its position and velocity update. Every boid therefore examines the other N-1 boids, which is time consuming and inefficient when N is fairly large.
+
+## 2. Scattered
+
+### Grid System
+For the remaining two implementations, we enclose the whole simulation space in a cube and divide it into a uniform grid whose cell width is set by the user. Changing the cell width changes the grid resolution and the grid cell count, and its impact on performance is also studied in this project. Using the grid, each boid is labelled with the index of the grid cell that contains it.
+
+![](https://github.com/somanshu25/Project1-CUDA-Flocking/blob/master/images/Boids%20Ugrid%20base.png)
+
+In this implementation, we limit the search to boids whose grid cells lie within the neighbourhood distance, i.e., the cells that can intersect a sphere centred on the boid whose radius is the maximum neighbourhood distance. The number of cells to check depends on the cell width: a cell width of twice the maximum neighbourhood distance requires at most 8 cells, while a cell width equal to the maximum neighbourhood distance requires at most 27 cells. To make this lookup possible, we sort the array of boid indices by their grid cell index and then store, for each grid cell, the start and end index into that sorted array. The neighbour search for a boid is therefore limited to the boids between those start and end indices rather than all N-1 boids. A minimal sketch of the cell-labelling step is shown after the diagram below.
+
+![](https://github.com/somanshu25/Project1-CUDA-Flocking/blob/master/images/Boids%20Ugrids%20buffers%20naive.png)
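A minimal sketch of the cell-labelling step described above, added here for illustration; it mirrors `kernComputeIndices` in `src/kernel.cu` later in this diff, and the kernel name `kernLabelBoidsSketch` is illustrative only, not part of the project code:

```cuda
// Sketch: label each boid with the 1D index of the grid cell that contains it.
// Mirrors kernComputeIndices below; kernLabelBoidsSketch is an illustrative name only.
__global__ void kernLabelBoidsSketch(int N, int gridResolution, glm::vec3 gridMin,
                                     float inverseCellWidth, const glm::vec3 *pos,
                                     int *boidIndices, int *gridIndices) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i >= N) return;
  boidIndices[i] = i;                                                  // parallel array of boid indices
  glm::vec3 cell = glm::floor((pos[i] - gridMin) * inverseCellWidth);  // 3D cell coordinate
  // x is the fastest-changing axis in the 1D ordering, as in gridIndex3Dto1D
  gridIndices[i] = (int)cell.x
                 + (int)cell.y * gridResolution
                 + (int)cell.z * gridResolution * gridResolution;
}
```

After this kernel runs, `gridIndices` is sorted together with `boidIndices` (the project uses `thrust::sort_by_key`), so boids in the same cell become adjacent and each cell's start and end positions in the sorted array can be recorded.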
+## 3. Coherent
+
+In this implementation, we optimize further by reshuffling the position and velocity data according to the sorted boid array indices, so that memory accesses for boids in the same cell are contiguous; this gives more cache hits and fewer misses, which saves runtime. With the data reshuffled, we only need a cell's start and end indices to access the boid data directly, rather than going through the middleman array of boid indices that the scattered implementation requires.
+
+![](https://github.com/somanshu25/Project1-CUDA-Flocking/blob/master/images/Boids%20Ugrids%20buffers%20data%20coherent.png)
+
+# Performance Analysis:
+
+The graphs below show the frames per second (FPS) of the three implementations as the number of boids in the simulation changes. (The FPS readings are taken from the window title of the display.)
+
+1. FPS with and without visualization as the number of boids increases, for all three implementations:
+
+![](images/Chart_Boid_Count_With_Visual.png)
+
+
+![](images/Chart_Boid_Count_Without_Visual.png)
+
+Both graphs show that the FPS is higher without visualization. Taking 20-30 FPS as the threshold for a decent simulation, the naive implementation runs well up to about 10K boids, the scattered grid up to about 100K, and the coherent grid up to about 200K.
+
+2. FPS as the block size increases, for all three implementations (with 50K boids):
+
+![](images/Chart_Block_Size.png)
+
+
+3. FPS when the grid cell width is twice the neighbourhood distance versus equal to the neighbourhood distance, where the neighbourhood distance is the maximum of the rule distances used in the neighbour checks:
+
+![](images/Chart_Grid_CellWidth_Change.png)
+
+## Answers to the questions:
+### For each implementation, how does changing the number of boids affect performance? Why do you think this is?
+As the graphs above show, performance decreases as the number of boids increases, because each reference boid has more neighbours and more candidate boids to check, which increases the runtime. The trend is the same in all three implementations, but the coherent grid performs best at higher boid counts compared with the other two.
+
+### For each implementation, how does changing the block count and block size affect performance? Why do you think this is?
+The block-size graph shows that performance is almost the same for all three implementations as long as the block size is a multiple of the warp size, i.e., 32. This follows from the fact that the total number of threads launched stays the same even as the block size increases.
+
+### For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
+Yes, the coherent uniform grid improves performance, because it removes the extra work of fetching velocity and position data whose layout in memory is not in sync with the cell start and end indices, which causes more misses in the on-chip caches. After reshuffling the velocity and position vectors so that the boids of a cell are contiguous, less time is spent on memory access, which improves performance. (An alternative way to express this reshuffle with `thrust::gather` is sketched below.)
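As a side note on the reshuffling step: the same gather could also be written with Thrust instead of the custom `kernReshuffleBoidDataForCoherentSearch` kernel that appears later in this diff. The sketch below is illustrative only (the wrapper name `gatherCoherentSketch` is not part of the project) and assumes the same buffers as the project code:

```cuda
#include <glm/glm.hpp>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <thrust/gather.h>

// Sketch only: gather pos into cell-sorted order with thrust::gather.
// particleArrayIndices[i] is the original index of the i-th boid after sorting by cell,
// so the gather produces dst[i] = src[particleArrayIndices[i]].
void gatherCoherentSketch(int N, const int *dev_particleArrayIndices,
                          const glm::vec3 *dev_pos, glm::vec3 *dev_posCoherent) {
  thrust::device_ptr<const int> map(dev_particleArrayIndices);
  thrust::device_ptr<const glm::vec3> src(dev_pos);
  thrust::device_ptr<glm::vec3> dst(dev_posCoherent);
  thrust::gather(thrust::device, map, map + N, src, dst);
}
```

The commit keeps the hand-written kernel, which reshuffles position and velocity in a single launch; two `thrust::gather` calls would do the same work in two passes.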
-### (TODO: Your README)
+### Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!
+After plotting the line graph for 8 vs 27 neighbouring cells for the scattered implementation, we observe that checking 27 neighbouring cells gives slightly better performance than checking 8. One reason could be that the smaller cell width produces more grid cells, each containing fewer boids. Compared with larger cells, the boids that are examined are therefore more likely to actually lie within the neighbourhood distance, whereas with larger cells we also have to examine boids that are farther away but still inside the cell. Hence, performance is slightly better with the smaller cell width.
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
diff --git a/images/Chart_Block_Size.png b/images/Chart_Block_Size.png
new file mode 100644
index 0000000..abf306e
Binary files /dev/null and b/images/Chart_Block_Size.png differ
diff --git a/images/Chart_Boid_Count_With_Visual.png b/images/Chart_Boid_Count_With_Visual.png
new file mode 100644
index 0000000..c06c354
Binary files /dev/null and b/images/Chart_Boid_Count_With_Visual.png differ
diff --git a/images/Chart_Boid_Count_Without_Visual.png b/images/Chart_Boid_Count_Without_Visual.png
new file mode 100644
index 0000000..6576c15
Binary files /dev/null and b/images/Chart_Boid_Count_Without_Visual.png differ
diff --git a/images/Chart_Grid_CellWidth_Change.png b/images/Chart_Grid_CellWidth_Change.png
new file mode 100644
index 0000000..e752e89
Binary files /dev/null and b/images/Chart_Grid_CellWidth_Change.png differ
diff --git a/images/project1.gif b/images/project1.gif
new file mode 100644
index 0000000..3aec82d
Binary files /dev/null and b/images/project1.gif differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..53df9f4 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -85,6 +85,8 @@ int *dev_gridCellEndIndices; // to this cell?
 
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3 *dev_posCoherent;
+glm::vec3 *dev_velCoherent;
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -165,10 +167,31 @@ void Boids::initSimulation(int N) {
   gridInverseCellWidth = 1.0f / gridCellWidth;
   float halfGridWidth = gridCellWidth * halfSideCount;
   gridMinimum.x -= halfGridWidth;
-  gridMinimum.y -= halfGridWidth;
+  gridMinimum.y -= halfGridWidth;
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  dev_thrust_particleArrayIndices = thrust::device_pointer_cast(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_pointer_cast(dev_particleGridIndices);
+
+  cudaMalloc((void**)&dev_posCoherent, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_posCoherent failed!");
+
+  cudaMalloc((void**)&dev_velCoherent, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_velCoherent failed!");
+
   cudaDeviceSynchronize();
 }
 
@@ -233,7 +256,48 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
   // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
   // Rule 2: boids try to stay a distance d away from each other
   // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 updatedVel(0.0f, 0.0f, 0.0f);
+
+  // Rule 1
+  glm::vec3 vectPerceived(0.0f, 0.0f, 0.0f);
+  int numberNeighbours = 0;
+  for (int i = 0; i < N; i++) {
+    //float dist = sqrt(pow(pos[iSelf].x - pos[i].x,2) + pow(pos[iSelf].y - pos[i].y,2) + pow(pos[iSelf].z - pos[i].z,2));
+    if (iSelf != i && glm::distance(pos[iSelf], pos[i]) < rule1Distance) {
+      vectPerceived += pos[i];
+      numberNeighbours++;
+    }
+  }
+  if (numberNeighbours) {
+    vectPerceived /= numberNeighbours;
+    updatedVel = updatedVel + (vectPerceived - pos[iSelf]) * rule1Scale;
+  }
+
+  // Rule 2
+  glm::vec3 vectC(0.0f, 0.0f, 0.0f);
+  for (int i = 0; i < N; i++) {
+    //float dist = sqrt(pow(pos[iSelf].x - pos[i].x,2) + pow(pos[iSelf].y - pos[i].y,2) + pow(pos[iSelf].z - pos[i].z,2));
+    if (iSelf != i && glm::distance(pos[iSelf], pos[i]) < rule2Distance) {
+      vectC -= (pos[i] - pos[iSelf]);
+    }
+  }
+  updatedVel += vectC * rule2Scale;
+
+  // Rule 3
+  glm::vec3 vectPerceivedVel(0.0f, 0.0f, 0.0f);
+  numberNeighbours = 0;
+  for (int i = 0; i < N; i++) {
+    //float dist = sqrt(pow(pos[iSelf].x - pos[i].x,2) + pow(pos[iSelf].y - pos[i].y,2) + pow(pos[iSelf].z - pos[i].z,2));
+    if (iSelf != i && glm::distance(pos[iSelf], pos[i]) < rule3Distance) {
+      vectPerceivedVel += vel[i];
+      numberNeighbours++;
+    }
+  }
+  if (numberNeighbours) {
+    vectPerceivedVel /= numberNeighbours;
+    updatedVel = updatedVel + vectPerceivedVel * rule3Scale;
+  }
+  return updatedVel;
 }
 
 /**
@@ -245,6 +309,17 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   // Compute a new velocity based on pos and vel1
   // Clamp the speed
   // Record the new velocity into vel2. Question: why NOT vel1?
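  // (Answer to the question above: vel2 is written because other threads in this same launch are
  //  still reading vel1 to compute their own velocity changes; updating vel1 in place would create
  //  a race within the timestep, so the two velocity buffers are ping-ponged between steps instead.)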
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+  glm::vec3 updatedVel{ computeVelocityChange(N, index, pos, vel1) };
+  vel2[index] = vel1[index] + updatedVel;
+  //float length = glm::length(vel2[index]);
+  //if (length > maxSpeed) {
+  //  vel2[index] = maxSpeed / length * vel2[index];
+  //}
+  vel2[index] = glm::clamp(vel2[index], -maxSpeed, maxSpeed);
 }
 
 /**
@@ -289,6 +364,15 @@ __global__ void kernComputeIndices(int N, int gridResolution,
   // - Label each boid with the index of its grid cell.
   // - Set up a parallel array of integer indices as pointers to the actual
   //   boid data in pos and vel1/vel2
+  int index = threadIdx.x + blockIdx.x * blockDim.x;
+  if (index >= N)
+    return;
+
+  indices[index] = index;
+  glm::vec3 gridCoord = glm::floor((pos[index] - gridMin) * inverseCellWidth);
+  int index1D = gridIndex3Dto1D((int) gridCoord.x, (int) gridCoord.y, (int) gridCoord.z, gridResolution);
+  //printf("Indices are: %d \n", index1D);
+  gridIndices[index] = index1D;
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -306,6 +390,49 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+  int index = threadIdx.x + blockIdx.x * blockDim.x;
+  if (index >= N)
+    return;
+
+  // A boid starts a new cell if it is the first boid overall or its cell differs from the previous boid's.
+  if (index == 0) {
+    gridCellStartIndices[particleGridIndices[index]] = 0;
+  }
+  else if (particleGridIndices[index - 1] != particleGridIndices[index]) {
+    gridCellStartIndices[particleGridIndices[index]] = index;
+    gridCellEndIndices[particleGridIndices[index - 1]] = index - 1;
+  }
+  // The last boid always closes its cell (it may also have opened it in the check above).
+  if (index == N - 1) {
+    gridCellEndIndices[particleGridIndices[index]] = N - 1;
+  }
+
+}
+
+// New function for reshuffling the boid data so boids within the same cell are contiguous
+
+__global__ void kernReshuffleBoidDataForCoherentSearch(
+  int N, int *particleArrayIndices,
+  glm::vec3 *pos, glm::vec3 *vel,
+  glm::vec3 *posShuffled, glm::vec3 *velShuffled) {
+  // Reshuffle pos and vel for contiguous memory access
+
+  int index = threadIdx.x + blockIdx.x * blockDim.x;
+  if (index >= N)
+    return;
+
+  posShuffled[index] = pos[particleArrayIndices[index]];
+  velShuffled[index] = vel[particleArrayIndices[index]];
+}
+
+__global__ void kernUnshuffleBoidVelocityData(
+  int N, int *particleArrayIndices,
+  glm::vec3 *velShuffled, glm::vec3 *velUnshuffled) {
+  // Scatter the shuffled velocities back into the original boid ordering
+
+  int index = threadIdx.x + blockIdx.x * blockDim.x;
+  if (index >= N)
+    return;
+
+  velUnshuffled[particleArrayIndices[index]] = velShuffled[index];
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -322,6 +449,82 @@
   // - Identify the grid cell that this particle is in
   // - Identify which cells may contain neighbors. This isn't always 8.
   // - For each cell, read the start/end indices in the boid pointer array.
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+
+  /*for (int i = 0; i < N; i++) {
+    printf("%d\n", particleArrayIndices[i]);
+  }*/
+
+  int index = threadIdx.x + blockIdx.x * blockDim.x;
+  if (index >= N)
+    return;
+
+  glm::vec3 vectPerceived(0.0f, 0.0f, 0.0f);
+  glm::vec3 vectC(0.0f, 0.0f, 0.0f);
+  glm::vec3 vectPerceivedVel(0.0f, 0.0f, 0.0f);
+
+  float max_distance = max(rule1Distance, max(rule2Distance, rule3Distance));
+  glm::ivec3 maxCellIndex = ((pos[index] - gridMin) + max_distance) * inverseCellWidth;
+  glm::ivec3 minCellIndex = ((pos[index] - gridMin) - max_distance) * inverseCellWidth;
+
+  int candidateGridId, boidStart, boidEnd, numberNeighboursRule1 = 0, numberNeighboursRule3 = 0;
+  //printf("Boid Co-ordinate: x: %d ,y: %d, z:%d \n", boidCoord.x, boidCoord.y, boidCoord.z);
+  for (int x = minCellIndex.x; x <= maxCellIndex.x; x++) {
+    if (x < 0 || x >= gridResolution)
+      continue;
+    for (int y = minCellIndex.y; y <= maxCellIndex.y; y++) {
+      if (y < 0 || y >= gridResolution)
+        continue;
+      for (int z = minCellIndex.z; z <= maxCellIndex.z; z++) {
+        if (z < 0 || z >= gridResolution)
+          continue;
+
+        candidateGridId = gridIndex3Dto1D(x, y, z, gridResolution);
+        boidStart = gridCellStartIndices[candidateGridId];
+        boidEnd = gridCellEndIndices[candidateGridId];
+        //printf("Boid Start: %d and Boid End: %d \n", boidStart, boidEnd);
+        if (boidStart == -1 || boidEnd == -1)
+          continue;
+        for (int id = boidStart; id <= boidEnd; id++) {
+          int boidCheckId = particleArrayIndices[id];
+          //loopCount++;
+          // Rule 1
+          if (index != boidCheckId && glm::distance(pos[boidCheckId], pos[index]) < rule1Distance) {
+            vectPerceived += pos[boidCheckId];
+            numberNeighboursRule1++;
+          }
+          // Rule 2
+          if (index != boidCheckId && glm::distance(pos[boidCheckId], pos[index]) < rule2Distance) {
+            vectC -= (pos[boidCheckId] - pos[index]);
+          }
+
+          // Rule 3
+          if (index != boidCheckId && glm::distance(pos[boidCheckId], pos[index]) < rule3Distance) {
+            vectPerceivedVel += vel1[boidCheckId];
+            numberNeighboursRule3++;
+          }
+        }
+      }
+    }
+  }
+  //printf("Loop Count : %d\n", loopCount);
+
+  if (numberNeighboursRule1) {
+    vectPerceived /= numberNeighboursRule1;
+    vectPerceived = (vectPerceived - pos[index]) * rule1Scale;
+  }
+
+  vectC = vectC * rule2Scale;
+
+  if (numberNeighboursRule3) {
+    vectPerceivedVel /= numberNeighboursRule3;
+    vectPerceivedVel = vectPerceivedVel * rule3Scale;
+  }
+
+  vel2[index] = vel1[index] + vectPerceived + vectC + vectPerceivedVel;
+  vel2[index] = glm::clamp(vel2[index], -maxSpeed, maxSpeed);
+  //vel2[index] = vel1[index] + vectVelChange;
+
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -341,6 +544,78 @@
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+
+  int index = threadIdx.x + blockIdx.x * blockDim.x;
+  if (index >= N)
+    return;
+
+  glm::vec3 vectPerceived(0.0f, 0.0f, 0.0f);
+  glm::vec3 vectC(0.0f, 0.0f, 0.0f);
+  glm::vec3 vectPerceivedVel(0.0f, 0.0f, 0.0f);
+
+  float max_distance = max(rule1Distance, max(rule2Distance, rule3Distance));
+  glm::ivec3 maxCellIndex = ((pos[index] - gridMin) + max_distance) * inverseCellWidth;
+  glm::ivec3 minCellIndex = ((pos[index] - gridMin) - max_distance) * inverseCellWidth;
+
+  int candidateGridId, boidStart, boidEnd, numberNeighboursRule1 = 0, numberNeighboursRule3 = 0;
+  //printf("Boid Co-ordinate: x: %d ,y: %d, z:%d \n", maxCellIndex.x, maxCellIndex.y, maxCellIndex.z);
+  int loopCount = 0;
+  for (int x = minCellIndex.x; x <= maxCellIndex.x; x++) {
+    if (x < 0 || x >= gridResolution)
+      continue;
+    for (int y = minCellIndex.y; y <= maxCellIndex.y; y++) {
+      if (y < 0 || y >= gridResolution)
+        continue;
+      for (int z = minCellIndex.z; z <= maxCellIndex.z; z++) {
+        if (z < 0 || z >= gridResolution)
+          continue;
+        loopCount++;
+        candidateGridId = gridIndex3Dto1D(x, y, z, gridResolution);
+        boidStart = gridCellStartIndices[candidateGridId];
+        boidEnd = gridCellEndIndices[candidateGridId];
+        //printf("Boid Start: %d and Boid End: %d \n", boidStart, boidEnd);
+        if (boidStart == -1 || boidEnd == -1)
+          continue;
+        for (int id = boidStart; id <= boidEnd; id++) {
+          //loopCount++;
+          // Rule 1
+          if (index != id && glm::distance(pos[id], pos[index]) < rule1Distance) {
+            vectPerceived += pos[id];
+            numberNeighboursRule1++;
+          }
+          // Rule 2
+          if (index != id && glm::distance(pos[id], pos[index]) < rule2Distance) {
+            vectC -= (pos[id] - pos[index]);
+          }
+
+          // Rule 3
+          if (index != id && glm::distance(pos[id], pos[index]) < rule3Distance) {
+            vectPerceivedVel += vel1[id];
+            numberNeighboursRule3++;
+          }
+        }
+      }
+    }
+  }
+  //printf("Loop Count : %d\n", loopCount);
+
+  if (numberNeighboursRule1) {
+    vectPerceived /= numberNeighboursRule1;
+    vectPerceived = (vectPerceived - pos[index]) * rule1Scale;
+  }
+
+  vectC = vectC * rule2Scale;
+
+  if (numberNeighboursRule3) {
+    vectPerceivedVel /= numberNeighboursRule3;
+    vectPerceivedVel = vectPerceivedVel * rule3Scale;
+  }
+
+  vel2[index] = vel1[index] + vectPerceived + vectC + vectPerceivedVel;
+  vel2[index] = glm::clamp(vel2[index], -maxSpeed, maxSpeed);
+  //vel2[index] = vel1[index] + vectVelChange;
+
 }
 
 /**
@@ -349,6 +624,12 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
   // TODO-1.2 ping-pong the velocity buffers
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("brute force failed");
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  checkCUDAErrorWithLine("update Position Function Failed");
+  glm::vec3 *temp = dev_vel1; dev_vel1 = dev_vel2; dev_vel2 = temp; // ping-pong: swap the buffers rather than aliasing them
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
@@ -364,6 +645,31 @@
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  dim3 gridResetSize((gridCellCount + blockSize - 1) / blockSize);
+
+  kernResetIntBuffer<<<gridResetSize, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  checkCUDAErrorWithLine("resetting dev_gridCellStartIndices failed ");
+
+  kernResetIntBuffer<<<gridResetSize, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  checkCUDAErrorWithLine("resetting dev_gridCellEndIndices failed ");
+
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  checkCUDAErrorWithLine("Computing indices failed ");
+
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("Identifying Start and End Boid Points failed ");
+
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("Scattered Neighbour Search failed ");
+
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  checkCUDAErrorWithLine("update Position Function Failed");
+
+  glm::vec3 *temp = dev_vel1; dev_vel1 = dev_vel2; dev_vel2 = temp; // ping-pong: swap the buffers rather than aliasing them
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
@@ -382,6 +688,38 @@
   // - Perform velocity updates using neighbor search
   // - Update positions
   // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
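  // (Note on the ping-pong here: no pointer swap is needed at the end of this coherent step,
  //  because kernUnshuffleBoidVelocityData scatters the new velocities from dev_vel2 back into
  //  dev_vel1 in the original boid order, and kernUpdatePos then reads dev_vel1 directly.)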
+
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  dim3 gridResetSize((gridCellCount + blockSize - 1) / blockSize);
+
+  kernResetIntBuffer<<<gridResetSize, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  checkCUDAErrorWithLine("resetting dev_gridCellStartIndices failed ");
+
+  kernResetIntBuffer<<<gridResetSize, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  checkCUDAErrorWithLine("resetting dev_gridCellEndIndices failed ");
+
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  checkCUDAErrorWithLine("Computing indices failed ");
+
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("Identifying Start and End Boid Points failed ");
+
+  kernReshuffleBoidDataForCoherentSearch<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_pos, dev_vel1, dev_posCoherent, dev_velCoherent);
+  checkCUDAErrorWithLine("Shuffling Boid pos and vel data for Coherent Search failed ");
+
+  kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_posCoherent, dev_velCoherent, dev_vel2);
+  checkCUDAErrorWithLine("Coherent Neighbour search failed ");
+
+  kernUnshuffleBoidVelocityData<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_vel2, dev_vel1);
+  checkCUDAErrorWithLine("Unshuffling velocity boid data failed ");
+
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel1);
+  checkCUDAErrorWithLine("update Position Function Failed");
+
+  //dev_vel1 = dev_velCoherent;
 }
 
 void Boids::endSimulation() {
@@ -390,6 +728,14 @@
   cudaFree(dev_pos);
 
   // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+
+  cudaFree(dev_posCoherent);
+  cudaFree(dev_velCoherent);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..fafcc7b 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,11 +14,11 @@
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 50000;
 const float DT = 0.2f;
 
 /**
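A note on measurement: the FPS figures in the performance analysis above were read from the window title. An alternative, sketched below and not part of this commit, is to time one simulation step directly with CUDA events; `Boids::stepSimulationCoherentGrid` and `kernel.h` are names from this repository, while `timeOneStepSketch` is illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include "kernel.h"  // assumed to declare the Boids:: step functions used below

// Sketch: measure one simulation step with CUDA events (illustrative, not part of the commit).
void timeOneStepSketch(float dt) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  Boids::stepSimulationCoherentGrid(dt);  // the step being measured
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);             // wait for the step to finish on the GPU

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
  printf("one coherent-grid step took %.3f ms\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}
```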