Finding-silimar-users-in-Netflix-Dataset

The motive is to find the similar users who have rated highly almost same movies. The given databse is a Netflix userbase with movies , user_id, the rating of the movie and the date when the user rated.

In this project, we are cleaning up the Netflix database to process the data easily according to the movies rated by the suers and then find the similar users who have rated the similar movies and list all those pairs. The project is further extended to identify the closest neighbor for a new user entered as recommendation.

Problem 1:

Statement: Clean the Data according to the given constraints and rearrange the data into ‘M’ movie rows and ‘N’ user columns.

Approach:

Drop the unnecessary columns such as date from the data
Remove users with ratings less than 3 and who have rated more than 20 movies
Extract the movie_id values and movie_id indixes using the value of 9999 in rating as a reference
Re-index the data after extracting the movie_id and extract the unique user id from the data
Remove the movies with no users
Generate row, column, data values to generate the sparse matrix
Create the sparse matrix using the row index, column index and data

Results:

Sparse Matrix dimensions: [4485, 231424]

Problem 2:

Statement: To analyze random pairs of 10,000 users and calculate the average Jaccard distance, similarity and plot the histogram of pairwise Jaccard distances

Approach:

Pad the matrix with zeros to increase the row size to the nearest prime number i:e 4493
Randomly choose 10,000 pairs from the data set following the discreet uniform distribution
Calculate the Jaccard distance and similarity of each pair
Jaccard Distance = 1 – intersection(Pair1, Pair2) / union(Pair1, Pair2)
Jaccard Similarity = intersection(Pair1, Pair2) / union(Pair1, Pair2)
Plot the histogram of the values for each pair

Results:

Average value of Jaccard Distance: 0.9804398083185061 Minimum value of Jaccard Distance: 0.33333333333333337

Problem 3:

Statement: Find the nearest neighbor pairs for each queried user

Approach:

We use the method of Min Hashing and Locality Sensitive Hashing to reduce the computational and spatial limits while computing the nearest neighbors

Min Hashing:

Permute the indexes of each row to new index with respect to the function: permuted_index = (a*x + b) % r a = random value in range(0 to (Rows of Padded _Sparse Matrix – 1)) b = random value in range(0 to (Rows of Padded _Sparse Matrix – 1)) r = number of rows of Padded _Sparse Matrix x = vector of index value of 1’s in each column
Hash Value of each column is the minimum index of 1’s in that column
Hence, we get one Hash value for each column of the Sparse Matrix
Repeat the Min Hash process for 100 iterations to obtain a table of Hash values of size [1000, 231424]

Local Sensitivity Hashing:

Split the Hash Table into ‘B’ bands each consisting of r1 rows.
Each band is mapped to a single value according to the function: (a1*x + b1) % r1 a1 = random value in range(0 to (Rows of Padded _Sparse Matrix – 1)) b1 = random value in range(0 to (Rows of Padded _Sparse Matrix – 1)) r1 = Huge integer value x = vector of the elements in each band across the columns
The sum of each vector band is taken and the size of the Hash table is reduced to the size of [B, 231424]

Results:

Original Dataset Sample:

array([[1, 1, 1, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 1], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

Hash Table Sample:

array([[20,74, 1795, ..., 2577, 2577, 3709], [302, 1084, 2618, ..., 4174, 4174, 1830], [ 16, 180, 2530, ..., 1015, 1015, 18], ..., [ 705, 150, 1742, ..., 4393, 4393, 2300], [ 471, 18, 12, ..., 675, 675, 658], [ 76, 76, 76, ..., 732, 732, 2175]])

Band Matrix Sample:

array([[1236891226, 2854582886, 2192001821, ..., 2254508321, 2254508321,1696931309], [2513191099, 1689525747,2544410733, ..., 1905784720, 1905784720,1286734634], .....

Problem 4:

Statement: To detect the pairs of similar users Approach:

Each column of the Band matrix is checked with other columns (each user with the rest) and a user is similar if any one of the band values matches at that position.
Similar pairs are matched to each other and they are mapped to the same list if they have Jaccard distance of less than 0.35
The Jaccard distance of similar pairs are calculated using the sparse matrix and not the Band Matrix

Results:

The value of Band(b1) = 20 and rows per band r1 = 5 give better results as we get less false negatives The probability of finding similar pairs is given by P = 1 – (1 – (S ^ r1)) ^ b1 S = similarity > = 0.65 The values of r1 and b give better results with probability 0.91513 This can also be viewed in the graph. The specific values of r1 and b1 give a smaller number of false negative as compared to the other values.

Problem 5:

Statement: Given a new user find the similar user in the dataset Approach:

The new user dataset is assumed to be in the following format of ‘ratings’ of size [4485, 1]
If the movie was rated by the user, 1 is placed at the location of the movie_id or else a 0 is in that position.
```
          movie_id1[0 or 1] | movie_id2[0 or 1] | movie_id3[0 or 1] |.... |....|movie_idn 
```
We also assume that the user has rated all is movies as 3 or above and only rated movies less than 20
Min hashing and local sensitivity hashing are applied to the new user data and the Band matrix is obtained.
The values of a, b, r, a1, b1, r1 used are the same as the signature matrix and the band matrix of the data
The band values are checked against the users in the data set and if at least one bad match, the new user and the user in that column of the data set are similar
Then we determine the Jaccard distance to determine if the distance between them is less than 0.35
If the distance is less than 0.35 then the new user and the user in the dataset are similar.

Results:

Input of new user user_new_check[0] = 1, user_new_check[189] = 1, user_new_check[455] = 1, user_new_check[466] = 1, user_new_check[561] = 1, user_new_check[690] = 1, user_new_check[1103] = 1, user_new_check[1212] = 1, user_new_check[1896] = 1, user_new_check[3948] = 1, user_new_check[4195] = 1, user_new_check[4375] = 1

Output:

The closest neighbour is: 124105

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Finding-silimar-users-in-Netflix-Dataset

Problem 1:

Statement: Clean the Data according to the given constraints and rearrange the data into ‘M’ movie rows and ‘N’ user columns.

Approach:

Results:

Problem 2:

Statement: To analyze random pairs of 10,000 users and calculate the average Jaccard distance, similarity and plot the histogram of pairwise Jaccard distances

Approach:

Results:

Problem 3:

Statement: Find the nearest neighbor pairs for each queried user

Approach:

Min Hashing:

Local Sensitivity Hashing:

Results:

Original Dataset Sample:

Hash Table Sample:

Band Matrix Sample:

Problem 4:

Statement: To detect the pairs of similar users Approach:

Results:

Problem 5:

Statement: Given a new user find the similar user in the dataset Approach:

Results:

Output:

Files

README.md

Latest commit

History

README.md

File metadata and controls

Finding-silimar-users-in-Netflix-Dataset

Problem 1:

Statement: Clean the Data according to the given constraints and rearrange the data into ‘M’ movie rows and ‘N’ user columns.

Approach:

Results:

Problem 2:

Statement: To analyze random pairs of 10,000 users and calculate the average Jaccard distance, similarity and plot the histogram of pairwise Jaccard distances

Approach:

Results:

Problem 3:

Statement: Find the nearest neighbor pairs for each queried user

Approach:

Min Hashing:

Local Sensitivity Hashing:

Results:

Original Dataset Sample:

Hash Table Sample:

Band Matrix Sample:

Problem 4:

Statement: To detect the pairs of similar users Approach:

Results:

Problem 5:

Statement: Given a new user find the similar user in the dataset Approach:

Results:

Output: