Merge pull request #19 from emirkmo/collab_filter
Add Course 3 notes and code
emirkmo authored Nov 19, 2023
2 parents 0991b9c + dbeabc1 commit 28d0a9b
Showing 5 changed files with 177 additions and 1 deletion.
12 changes: 12 additions & 0 deletions Course3/Notes/collab_filter.md
@@ -0,0 +1,12 @@
# Collaborative Filtering Algorithm

Learn both the feature vectors X and the per-user (linear regression) parameters W and b.
Which users (samples) rated which items, i.e. have parameters for a given feature sample, is kept track of in
a binary matrix R. Matrix Y holds the ratings. Features X and parameters W and b must be learned collaboratively.

Features = X
User pars = w, b
R = mapping between users and movie ratings
Y = movie ratings

Y(movie, user) = R(movie, user) * (w(user) . x(movie) + b(user))
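A minimal numpy sketch of the prediction above (the shapes and toy values are illustrative, not from the course):

```Python
import numpy as np

# Toy setup: 3 movies, 2 users, 2 learned features per movie/user.
X = np.array([[1.0, 0.1], [0.2, 1.0], [0.9, 0.8]])  # movie features (num_movies, num_features)
W = np.array([[5.0, 0.0], [0.0, 5.0]])              # user parameters (num_users, num_features)
b = np.array([[0.5, -0.5]])                          # per-user constant (1, num_users)
R = np.array([[1, 0], [0, 1], [1, 1]])               # R(movie, user) = 1 if the user rated the movie

# Y_hat(movie, user) = R(movie, user) * (w(user) . x(movie) + b(user))
Y_hat = R * (X @ W.T + b)
print(Y_hat)
```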
119 changes: 119 additions & 0 deletions Course3/Notes/content_filter.md
@@ -0,0 +1,119 @@
# Content-based filtering

## Difference from collaborative filtering

Learn to match features instead of learning
per-user parameters on item features.

So users have features and movies have features:
create a vector from each feature set, then predict the user/movie
rating match (recommend a movie to a user, or predict a user's score for a movie).

No constant vector `b`.

The prediction is the dot product `V_M . V_U`. Both vectors must be calculated from the raw feature vectors.

### How to calculate V? Use deep learning (neural network NN)

The NN output layer should not have a single unit, but many
(one unit per vector element). How many? Something like 32. The hidden layers can be of any complexity, but the output layers producing `V_M` and `V_U` must have matching sizes!

For binary labels, instead of the raw dot product, simply apply a sigmoid to
the dot product of V_U and V_M, and predict a match where g(V_U . V_M) ≈ 1.

## Cost Function

```latex
J = \sum_{(i,j):\, r(i,j)=1} \left( v_u^{(j)} \cdot v_m^{(i)} - y^{(i,j)} \right)^2 + \text{NN regularization}
```

Basically we need labels Y, i.e. existing movie/user ratings (matches).
The same cost function trains the NNs for both vectors.
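A tiny numpy sketch of this cost for already-computed vectors (the mask `R` and the simple L2 term standing in for the NN regularization are assumptions consistent with the notes):

```Python
import numpy as np

def content_cost(V_u, V_m, Y, R, lam=0.0):
    """Squared error over rated (i, j) pairs, plus a stand-in regularization term."""
    err = (V_m @ V_u.T - Y) * R                    # only count pairs with r(i, j) = 1
    reg = lam * (np.sum(V_u**2) + np.sum(V_m**2))  # stand-in for the NN weight regularization
    return np.sum(err**2) + reg

rng = np.random.default_rng(0)
V_m, V_u = rng.normal(size=(3, 4)), rng.normal(size=(2, 4))  # 3 movies, 2 users, 4-dim vectors
Y = np.array([[5.0, 0.0], [0.0, 3.0], [4.0, 2.0]])
R = np.array([[1, 0], [0, 1], [1, 1]])
print(content_cost(V_u, V_m, Y, R, lam=0.1))
```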

### Tips

To find similar movies, take the squared L2 distance between their vectors, `||v_m(k) - v_m(i)||^2`.
This can and should be pre-computed!
Now you have a similarity matrix. Movies are related like
a graph.
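A minimal sketch of pre-computing that distance matrix and reading off the most similar movies (pure numpy; names and sizes are illustrative):

```Python
import numpy as np

def pairwise_sq_dists(V_m):
    """Squared L2 distance between every pair of movie vectors."""
    sq_norms = np.sum(V_m**2, axis=1)
    return sq_norms[:, None] + sq_norms[None, :] - 2 * (V_m @ V_m.T)

V_m = np.random.default_rng(1).normal(size=(5, 4))  # 5 movies, 4-dim vectors
D = pairwise_sq_dists(V_m)
np.fill_diagonal(D, np.inf)                          # a movie should not match itself
most_similar = np.argsort(D, axis=1)[:, :2]          # the 2 nearest neighbours of each movie
print(most_similar)
```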

NN benefit realized: it allows easily integrating the movie and
user NNs by taking the dot product of the output layer of each.
Really powerful!

The feature engineering is critical.

The algorithm as described is computationally expensive to run,
and needs modifications to scale.

## Scale up Recommender system

Retrieval & Ranking

### Retrieval

Generate a large list of plausible item candidates.

Use the pre-computed `||V_m(k) - V_m(j)||^2`.

Find similar movies, the user's top 3 most-viewed genres, top movies of
all time, top X movies in the same country, etc.

### Ranking

Now we have a small list of movies; rank them.
V_m can be pre-computed (new users appear and user
feature values change far more often than item vectors).
We only need to compute V_u and score the pared-down list
from the retrieval step, which is fast. It can even be done on the edge.
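A rough sketch of the two-stage flow (the candidate sources, names, and data structures here are illustrative assumptions, not from the course):

```Python
import numpy as np

def retrieve(user_id, neighbors, recently_watched, top_movies, k=100):
    """Retrieval: cheaply assemble a large candidate list from pre-computed sources."""
    candidates = set(top_movies)
    for movie_id in recently_watched[user_id]:
        candidates.update(neighbors[movie_id])  # similar movies via pre-computed ||V_m(k) - V_m(j)||^2
    return list(candidates)[:k]

def rank(user_features, candidates, user_nn, V_m):
    """Ranking: one V_u forward pass, then dot products against pre-computed V_m rows."""
    v_u = user_nn(user_features)    # small, fast computation (can run on the edge)
    scores = V_m[candidates] @ v_u  # one score per retrieved candidate
    return [candidates[i] for i in np.argsort(-scores)]
```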

Retrieval step should be tuned using offline experiments
and A/B testing, etc.

## Ethics

Don't be evil. Don't be naive.
Think about goal. Think about bad actors.

Be transparent with users. Need to be careful with exploitative recommendations.

## Tensorflow Recommender Algorithm

Same as a normal NN: a Sequential model from Keras for each of the user and item networks, combined with a dot product.

```Python
import tensorflow as tf

user_nn = tf.keras.models.Sequential([tf.keras.layers.Dense(..., activation='relu'), ...])
...

# Add the input layer for the user features
input_user = tf.keras.layers.Input(shape=(num_user_features,))

vu = user_nn(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)  # L2-normalize the output vector
# Repeat for the item/movie network
input_item = tf.keras.layers.Input(shape=(num_item_features,))
vm = ...

# Keras dot product layer combines the two output vectors
output = tf.keras.layers.Dot(axes=1)([vu, vm])

# Use simple MSE for the loss
cost_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

# Build and train the model using the Keras API
n_iterations = 30
model = tf.keras.Model([input_user, input_item], output)
model.compile(optimizer=optimizer, loss=cost_fn)
model.fit([user_train, item_train], y_train, epochs=n_iterations)
```

### Lab

Uses sklearn's StandardScaler for the user features but MinMaxScaler for the target. Not clear why. Uses the scaler's `inverse_transform` to get back the original values. Ready-made `train_test_split` for the split, with a 20% test set.
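A minimal sketch of that preprocessing (variable names and array shapes are illustrative; the lab's actual data differs):

```Python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

user_features = np.random.rand(100, 5)    # stand-in for the lab's user feature matrix
ratings = np.random.rand(100, 1) * 4 + 1  # stand-in target: ratings in [1, 5]

feature_scaler = StandardScaler().fit(user_features)
target_scaler = MinMaxScaler().fit(ratings)

X = feature_scaler.transform(user_features)
y = target_scaler.transform(ratings)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# After prediction, the scaler's inverse_transform maps outputs back to the original rating scale.
y_orig = target_scaler.inverse_transform(y_test)
```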

Based on the fact that the test loss is similar to the training
loss, we infer that the model has not substantially overfit.
(Weird not to use a CV set, but the model parameters and parts were
just given, so there was no need.)
11 changes: 11 additions & 0 deletions Course3/Notes/pca.md
@@ -0,0 +1,11 @@
# PCA

Each principal component is a projection that "explains" the maximum remaining variance.
PCA used to be popular for dimensionality reduction and compression,
especially during training or for feature selection,
but nowadays it is mainly used for visualization in AI/ML.

Look into eigenvectors and eigenvalues for a deeper understanding.

Just use sklearn.
I published a paper on this, so no need for more.
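Since the note says "just use sklearn", a minimal example with random data, keeping 2 components for visualization:

```Python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 10))  # 200 samples, 10 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)             # project onto the top 2 principal components

print(pca.explained_variance_ratio_)    # fraction of variance each PC "explains"
print(X_2d.shape)                       # (200, 2), ready to scatter-plot
```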
5 changes: 4 additions & 1 deletion pyproject.toml
@@ -66,4 +66,7 @@ exclude = '''
| htmlcov
| .coverage
)/
'''
'''

[tool.mypy]
plugins = "numpy.typing.mypy_plugin"
31 changes: 31 additions & 0 deletions rawsight/recommender/collaborative_filtering.py
@@ -0,0 +1,31 @@
import numpy as np
import numpy.typing as npt


def cofi_cost_func(
X: npt.NDArray[np.number],
W: npt.NDArray[np.number],
b: npt.NDArray[np.number],
Y: npt.NDArray[np.number],
R: npt.NDArray[np.number],
lam: float,
) -> float:
"""Return cost with regularization using numpy for collaborative learning
Args:
X np(num_feature_samples, num_features)): matrix of feature samples
W np(num_parameter_samples, num_features)) : matrix of parameter samples
b np(1, num_parameter_samples) : constant parameter vector per param sample.
Y np(num_feature_samples,num_parameter_samples) : matrix of pars per feature sample
R np(num_feature_samples,num_parameter_samples) : R(i, j) = 1 if feature sample has parameters.
lam (float): regularization parameter
Simples example is X features of movies and W is features of user ratings (for movies)
Y is matrix of user ratings for each movie and R just records if a user rated a movie.
"""
# Regularization is simple and applies to all values
regularization: float = (np.sum(W**2) + np.sum(X**2)) * (lam / 2)

# Linear regression analog vectorized implementation.
cost: float = np.sum((R * (np.dot(X, W.T) + b - Y)) ** 2) / 2

return cost + regularization
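A quick sanity check of `cofi_cost_func` on a toy example (the values below are made up for illustration):

```Python
import numpy as np

from rawsight.recommender.collaborative_filtering import cofi_cost_func

X = np.array([[1.0, 0.2], [0.5, 1.0], [0.9, 0.7]])  # 3 movies, 2 features each
W = np.array([[4.0, 0.0], [0.0, 4.0]])              # 2 users, 2 parameters each
b = np.array([[0.5, -0.5]])                          # per-user constant parameter
Y = np.array([[4.5, 0.0], [0.0, 3.5], [4.0, 2.0]])  # ratings (0 where unrated)
R = np.array([[1, 0], [0, 1], [1, 1]])               # R(i, j) = 1 if user j rated movie i

print(cofi_cost_func(X, W, b, Y, R, lam=1.0))        # scalar cost including regularization
```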

1 comment on commit 28d0a9b

@github-actions


Coverage Report
| File | Stmts | Miss | Cover |
| --- | --- | --- | --- |
| **rawsight** | | | |
| `__init__.py` | 9 | 0 | 100% |
| `input_validation.py` | 14 | 4 | 71% |
| `normalization.py` | 95 | 21 | 78% |
| `optimizers.py` | 41 | 15 | 63% |
| `regression.py` | 97 | 21 | 78% |
| `scoring.py` | 3 | 1 | 67% |
| **rawsight/cost_functions** | | | |
| `__init__.py` | 2 | 0 | 100% |
| `cost_function_factory.py` | 55 | 8 | 85% |
| `cost_functions.py` | 40 | 7 | 82% |
| `regularization.py` | 23 | 3 | 87% |
| **rawsight/datasets** | | | |
| `__init__.py` | 1 | 0 | 100% |
| `datasets.py` | 94 | 39 | 59% |
| **rawsight/models** | | | |
| `__init__.py` | 4 | 0 | 100% |
| `linear.py` | 23 | 3 | 87% |
| `logistic.py` | 27 | 4 | 85% |
| `model.py` | 112 | 34 | 70% |
| `polynomial.py` | 11 | 8 | 27% |
| `softmax.py` | 34 | 18 | 47% |
| **rawsight/nn** | | | |
| `__init__.py` | 0 | 0 | 100% |
| `layers.py` | 57 | 57 | 0% |
| `networks.py` | 28 | 28 | 0% |
| **rawsight/tests** | | | |
| `__init__.py` | 0 | 0 | 100% |
| `test_binary_tree.py` | 26 | 0 | 100% |
| `test_linear_regression.py` | 64 | 3 | 95% |
| `test_logistic_regression.py` | 56 | 4 | 93% |
| `test_normalization.py` | 45 | 5 | 89% |
| `test_softmax.py` | 43 | 16 | 63% |
| **rawsight/trees** | | | |
| `__init__.py` | 4 | 0 | 100% |
| `_splitter_protocol.py` | 6 | 1 | 83% |
| `binary_tree.py` | 29 | 0 | 100% |
| `infogain.py` | 29 | 2 | 93% |
| `splitting.py` | 30 | 1 | 97% |
| `tree.py` | 5 | 0 | 100% |
| `tree_builder.py` | 47 | 1 | 98% |
| **TOTAL** | 1154 | 304 | 74% |
