This directory provides a notebook that illustrates how to evaluate recommendation models using the various performance measures available in reco_utils.
| Notebook | Description |
|---|---|
| evaluation | Examples of different rating and ranking metrics in Python+CPU and PySpark environments. |
Two approaches for evaluating model performance are demonstrated along with their respective metrics.
- Rating Metrics: These are used to evaluate how accurate a recommender is at predicting ratings that users gave to items (a small worked example follows this list)
  - Root Mean Square Error (RMSE) - measure of average error in predicted ratings
  - R Squared (R2) - essentially how much of the total variation is explained by the model
  - Mean Absolute Error (MAE) - similar to RMSE but uses the absolute value of errors instead of squaring them and taking the root of the average
  - Explained Variance - how much of the variance in the data is explained by the model
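To make these rating metrics concrete, below is a minimal NumPy sketch that computes all four on a handful of made-up predictions. The notebook relies on the reco_utils implementations, which operate on full DataFrames of users, items, and ratings, but the underlying arithmetic is the same; the arrays here are purely illustrative.

```python
import numpy as np

# Toy data: true ratings vs. ratings predicted by a recommender (illustrative values only)
y_true = np.array([3.0, 5.0, 4.0, 2.0, 1.0])
y_pred = np.array([2.5, 4.8, 3.5, 2.2, 1.5])

# Root Mean Square Error: square root of the mean squared error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Mean Absolute Error: mean of the absolute errors
mae = np.mean(np.abs(y_true - y_pred))

# R Squared: 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained Variance: 1 - (variance of the residuals / variance of the true ratings)
exp_var = 1 - np.var(y_true - y_pred) / np.var(y_true)

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}  Explained Variance={exp_var:.3f}")
```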
- Ranking Metrics: These are used to evaluate how relevant recommendations are for users (see the sketch after this list)
  - Precision - this measures the proportion of recommended items that are relevant
  - Recall - this measures the proportion of relevant items that are recommended
  - Normalized Discounted Cumulative Gain (NDCG) - evaluates how well the predicted items for a user are ranked based on relevance
  - Mean Average Precision (MAP) - average precision for each user normalized over all users
  - Area Under Curve (AUC) - integral area under the receiver operating characteristic curve
  - Logistic loss (Logloss) - the negative log-likelihood of the true labels given the predictions of a classifier
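As a rough illustration of the top-k ranking metrics, the sketch below computes Precision@k, Recall@k, and NDCG@k for a single user with binary relevance. The item ids and the value of k are made up for illustration; the reco_utils implementations used in the notebook compute the same quantities aggregated over all users.

```python
import math

# Toy data for a single user (item ids are made up for illustration)
k = 5
recommended = ["item_1", "item_7", "item_3", "item_9", "item_4"]   # top-k list, best first
relevant = {"item_3", "item_4", "item_8", "item_10"}               # ground-truth relevant items

hits = [item for item in recommended[:k] if item in relevant]

# Precision@k: proportion of the k recommended items that are relevant
precision_at_k = len(hits) / k

# Recall@k: proportion of the relevant items that appear in the recommendations
recall_at_k = len(hits) / len(relevant)

# NDCG@k with binary relevance: discounted gain of the hits at their ranks,
# normalized by the gain of an ideal ranking that puts all relevant items first
dcg = sum(1.0 / math.log2(rank + 2)
          for rank, item in enumerate(recommended[:k]) if item in relevant)
idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
ndcg_at_k = dcg / idcg

print(f"Precision@{k}={precision_at_k:.2f}  Recall@{k}={recall_at_k:.2f}  NDCG@{k}={ndcg_at_k:.2f}")
```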
References:
- Asela Gunawardana and Guy Shani, "A Survey of Accuracy Evaluation Metrics of Recommendation Tasks", Journal of Machine Learning Research, 2009.
- Dimitris Paraschakis et al., "Comparative Evaluation of Top-N Recommenders in e-Commerce: An Industrial Perspective", IEEE ICMLA, 2015, Miami, FL, USA.
- Yehuda Koren and Robert Bell, "Advances in Collaborative Filtering", Recommender Systems Handbook, Springer, 2015.
- Chris Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.