These codes accompany my presentation on Matrix Methods in Hadoop at the BIGDATA Techcon in Boston, MA in April 2013. I suspect they'll be used in other presentations as well.
The goal in these slides is to demonstrate how to implement simple matrix computations in Hadoop using Yelp's mrjob system.
- Sparse matrix-vector products
- Matrix-matrix products
- A recommender system for epinions data
-
Get mrjob working. Nothing here will require an actual MapReduce cluster, but feel free to use one if you wish! I setup a virtualenv for this and use pip.
mkdir envs virtualenv envs/mrjob source envs/mrjob/bin/activate pip install mrjob
-
Get the datasets for the recommender system
make getdata
-
Sparse matrix-vector products
python codes/smatvec.py samples/smat_10_5_A.txt samples/vec_5.txt # Compare the output to a non-MR computation python codes/test_smatvec.py samples/smat_10_5_A.txt samples/vec_5.txt
-
Sparse matrix-matrix products
python codes/matmat.py samples/smat_10_5_A.txt samples/smat_5_5.txt # Compare the output to a non-MR computation python codes/test_smatmat.py samples/smat_10_5_A.txt samples/smat_5_5.txt
Warning, this actually takes a while. I'm not sure where the bottle-neck is, but
python recsys/recsys.py data/rating.txt.gz data/user_ratings.txt.gz