Merge pull request #49 from nokaut/dev
Release 1.2.0 update
SimonMolinsky authored Sep 1, 2023
2 parents f719b65 + f0cbf61 commit 6e7e662
Showing 59 changed files with 1,005,247 additions and 166 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,6 +1,7 @@
# Specific folders *
.pytest_cache
.pytest_cache/*
demo-notebooks/demo-data/movielens/ml-25m/*.csv


# OS generated files #
6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -1,10 +1,12 @@
## Version 1.1.2 (2023-07-)
## Version 1.2.0 (2023-09-)

- (docs) changed README example,
- (docs) added `demo-readme` example to `demo-notebooks` section,
- (docs) demo example in documentation updated,
- (docs) link to the documentation page added to README,

- (enhancement) item-sessions map may be derived from the session-items map (see the sketch below),
- (feature) data can be read from the flat record structure,
- (feature) data can be parsed from dataframes
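
For context, deriving an item-sessions map from a session-items map is a plain dictionary inversion. A minimal sketch, independent of the package's actual API (the map structures here are illustrative):

```python
from collections import defaultdict

# Illustrative session-items map: session id -> list of item ids.
session_items = {
    "s1": ["i1", "i2"],
    "s2": ["i2", "i3"],
}

# Invert it into an item-sessions map: item id -> list of session ids.
item_sessions = defaultdict(list)
for session_id, items in session_items.items():
    for item_id in items:
        item_sessions[item_id].append(session_id)

print(dict(item_sessions))
# {'i1': ['s1'], 'i2': ['s1', 's2'], 'i3': ['s2']}
```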

## Version 1.1.1 (2023-07-08)

91 changes: 77 additions & 14 deletions CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Contribution to WSKNN

We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's

* Reporting a bug
* Discussing the current state of the code
@@ -10,19 +10,85 @@ We love your input! We want to make contributing to this project as easy and tra

## Where should I start?

Here, on **GitHub**! We use **GitHub** to host the code, to track issues and feature requests, as well as accept pull requests.
Here, on **GitHub**! We use **GitHub** to host the code, track issues and feature requests, and accept pull requests.

---

## Developer setup

Setup for developers differs from installing the package from `PyPI`.

1. Fork the `wsknn` repository.
2. Clone the forked repository.
3. Connect the main repository with your fork locally:

```shell
git remote add upstream https://github.com/nokaut/wsknn.git

```

4. Synchronize your repository with the core repository.

```shell
git checkout main
git pull upstream main

```

5. Create your branch.

```shell
git checkout -b name-of-your-branch

```

6. Create a [virtual environment](https://docs.python.org/3/library/venv.html) or a [conda environment](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands) (see the sketch below).
7. Activate your environment.
8. Install requirements listed in the `requirements-dev.txt` file.
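
For steps 6 and 7, a minimal sketch using Python's built-in `venv` module (the environment name `.venv` is only an example):

```shell
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
```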

**Virtual Environment**

```shell
>> (your-virtual-environment) pip install -r requirements-dev.txt

```

**Conda**

```shell
>> (your-conda-environment) conda install -c conda-forge --file requirements-dev.txt

```

9. Make changes in the code or write something new.
10. Write tests if required.
11. Run the tests with `pytest` (run them from the `tests` directory).

```shell
>> (your-environment) (your-username:~/path/wsknn/tests) pytest

```

12. If all tests pass, push the changes to your fork.

```shell
git add .
git commit -m "description of what you have done"
git push origin name-of-your-branch

```

13. Navigate to your repository. You should see a button `Compare and open a pull request`. Use it to make a pull request! Send it to the `dev` branch in the main repository. **Don't send pull requests into the `main` branch of the core repository!**

## We Use [Github Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests
Pull requests are the best way to propose changes to the codebase (we use [Github Flow](https://guides.github.com/introduction/flow/index.html)). We actively welcome your pull requests:

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests in the `tests` directory. We use Python's `pytest` package to perform testing.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. Issue that pull request!
4. Ensure the test suite passes. Run the tests from the `tests` directory; otherwise you will encounter an error.
5. Make sure your code lints; you can use `flake8` or the linters included in development tools such as *PyCharm* (see the sketch after this list). Linters check the [PEP8 Python Guidelines](https://www.python.org/dev/peps/pep-0008/), and it is recommended to read them first.
6. Issue that pull request.
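
A minimal linting pass with `flake8` might look as follows (the package path is illustrative):

```shell
pip install flake8
flake8 wsknn/
```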

## Any contributions you make will be under the BSD 3-Clause "New" or "Revised" License
In short, when you submit code changes, your submissions are understood to be under the same BSD 3-Clause "New" or "Revised" License that covers the project. Feel free to contact the maintainers if that's a concern.
@@ -42,10 +108,7 @@ We use GitHub issues to track public bugs. Report a bug by opening a new issue.
- What actually happens
- Notes (possibly including why you think this might be happening, or stuff you tried that didn't work)

People *love* thorough bug reports. I'm not even kidding.

## Use a PEP8 Guidelines
[PEP8 Python Guidelines](https://www.python.org/dev/peps/pep-0008/)
People *love* thorough bug reports.

## License
By contributing, you agree that your contributions will be licensed under the project's BSD 3-Clause "New" or "Revised" License.
@@ -57,9 +120,9 @@ This document was adapted from the open-source contribution guidelines for [Face

1. You have an idea to speed up computation. You plan to use the `multiprocessing` package for it.
2. Fork the repo from the `main` branch and, at the same time, propose the change or open an issue in the [project issues](https://github.com/nokaut/wsknn/issues).
3. Create the new child branch from the forked `main` branch. Name it as `dev-your-idea`. In this case `dev-multiprocessing` is decriptive enough.
3. Create the new child branch from the forked `main` branch. Name it as `dev-your-idea`. In this case, `dev-multiprocessing` is descriptive enough.
4. Code in your branch.
5. Create few unit tests in `tests` directory or re-design actual tests if there is a need. For programming cases write unit tests, for mathematical and logic problems write functional tests. Use data from `tests/tdata` directory.
6. Multiprocessing maybe does not require new tests. But always run unittests in the `tests` directory after any change in the code and check if every test has passed.
7. Run all tutorials (`demo-notebooks`) too. Their role is not only informational. They serve as a functional test playground.
8. If everything is ok make a pull request from your forked repo.
5. Create a few unit tests in the `tests` directory or re-design actual tests if needed. For programming cases, write unit tests; for mathematical and logic problems, write functional tests. Use data from the `tests/tdata` directory.
6. Multiprocessing may not require new tests. But always run `pytest` in the `tests` directory after any change in the code and check if every test has passed.
7. Run all tutorials (`demo-notebooks`). Their role is more than just informational. They serve as a functional test playground.
8. If everything is okay, make a pull request from your forked repo (the whole flow is sketched below).
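
Taken together, steps 3 to 8 might look like this on the command line (a sketch that assumes the `dev-multiprocessing` branch name from the example; the commit message is illustrative):

```shell
git checkout -b dev-multiprocessing
# ... code your changes ...
cd tests
pytest
git add .
git commit -m "speed up computations with multiprocessing"
git push origin dev-multiprocessing
```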
15 changes: 10 additions & 5 deletions README.md
@@ -60,7 +60,7 @@ The model was created along with multiple other approaches: based on RNN (GRU/LS

## What are the limitations of WSKNN?

- model memorizes session-items and item-sessions maps, and if your product base is large and you use sessions for an extended period, then the model may be too big to fit an available memory; in this case, you can
- model memorizes session-items and item-sessions maps, and if your product base is large, and you use sessions for an extended period, then the model may be too big to fit in the available memory; in this case, you can
categorize products and train a different model for each category,
- response time may be slower than from other models, especially if many sessions are available,
- there's additional overhead related to the preparation of the input.
@@ -131,10 +131,15 @@ It works with Python versions greater or equal to 3.8.

## Requirements

| Package Version | Python versions | Requirements |
|-----------------|-----------------|-------------------------------|
| 0.1.x | 3.6+ | numpy, pyyaml |
| 1.x | 3.8+ | numpy, pyyaml, more_itertools |
| Package Version | Python versions | Requirements |
|-----------------|-----------------|---------------------------------------------|
| 0.1.x | 3.6+ | numpy, pyyaml |
| 1.1.x | 3.8+ | numpy, more_itertools, pyyaml |
| 1.2.x | 3.8+ | numpy, more_itertools, pandas, pyyaml, tqdm |

## Contribution

We welcome all submissions, issues, feature requests, and bug reports! To learn how to contribute to the package, please visit the [CONTRIBUTING.md file](https://github.com/nokaut/wsknn/blob/main/CONTRIBUTING.md).

## Developers

157 changes: 157 additions & 0 deletions demo-notebooks/demo-data/movielens/ml-100k/README
@@ -0,0 +1,157 @@
SUMMARY & USAGE LICENSE
=============================================

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.

This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th,
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.

Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set. The data set may be used for any research
purposes under the following conditions:

* The user may not state or imply any endorsement from the
University of Minnesota or the GroupLens Research Group.

* The user must acknowledge the use of the data set in
publications resulting from the use of the data set
(see below for citation information).

* The user may not redistribute the data without separate
permission.

* The user may not use this information for any commercial or
revenue-bearing purposes without first obtaining permission
from a faculty member of the GroupLens Research Project at the
University of Minnesota.

If you have any further questions or comments, please contact GroupLens
<[email protected]>.

CITATION
==============================================

To acknowledge use of the dataset in publications, please cite the
following paper:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets:
History and Context. ACM Transactions on Interactive Intelligent
Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.
DOI=http://dx.doi.org/10.1145/2827872


ACKNOWLEDGEMENTS
==============================================

Thanks to Al Borchers for cleaning up this data and writing the
accompanying scripts.

PUBLISHED WORK THAT HAS USED THIS DATASET
==============================================

Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic
Framework for Performing Collaborative Filtering. Proceedings of the
1999 Conference on Research and Development in Information
Retrieval. Aug. 1999.

FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
==============================================

The GroupLens Research Project is a research group in the Department
of Computer Science and Engineering at the University of Minnesota.
Members of the GroupLens Research Project are involved in many
research projects related to the fields of information filtering,
collaborative filtering, and recommender systems. The project is led
by professors John Riedl and Joseph Konstan. The project began to
explore automated collaborative filtering in 1992, but is most well
known for its worldwide trial of an automated collaborative filtering
system for Usenet news in 1996. The technology developed in the
Usenet trial formed the base for the formation of Net Perceptions,
Inc., which was founded by members of GroupLens Research. Since then
the project has expanded its scope to research overall information
filtering solutions, integrating in content-based methods as well as
improving current collaborative filtering technology.

Further information on the GroupLens Research project, including
research publications, can be found at the following web site:

http://www.grouplens.org/

GroupLens Research currently operates a movie recommender based on
collaborative filtering:

http://www.movielens.org/

DETAILED DESCRIPTIONS OF DATA FILES
==============================================

Here are brief descriptions of the data.

ml-data.tar.gz -- Compressed tar file. To rebuild the u data files do this:
gunzip ml-data.tar.gz
tar xvf ml-data.tar
mku.sh

u.data -- The full u data set, 100000 ratings by 943 users on 1682 items.
Each user has rated at least 20 movies. Users and items are
numbered consecutively from 1. The data is randomly
ordered. This is a tab separated list of
user id | item id | rating | timestamp.
The time stamps are unix seconds since 1/1/1970 UTC

u.info -- The number of users, items, and ratings in the u data set.

u.item -- Information about the items (movies); this is a tab separated
list of
movie id | movie title | release date | video release date |
IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western |
The last 19 fields are the genres, a 1 indicates the movie
is of that genre, a 0 indicates it is not; movies can be in
several genres at once.
The movie ids are the ones used in the u.data data set.

u.genre -- A list of the genres.

u.user -- Demographic information about the users; this is a tab
separated list of
user id | age | gender | occupation | zip code
The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.

u1.base -- The data sets u1.base and u1.test through u5.base and u5.test
u1.test are 80%/20% splits of the u data into training and test data.
u2.base Each of u1, ..., u5 have disjoint test sets; this is for
u2.test 5 fold cross validation (where you repeat your experiment
u3.base with each training and test set and average the results).
u3.test These data sets can be generated from u.data by mku.sh.
u4.base
u4.test
u5.base
u5.test

ua.base -- The data sets ua.base, ua.test, ub.base, and ub.test
ua.test split the u data into a training set and a test set with
ub.base exactly 10 ratings per user in the test set. The sets
ub.test ua.test and ub.test are disjoint. These data sets can
be generated from u.data by mku.sh.

allbut.pl -- The script that generates training and test sets where
all but n of a user's ratings are in the training data.

mku.sh -- A shell script to generate all the u data sets from u.data.
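
A minimal sketch of loading `u.data` in Python, following the tab-separated layout described above (the file path is illustrative; this snippet is not part of the original dataset README):

```python
import pandas as pd

# u.data layout: user id | item id | rating | timestamp, tab-separated, no header.
ratings = pd.read_csv(
    "ml-100k/u.data",
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)
print(ratings.head())
```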
34 changes: 34 additions & 0 deletions demo-notebooks/demo-data/movielens/ml-100k/allbut.pl
@@ -0,0 +1,34 @@
#!/usr/local/bin/perl

# get args
if (@ARGV < 4) {
print STDERR "Usage: $0 base_name start stop max_test [ratings ...]\n";
exit 1;
}
$basename = shift;
$start = shift;
$stop = shift;
$maxtest = shift;

# open files
open( TESTFILE, ">$basename.test" ) or die "Cannot open $basename.test for writing\n";
open( BASEFILE, ">$basename.base" ) or die "Cannot open $basename.base for writing\n";

# init variables
$testcnt = 0;

while (<>) {
($user) = split;
if (! defined $ratingcnt{$user}) {
$ratingcnt{$user} = 0;
}
++$ratingcnt{$user};
if (($testcnt < $maxtest || $maxtest <= 0)
&& $ratingcnt{$user} >= $start && $ratingcnt{$user} <= $stop) {
++$testcnt;
print TESTFILE;
}
else {
print BASEFILE;
}
}