Merge pull request #49 from nokaut/dev
Release 1.2.0 update
SimonMolinsky authored Sep 1, 2023
2 parents f719b65 + f0cbf61 commit 6e7e662
Showing 59 changed files with 1,005,247 additions and 166 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,6 +1,7 @@
# Specific folders *
.pytest_cache
.pytest_cache/*
demo-notebooks/demo-data/movielens/ml-25m/*.csv


# OS generated files #
6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -1,10 +1,12 @@
## Version 1.1.2 (2023-07-)
## Version 1.2.0 (2023-09-)

- (docs) changed README example,
- (docs) added `demo-readme` example to `demo-notebooks` section,
- (docs) demo example in documentation updated,
- (docs) link to the documentation page added to README,

- (enhancement) item-sessions map may be derived from the session-items map (see the sketch below),
- (feature) data can be read from the flat record structure,
- (feature) data can be parsed from dataframes
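
For context, deriving an item-sessions map from a session-items map is a plain dictionary inversion. A minimal sketch, independent of the package's actual API (the map structures here are illustrative):

```python
from collections import defaultdict

# Illustrative session-items map: session id -> list of item ids.
session_items = {
    "s1": ["i1", "i2"],
    "s2": ["i2", "i3"],
}

# Invert it into an item-sessions map: item id -> list of session ids.
item_sessions = defaultdict(list)
for session_id, items in session_items.items():
    for item_id in items:
        item_sessions[item_id].append(session_id)

print(dict(item_sessions))
# {'i1': ['s1'], 'i2': ['s1', 's2'], 'i3': ['s2']}
```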

## Version 1.1.1 (2023-07-08)

91 changes: 77 additions & 14 deletions CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Contribution to WSKNN

We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's

* Reporting a bug
* Discussing the current state of the code
@@ -10,19 +10,85 @@ We love your input! We want to make contributing to this project as easy and tra

## Where should I start?

Here, on **GitHub**! We use **GitHub** to host the code, to track issues and feature requests, as well as accept pull requests.
Here, on **GitHub**! We use **GitHub** to host the code, track issues and feature requests, and accept pull requests.

---

## Developer setup

Setup for developers differs from installing the package from `PyPI`.

1. Fork the `wsknn` repository.
2. Clone the forked repository.
3. Connect the main repository with your fork locally:

```shell
git remote add upstream https://github.com/nokaut/wsknn.git

```

4. Synchronize your repository with the core repository.

```shell
git checkout main
git pull upstream main

```

5. Create your branch.

```shell
git checkout -b name-of-your-branch

```

6. Create a [virtual environment](https://docs.python.org/3/library/venv.html) or a [conda environment](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands) (see the sketch below).
7. Activate your environment.
8. Install requirements listed in the `requirements-dev.txt` file.
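
For steps 6 and 7, a minimal sketch using Python's built-in `venv` module (the environment name `.venv` is only an example):

```shell
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
```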

**Virtual Environment**

```shell
>> (your-virtual-environment) pip install -r requirements-dev.txt

```

**Conda**

```shell
>> (your-conda-environment) conda install -c conda-forge --file requirements-dev.txt

```

9. Make changes in the code or write something new.
10. Write tests if required.
11. Run the tests with `pytest` (run them from the `tests` directory).

```shell
>> (your-environment) (your-username:~/path/wsknn/tests) pytest

```

12. If all tests pass, push the changes to your fork.

```shell
git add .
git commit -m "description of what you have done"
git push origin name-of-your-branch

```

13. Navigate to your repository. You should see a button `Compare and open a pull request`. Use it to make a pull request! Send it to the `dev` branch in the main repository. **Don't send pull requests into the `main` branch of the core repository!**

## We Use [Github Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests
Pull requests are the best way to propose changes to the codebase (we use [Github Flow](https://guides.github.com/introduction/flow/index.html)). We actively welcome your pull requests:

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests in the `tests` directory. We use Python's `pytest` package to perform testing.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. Issue that pull request!
4. Ensure the test suite passes. Run the tests from the `tests` directory; otherwise you will encounter an error.
5. Make sure your code lints; you can use `flake8` or the linters included in development tools such as *PyCharm* (see the sketch after this list). Linters check the [PEP8 Python Guidelines](https://www.python.org/dev/peps/pep-0008/), and it is recommended to read them first.
6. Issue that pull request.
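
A minimal linting pass with `flake8` might look as follows (the package path is illustrative):

```shell
pip install flake8
flake8 wsknn/
```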

## Any contributions you make will be under the BSD 3-Clause "New" or "Revised" License
In short, when you submit code changes, your submissions are understood to be under the same BSD 3-Clause "New" or "Revised" License that covers the project. Feel free to contact the maintainers if that's a concern.
@@ -42,10 +108,7 @@ We use GitHub issues to track public bugs. Report a bug by opening a new issue.
- What actually happens
- Notes (possibly including why you think this might be happening, or stuff you tried that didn't work)

People *love* thorough bug reports. I'm not even kidding.

## Use a PEP8 Guidelines
[PEP8 Python Guidelines](https://www.python.org/dev/peps/pep-0008/)
People *love* thorough bug reports.

## License
By contributing, you agree that your contributions will be licensed under the project's BSD 3-Clause "New" or "Revised" License.
@@ -57,9 +120,9 @@ This document was adapted from the open-source contribution guidelines for [Face

1. You have an idea to speed up computation. You plan to use the `multiprocessing` package for it.
2. Fork the repo from the `main` branch and, at the same time, propose the change or open an issue in the [project issues](https://github.com/nokaut/wsknn/issues).
3. Create the new child branch from the forked `main` branch. Name it as `dev-your-idea`. In this case `dev-multiprocessing` is decriptive enough.
3. Create the new child branch from the forked `main` branch. Name it as `dev-your-idea`. In this case, `dev-multiprocessing` is descriptive enough.
4. Code in your branch.
5. Create few unit tests in `tests` directory or re-design actual tests if there is a need. For programming cases write unit tests, for mathematical and logic problems write functional tests. Use data from `tests/tdata` directory.
6. Multiprocessing maybe does not require new tests. But always run unittests in the `tests` directory after any change in the code and check if every test has passed.
7. Run all tutorials (`demo-notebooks`) too. Their role is not only informational. They serve as a functional test playground.
8. If everything is ok make a pull request from your forked repo.
5. Create a few unit tests in the `tests` directory or re-design actual tests if needed. For programming cases, write unit tests; for mathematical and logic problems, write functional tests. Use data from the `tests/tdata` directory.
6. Multiprocessing may not require new tests. But always run `pytest` in the `tests` directory after any change in the code and check if every test has passed.
7. Run all tutorials (`demo-notebooks`). Their role is more than just informational. They serve as a functional test playground.
8. If everything is okay, make a pull request from your forked repo (the whole flow is sketched below).
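
Taken together, steps 3 to 8 might look like this on the command line (a sketch that assumes the `dev-multiprocessing` branch name from the example; the commit message is illustrative):

```shell
git checkout -b dev-multiprocessing
# ... code your changes ...
cd tests
pytest
git add .
git commit -m "speed up computations with multiprocessing"
git push origin dev-multiprocessing
```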
15 changes: 10 additions & 5 deletions README.md
@@ -60,7 +60,7 @@ The model was created along with multiple other approaches: based on RNN (GRU/LS

## What are the limitations of WSKNN?

- model memorizes session-items and item-sessions maps, and if your product base is large and you use sessions for an extended period, then the model may be too big to fit an available memory; in this case, you can
- model memorizes session-items and item-sessions maps, and if your product base is large, and you use sessions for an extended period, then the model may be too big to fit in the available memory; in this case, you can
categorize products and train a different model for each category,
- response time may be slower than from other models, especially if many sessions are available,
- there's additional overhead related to the preparation of the input.
@@ -131,10 +131,15 @@ It works with Python versions greater or equal to 3.8.

## Requirements

| Package Version | Python versions | Requirements |
|-----------------|-----------------|-------------------------------|
| 0.1.x | 3.6+ | numpy, pyyaml |
| 1.x | 3.8+ | numpy, pyyaml, more_itertools |
| Package Version | Python versions | Requirements |
|-----------------|-----------------|---------------------------------------------|
| 0.1.x | 3.6+ | numpy, pyyaml |
| 1.1.x | 3.8+ | numpy, more_itertools, pyyaml |
| 1.2.x | 3.8+ | numpy, more_itertools, pandas, pyyaml, tqdm |

## Contribution

We welcome all submissions, issues, feature requests, and bug reports! To learn how to contribute to the package, please visit the [CONTRIBUTING.md file](https://github.com/nokaut/wsknn/blob/main/CONTRIBUTING.md).

## Developers

157 changes: 157 additions & 0 deletions demo-notebooks/demo-data/movielens/ml-100k/README
@@ -0,0 +1,157 @@
SUMMARY & USAGE LICENSE
=============================================

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.

This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th,
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.

Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set. The data set may be used for any research
purposes under the following conditions:

* The user may not state or imply any endorsement from the
University of Minnesota or the GroupLens Research Group.

* The user must acknowledge the use of the data set in
publications resulting from the use of the data set
(see below for citation information).

* The user may not redistribute the data without separate
permission.

* The user may not use this information for any commercial or
revenue-bearing purposes without first obtaining permission
from a faculty member of the GroupLens Research Project at the
University of Minnesota.

If you have any further questions or comments, please contact GroupLens
<[email protected]>.

CITATION
==============================================

To acknowledge use of the dataset in publications, please cite the
following paper:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets:
History and Context. ACM Transactions on Interactive Intelligent
Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.
DOI=http://dx.doi.org/10.1145/2827872


ACKNOWLEDGEMENTS
==============================================

Thanks to Al Borchers for cleaning up this data and writing the
accompanying scripts.

PUBLISHED WORK THAT HAS USED THIS DATASET
==============================================

Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic
Framework for Performing Collaborative Filtering. Proceedings of the
1999 Conference on Research and Development in Information
Retrieval. Aug. 1999.

FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
==============================================

The GroupLens Research Project is a research group in the Department
of Computer Science and Engineering at the University of Minnesota.
Members of the GroupLens Research Project are involved in many
research projects related to the fields of information filtering,
collaborative filtering, and recommender systems. The project is led
by professors John Riedl and Joseph Konstan. The project began to
explore automated collaborative filtering in 1992, but is most well
known for its worldwide trial of an automated collaborative filtering
system for Usenet news in 1996. The technology developed in the
Usenet trial formed the base for the formation of Net Perceptions,
Inc., which was founded by members of GroupLens Research. Since then
the project has expanded its scope to research overall information
filtering solutions, integrating in content-based methods as well as
improving current collaborative filtering technology.

Further information on the GroupLens Research project, including
research publications, can be found at the following web site:

http://www.grouplens.org/

GroupLens Research currently operates a movie recommender based on
collaborative filtering:

http://www.movielens.org/

DETAILED DESCRIPTIONS OF DATA FILES
==============================================

Here are brief descriptions of the data.

ml-data.tar.gz -- Compressed tar file. To rebuild the u data files do this:
gunzip ml-data.tar.gz
tar xvf ml-data.tar
mku.sh

u.data -- The full u data set, 100000 ratings by 943 users on 1682 items.
Each user has rated at least 20 movies. Users and items are
numbered consecutively from 1. The data is randomly
ordered. This is a tab separated list of
user id | item id | rating | timestamp.
The time stamps are unix seconds since 1/1/1970 UTC

u.info -- The number of users, items, and ratings in the u data set.

u.item -- Information about the items (movies); this is a tab separated
list of
movie id | movie title | release date | video release date |
IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western |
The last 19 fields are the genres, a 1 indicates the movie
is of that genre, a 0 indicates it is not; movies can be in
several genres at once.
The movie ids are the ones used in the u.data data set.

u.genre -- A list of the genres.

u.user -- Demographic information about the users; this is a tab
separated list of
user id | age | gender | occupation | zip code
The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.

u1.base -- The data sets u1.base and u1.test through u5.base and u5.test
u1.test are 80%/20% splits of the u data into training and test data.
u2.base Each of u1, ..., u5 have disjoint test sets; this is for
u2.test 5 fold cross validation (where you repeat your experiment
u3.base with each training and test set and average the results).
u3.test These data sets can be generated from u.data by mku.sh.
u4.base
u4.test
u5.base
u5.test

ua.base -- The data sets ua.base, ua.test, ub.base, and ub.test
ua.test split the u data into a training set and a test set with
ub.base exactly 10 ratings per user in the test set. The sets
ub.test ua.test and ub.test are disjoint. These data sets can
be generated from u.data by mku.sh.

allbut.pl -- The script that generates training and test sets where
all but n of a user's ratings are in the training data.

mku.sh -- A shell script to generate all the u data sets from u.data.
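
A minimal sketch of loading `u.data` in Python, following the tab-separated layout described above (the file path is illustrative; this snippet is not part of the original dataset README):

```python
import pandas as pd

# u.data layout: user id | item id | rating | timestamp, tab-separated, no header.
ratings = pd.read_csv(
    "ml-100k/u.data",
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)
print(ratings.head())
```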
34 changes: 34 additions & 0 deletions demo-notebooks/demo-data/movielens/ml-100k/allbut.pl
@@ -0,0 +1,34 @@
#!/usr/local/bin/perl

# get args
if (@ARGV < 4) {
print STDERR "Usage: $0 base_name start stop max_test [ratings ...]\n";
exit 1;
}
$basename = shift;
$start = shift;
$stop = shift;
$maxtest = shift;

# open files
open( TESTFILE, ">$basename.test" ) or die "Cannot open $basename.test for writing\n";
open( BASEFILE, ">$basename.base" ) or die "Cannot open $basename.base for writing\n";

# init variables
$testcnt = 0;

while (<>) {
($user) = split;
if (! defined $ratingcnt{$user}) {
$ratingcnt{$user} = 0;
}
++$ratingcnt{$user};
if (($testcnt < $maxtest || $maxtest <= 0)
&& $ratingcnt{$user} >= $start && $ratingcnt{$user} <= $stop) {
++$testcnt;
print TESTFILE;
}
else {
print BASEFILE;
}
}