This repository lists public, compatible datasets for recommender systems, points to other major repositories that host newer and popular real-world datasets, and references sample code for the respective recommendation tasks. Most of the datasets are intended for non-commercial use by academics, such as faculty, university researchers, and other scientists. The datasets are free; however, some may ask for a citation.
In addition, a few links point to sample code from existing works by their respective authors. Before using these datasets, please review their sites and/or README files for usage licenses, acknowledgments, and other details, as a few datasets have additional citation requests. These requests can be found at the bottom of each dataset's web page.
Name: Jamell Dacon
Email: daconjam at msu dot edu
If you publish material based on material and/or information obtained from this repository, please note in your acknowledgements the assistance you received from this repository by citing our paper as shown below. Also feel free to star and/or fork the repository so that academics (i.e., university researchers, faculty, and other scientists) may have quicker access to the available datasets. This will help direct others to the same datasets, allowing experiments to be replicated and improved.
Personal Page: Portfolio
Lab Page: DSELab@MSU
Here is a BibTeX citation:
@inbook{10.1145/3442442.3452325,
  author = {Dacon, Jamell and Liu, Haochen},
  title = {Does Gender Matter in the News? Detecting and Examining Gender Bias in News Articles},
  year = {2021},
  isbn = {9781450383134},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3442442.3452325},
  abstract = {To attract unsuspecting readers, news article headlines and abstracts are often written with speculative sentences or clauses. Male dominance in the news is very evident, whereas females are seen as “eye candy” or “inferior”, and are underrepresented and under-examined within the same news categories as their male counterparts. In this paper, we present an initial study on gender bias in news abstracts in two large English news datasets used for news recommendation and news classification. We perform three large-scale, yet effective text-analysis fairness measurements on 296,965 news abstracts. In particular, to our knowledge we construct two of the largest benchmark datasets of possessive (gender-specific and gender-neutral) nouns and attribute (career-related and family-related) words datasets which we will release to foster both bias and fairness research aid in developing fair NLP models to eliminate the paradox of gender bias. Our studies demonstrate that females are immensely marginalized and suffer from socially-constructed biases in the news. This paper individually devises a methodology whereby news content can be analyzed on a large scale utilizing natural language processing (NLP) techniques from machine learning (ML) to discover both implicit and explicit gender biases.},
  booktitle = {Companion Proceedings of the Web Conference 2021},
  pages = {385–392},
  numpages = {8}
}
Note: The ASU Social Computing Data Repository contains several network datasets.
- UC Irvine Machine Learning Repository
- Stanford Large Network Dataset Collection
- Yahoo Research Webscope Datasets
Note: Yahoo Research Ratings and Classification Data (Music, Movies, Tags, Clicks, Images & Videos): This set of datasets contains music ratings, movie ratings, popular URLs and tags, a click log dataset, face images of celebrities, and 22K videos.
The following datasets are very popular in recommender systems research; brief descriptions are given below.
- MIND: The MIND dataset was collected from the Microsoft News website. For more detailed information, refer to the MIND paper (Wu et al., 2020). The authors randomly sampled news over the 6 weeks from October 12 to November 22, 2019, creating two datasets, MIND and MIND-small, totalling 161,013 news articles. Each news article contains a news ID, a category label, a title, and a body (URL); however, not every article contains an abstract, resulting in 96,112 abstracts. We used the training set (the largest set of news articles) since both the validation and test sets are subsets of the training set. MIND was created to serve as a new benchmark dataset for news recommendation.
- NCD: The NCD dataset was collected from HuffPost. The news articles were sampled from news headlines from 2012 to 2018, totalling 202,372 news articles. Each news article contains a category label, headline, authors, link, and date; however, not every article contains a short description (abstract), resulting in 200,853 abstracts. NCD serves as a news classification and recommendation benchmark dataset.
- ANTCD: The ANTCD dataset was constructed by Zhang et al. from news gathered from over 2,000 news sources by ComeToMyHead (an online academic news search engine) in under 2 years of activity. They accessed the original AG's News Corpus, which contained 496,835 news articles, and chose the 4 categories with the largest samples (30,000 articles each), creating the ANTCD dataset with 120,000 news articles. Each news article contains a category (class index), a title, and an abstract. We used the training set (the largest set of news articles) since the test set is a subset that contains only 7,600 testing samples. ANTCD serves as a news classification and recommendation benchmark dataset.
- Amazon: This Amazon dataset consists of reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs) spanning from May 1996 to July 2014.
- Amazon - Ratings (Beauty Products): This dataset covers over 2 million customer reviews and ratings of beauty-related products sold on Amazon's website.
- Toy Products on Amazon: This is a pre-crawled dataset, taken as a subset of a bigger dataset (more than 115k products) that was created by extracting data from Amazon.com.
- Slashdot: The network contains friend/foe links between the users of Slashdot and was obtained in February 2009.
- Taobao: This dataset contains anonymized users' shopping logs from the 6 months before and on the "Double 11" day, and label information indicating whether they are repeat buyers. Due to privacy concerns, the data is sampled in a biased way, so statistical results on this dataset will deviate from the actual figures for Tmall.com.
- Microsoft Web Data Dataset: This dataset contains a log of anonymous users of www.microsoft.com; the task is to predict areas of the web site a user visited based on data about other areas the user visited.
- Retailrocket recommender system dataset: This dataset consists of three files: a file with behaviour data (events.csv), a file with item properties (item_properties.csv), and a file that describes the category tree (category_tree.csv). The data has been collected from a real-world ecommerce website.
- Wikipedia: Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries.
- Airbnb Collection: The data was taken from http://tomslee.net/airbnb-data-collection-get-the-data and covers the city of Barcelona. The data was collected from the public Airbnb web site without logging in, and the code used is available at https://github.com/tomslee/airbnb-data-collection.
- Yelp: This Yelp dataset is a subset of businesses, reviews, and user-generated data for personal, educational, and academic purposes. It is available as both JSON and SQL files and can be used to teach students about databases, to learn NLP, or as sample production data while you learn how to make mobile apps.
- Facebook: This entry contains an exploratory data analysis of a Facebook dataset that aims to identify users who can be targeted to increase business. These insights should help Facebook make intelligent decisions to identify its useful users and provide correct recommendations to them.
- Twitter: This dataset consists of 'circles' (or 'lists') from Twitter. Twitter data was crawled from public sources. The dataset includes node features (profiles), circles, and ego networks.
- Pinterest: This dataset contains the scene-product pairs for fashion and home, respectively.
- Spanish Stocks Historical Data from 2000 to 2019: This dataset contains historical data retrieved for the companies that make up the Continuous Spanish Stock Market. You may need to refer to investpy, a Python package for retrieving data from Investing.com.
- Stock Exchange: This dataset contains the ZZAlpha® machine learning recommendations made for various US-traded stock portfolios on the morning of each day during the 3-year period Jan 1, 2012 - Dec 31, 2014.
- Job Recommendation: This dataset contains a list of recommended jobs for individuals.
- Job Recommendation Analysis: A recommendation engine built using NLTK that helps applicants choose their preferred job based on their application. You will learn how lemmatization, stemming, and vectorization are used to process the data and produce better output.
- Item Learning: A dataset for Learning from Sets of Items in Recommender Systems (2019)
- eCommerce Item Dataset: This dataset contains 500 actual SKUs from an outdoor apparel brand's product catalog.
- Epinions: Epinions is a website where users can register for free and write subjective reviews about many different types of items.
- Good Reads: This dataset was created to meet the need for a good, clean dataset of books.
- Book Crossing: The BookCrossing (BX) dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community.
- Open OSM: This data is from OpenStreetMap, which is a collaborative mapping project, sort of like Wikipedia but for maps. For Python reference code, a few scripts are available at the [Hermes repo](https://github.com/lab41/hermes).
- Dating Agency: This dataset contains 17,359,346 anonymous ratings of 168,791 profiles made by 135,359 LibimSeTi users as dumped on April 4, 2006.
- Personality 2018: This dataset was released for the study "User personality and user satisfaction with recommender systems".
- DEAPdataset: This is a dataset for emotion analysis using EEG, physiological, and video signals.
- MyPersonalityDataset: This dataset contains information from a popular Facebook application that allowed users to take real psychometric tests, and allowed their Facebook profiles and psychological responses to be recorded (with consent!). Currently, the database contains more than 6,000,000 test results, together with more than 4,000,000 individual Facebook profiles.
- Million Song Dataset: The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. For code for the dataset, refer to MSongDB repo.
- LastFM (Implicit): This dataset contains social networking, tagging, and music artist listening information from a set of users of the Last.fm online music system, consisting of 92,800 artist listening records from 1,892 users.
- Netflix: This Netflix dataset is the official dataset that was used in the Netflix Prize competition.
- MovieLens: GroupLens Research has collected and made available rating datasets from their movie web site consisting of 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
- Flixster: Flixster is a social movie site allowing users to share movie ratings, discover new movies and meet others with similar movie taste.
- IMDB: This is a link dataset built with permission from the Internet Movie Database (IMDb).
- CiaoDVD & Epinions: CiaoDVD is a dataset crawled from the entire DVD category. The Epinions dataset contains, for each user, the ratings and trust relations in their profile. Each rating includes the product name and its category, the rating score, the time when the rating was created, and the helpfulness of the rating.
- Anime Recommendations Database: This data set contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings.
- Anime Data: Japanese animation, which is known as anime, has become internationally widespread nowadays. This dataset provides data on anime taken from Anime News Network.
- Restaurant and Consumer: This dataset was obtained from a recommender system prototype, with the task of generating a top-n list of restaurants according to consumer preferences.
- Chicago Entree: This is a dataset containing a record of user interactions with the Entree Chicago restaurant recommendation system.
- Steam Video Games: This dataset is a list of user behaviors, with columns such as user-id, game-title, behavior-name, value. The behaviors included are 'purchase' and 'play'. The value indicates the degree to which the behavior was performed - in the case of 'purchase' the value is always 1, and in the case of 'play' the value represents the number of hours the user has played the game.
- Steam Reviews Dataset: This dataset contains reviews of Steam's best-selling games as of February 2019.
- Jester: This is a joke dataset containing 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,496 users.
- Citation Network: The dataset is designed for research purposes only. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with an abstract, authors, year, venue, and title.
- YAGO: YAGO is a huge semantic knowledge base derived from Wikipedia, WordNet, and GeoNames. Currently, YAGO has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities.
- Complete Collection of Kaggle Datasets: (below is more information pertaining to this dataset)
- Context: Data analysts often find it complicated to locate the right dataset for a project or for practice, so this collection of Kaggle datasets helps them explore the opportunities that Kaggle offers.
- Content: Part of the data was first collected using the Kaggle API to retrieve the full list of datasets; each URL reference was then processed with a Python script to retrieve more detailed information.
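
As an illustration of how one of the datasets above might be prepared for a recommendation task, here is a minimal Python sketch that parses rows in the column order described for Steam Video Games (user-id, game-title, behavior-name, value). The sample rows, the absence of a header row, and the `load_interactions` helper are assumptions made for this example, not part of the dataset or of any linked repository.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows (not real data), in the column order described
# above: user-id, game-title, behavior-name, value. No header row is assumed.
SAMPLE = """\
101,Example Game A,purchase,1.0
101,Example Game A,play,12.5
101,Example Game B,purchase,1.0
202,Example Game A,purchase,1.0
202,Example Game A,play,0.7
"""


def load_interactions(handle):
    """Split behavior rows into purchases and hours played, keyed by user."""
    purchases = defaultdict(set)   # user-id -> set of purchased titles
    hours = defaultdict(dict)      # user-id -> {title: total hours played}
    for row in csv.reader(handle):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        user_id, title, behavior, value = row[:4]
        if behavior == "purchase":
            purchases[user_id].add(title)
        elif behavior == "play":
            played = hours[user_id].get(title, 0.0)
            hours[user_id][title] = played + float(value)
    return purchases, hours


purchases, hours = load_interactions(io.StringIO(SAMPLE))
```

To read a real file instead of the sample, pass an open file handle, for example `open(path, newline="")`, where `path` is wherever you saved the download. Hours played can then serve as an implicit-feedback signal, while purchases without play time can be treated as weaker evidence of interest.
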
- Recommender Systems Basics
- Nearest Neighbor Search
- Classic Matrix Factorization
- Singular Value Decomposition (SVD)
- SVD++
- Content-based CF / Context-aware CF
- there are so many ...
- Advanced Matrix Factorization
- Factorization Machine
- Sparse Linear Method (SLIM)
- Learning to Rank
- Cold-start
- Network Embedding
- Sequential-based
- Translation Embedding
- Graph-Convolution-based
- Knowledge-Graph-based
- Deep Learning
- Deep Neural Networks for YouTube Recommendations
- Deep Learning based Recommender System: A Survey and New Perspectives
- Neural Collaborative Filtering
- Collaborative Deep Learning for Recommender Systems
- Collaborative Denoising Auto-Encoders for Top-N Recommender Systems
- Collaborative recurrent autoencoder: recommend while learning to fill in the blanks
- TensorFlow Wide & Deep Learning
- Collaborative Memory Network for Recommendation Systems
- Variational Autoencoders for Collaborative Filtering
- Recommender Systems Specialization, University of Minnesota
- Introduction to Recommender Systems: Non-Personalized and Content-Based, University of Minnesota
- Kaggle - product recommendations, hotel recommendations, job recommendations, etc.
- ACM RecSys Challenge
- WSDM Cup 2018
- Million Song Dataset Challenge
- Netflix Prize
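
To make the "Nearest Neighbor Search" entry in the basics list above more concrete, the following is a rough sketch of user-based nearest-neighbor collaborative filtering with cosine similarity, written in plain Python. The toy ratings and all function names are hypothetical and are not taken from any dataset or repository listed here; real systems typically add rating normalization and weighting by similarity.

```python
import math

# Hypothetical toy ratings: user -> {item: rating}. Not from any dataset above.
RATINGS = {
    "u1": {"A": 5.0, "B": 4.0, "C": 1.0},
    "u2": {"A": 4.0, "B": 5.0},
    "u3": {"A": 1.0, "C": 5.0, "D": 4.0},
}


def cosine(a, b):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def nearest_neighbors(user, ratings, k=1):
    """Return the k users most similar to `user`, most similar first."""
    scores = [
        (cosine(ratings[user], other_ratings), other)
        for other, other_ratings in ratings.items()
        if other != user
    ]
    scores.sort(reverse=True)
    return [other for _, other in scores[:k]]


def recommend(user, ratings, k=1):
    """Suggest unseen items rated by the k nearest neighbors, best first."""
    seen = set(ratings[user])
    candidates = {}
    for neighbor in nearest_neighbors(user, ratings, k):
        for item, rating in ratings[neighbor].items():
            if item not in seen:
                # Keep the highest rating any neighbor gave the item.
                candidates[item] = max(candidates.get(item, 0.0), rating)
    return sorted(candidates, key=candidates.get, reverse=True)
```

For example, `recommend("u2", RATINGS, k=2)` looks at both other users and ranks the items `u2` has not rated by the highest rating a neighbor gave them.
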