GitHub - hailindu/Goodreads-Reviews-Analysis

Goodreads Web Scraping

The purpose of this project is to build a web scraping script that collects 100 search pages of books (2,000 books) related to Computer Science, and to run a statistical test of whether books with more pages have a higher average rating.

Dataset

The dataset is obtained from the Goodreads search page. We use the search bar to search for Computer Science books. The search results extend to only 100 pages, which contain information on 2,000 books (each search page lists 20 books). Each book has its own link that users can follow to view the book's details. Each book must have a title; the other fields below may or may not be present:

Book title (Must Have)
Author
Average rating
Rating count
Review count
Number of Pages
Format
Languages
First Publish Year
Editions
ISBN
Link
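
Because only the title is guaranteed, one way to represent a scraped book is a record in which every other field is optional. The sketch below is illustrative only: the class name `BookRecord` and all attribute names are assumptions, not the names used in the notebooks.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class BookRecord:
    """One scraped book; only the title is required (hypothetical field names)."""
    title: str
    author: Optional[str] = None
    average_rating: Optional[float] = None
    rating_count: Optional[int] = None
    review_count: Optional[int] = None
    num_pages: Optional[int] = None
    book_format: Optional[str] = None
    language: Optional[str] = None
    first_publish_year: Optional[int] = None
    editions: Optional[int] = None
    isbn: Optional[str] = None
    link: Optional[str] = None

    def to_document(self) -> dict:
        """Return a dict with missing fields dropped, ready to be stored."""
        return {key: value for key, value in asdict(self).items() if value is not None}


# A book page that exposed only a title and a page count
record = BookRecord(title="Example Title", num_pages=320)
print(record.to_document())
```

Dropping the missing fields keeps stored documents small, at the cost of having to handle absent keys during analysis.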

Search Bar Page

Book Info from Sublink

Install & Quickstart

We use MongoDB to store the data during the web scraping process. This lets us do exploratory data analysis while scraping is still in progress, without waiting for it to finish.

As a result, you will need to install Docker and Docker Compose, and register a Docker Hub account. We then pull the MongoDB image using Docker. Once you can run the Docker Hello World example in your terminal, run the following command to pull and start MongoDB:

Using Mongo:

$ docker run --name mongoserver -p 27017:27017 -v "$PWD":/Your-Working-Directory -d mongo

Starting Mongo

$ docker start mongoserver

Technical Workflow

Part1: Web Scraping

We have two separate Jupyter notebooks: one for web scraping, called Goodreads Web Scraping Notebook.ipynb, and another for EDA and statistical tests, called Goodreads Exploratory Data Analysis & Statistical Tests.ipynb.

You will need to:

search for a genre you are interested in
click on the second page to obtain the url link
replace the link of the search bar page in the jupyter notebook

For example:

Page 1/Default Search Bar Page: https://www.goodreads.com/search?utf8=%E2%9C%93&q=Computer+Science&search_type=books
Page 2: https://www.goodreads.com/search?page=2&q=Computer+Science&qid=A44Zff48B9&search_type=books&tab=books&utf8=%E2%9C%93

We noticed that the default page 1 URL does not contain the page parameter (page=2&q=Computer+Science). As a result, you need to start from page 2 and copy that URL into the for loop in the notebook.
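
The loop over search pages can be sketched as follows. This is not the notebook's code: the helper `search_page_url` is hypothetical, and the sketch assumes that Goodreads accepts the URL without the `qid` and `utf8` parameters and that `page=1` returns the first page, neither of which has been verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.goodreads.com/search"


def search_page_url(query: str, page: int) -> str:
    """Build the search URL for one results page (hypothetical helper)."""
    params = {"page": page, "q": query, "search_type": "books", "tab": "books"}
    # urlencode escapes the space in the query as "+", matching the URLs above
    return f"{BASE_URL}?{urlencode(params)}"


# The search results stop at page 100, so generate pages 1 through 100
urls = [search_page_url("Computer Science", page) for page in range(1, 101)]
print(urls[1])
```

If the shortened URL does not work, copy the full page 2 URL from the browser and substitute only the `page` value.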

Part2: Exploratory Data Analysis & Statistical Tests

We can do simple EDA while the web scraping is still running to understand our dataset in advance. After all the data (books) has been collected and stored in MongoDB, we can run the statistical tests to answer our question.

Remember to create the MongoDB database first, so that Part 1 has somewhere to store the data.

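from pymongo import MongoClient

# Connect to the MongoDB server started with Docker above
client = MongoClient('localhost', 27017)
db = client['book']            # database
book_info = db['book_info']    # collection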
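
Storing each book as soon as it is scraped is what makes it possible to explore the data mid-run. A minimal sketch of that pattern is shown below; the function names and the choice to key documents on a `link` field are assumptions, not the notebook's actual code. `update_one` with `upsert=True` and `count_documents` are standard PyMongo collection methods.

```python
def save_book(collection, book: dict) -> None:
    """Insert the book, or update it if the same link was stored before.

    `collection` is expected to be a PyMongo collection such as `book_info`.
    Keying on "link" is an assumption; it keeps a restarted scrape from
    creating duplicate documents.
    """
    collection.update_one(
        {"link": book["link"]},  # match an existing document by its link
        {"$set": book},          # overwrite the stored fields with new values
        upsert=True,             # insert when no document matches
    )


def books_scraped_so_far(collection) -> int:
    """Return how many books are stored, e.g. to check progress from the EDA notebook."""
    return collection.count_documents({})


# Example usage against the collection created above:
# save_book(book_info, {"title": "Example Title", "link": "https://www.goodreads.com/book/show/..."})
# print(books_scraped_so_far(book_info))
```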

Findings

Beyond answering our original question (does the number of pages lead to a higher average rating?), I extended the analysis to other aspects:

Average Rating (no significant difference between books with more pages and books with fewer pages)
Review Count (no significant difference)
Number of Book Editions (no significant difference)
Rating Count (significant difference according to the test)

A book with more pages DOES tend to be rated more often, according to the t-tests.
A book with more pages DOES NOT tend to have a higher average rating, review count, or number of editions, according to the t-tests.
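
For readers who want to see the shape of such a comparison, here is a minimal sketch of a two-sample t-test written with the standard library. It rests on two assumptions the README does not confirm: that books are split into "long" and "short" groups at the median page count, and that Welch's unequal-variance variant is used. The field name `num_pages` and the toy numbers are made up for illustration and are not project results.

```python
from math import sqrt
from statistics import mean, median, variance


def split_by_median_pages(books, field):
    """Split values of `field` into (long, short) groups at the median page count."""
    cutoff = median(book["num_pages"] for book in books)
    long_books = [book[field] for book in books if book["num_pages"] > cutoff]
    short_books = [book[field] for book in books if book["num_pages"] <= cutoff]
    return long_books, short_books


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Toy records with made-up values, only to show the call pattern
toy_books = [
    {"num_pages": 120, "rating_count": 40},
    {"num_pages": 180, "rating_count": 55},
    {"num_pages": 210, "rating_count": 35},
    {"num_pages": 450, "rating_count": 90},
    {"num_pages": 520, "rating_count": 140},
    {"num_pages": 700, "rating_count": 110},
]
long_counts, short_counts = split_by_median_pages(toy_books, "rating_count")
t_stat, dof = welch_t(long_counts, short_counts)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

To obtain a p-value as well, `scipy.stats.ttest_ind(long_counts, short_counts, equal_var=False)` performs the same Welch test.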
