I. Introduction
II. Data Availability and Provenance Statements
- Data Availability
- Statement about Rights
- Summary of Availability
- Repository Structure
- Details on the Data Source
- Details on the Raw Data Source
III. Requirements
V. Instructions to Replicators
Author | Contact |
---|---|
Bianca Magalotti | [email protected] |
Emanuele Franceschini | [email protected] |
Tommaso Rabino | [email protected] |
Akram Benmrit | [email protected] |
Colin Tan | [email protected] |
The rapid growth of Airbnb, a global platform for short-term property rentals, has introduced new challenges and opportunities for hosts, necessitating data-driven tools and insights to empower them. The objective of this study is twofold. First, we developed an XGBoost Regression Model designed to assist Airbnb hosts in setting optimal rental prices for their listings. Second, we leveraged insights from the regression model and other data analysis techniques to highlight the critical factors influencing listing prices. Our findings not only provide valuable decision-making tools for hosts but also contribute to the broader discourse on short-term property rentals in Italy.
- Regression Model: Is it possible to develop a predictive model able to help Airbnb hosts in setting the optimal rental prices for their listings?
- Exploratory Data Analysis: How to use exploratory data analysis to help Airbnb hosts understand the critical factors that influence the price of their listings?
The code in this replication package constructs all the files in the folders data-preparation and analysis from the data source (Milan listings.csv from InsideAirbnb) using R. Six main code files (5 in RMarkdown, 1 in plain R) produce all the data, insights, and results discussed in the final paper, which can be found in the following folder: gen/paper/output. The replicator should expect the code to run for about 4 hours.
The data was gathered from Inside Airbnb, an independent project (not operated by Airbnb) that provides publicly available datasets of Airbnb listings. The platform provides the most recent version of the data for many cities around the world, allowing users to download the datasets they need.
Specifically, the website provides both a downloader page, where the dataset can be freely downloaded (Inside Airbnb: Get the Data), and an explorer page, where a user-friendly tool enables users to analyze and explore a specific listings dataset (Inside Airbnb: Explore).
For the purposes of this project, the dataset used was “listings.csv.gz”, which contains the detailed listings data for the city of Milan, last updated on 21 June 2023. The dataset can be downloaded and explored from the following two links: Inside Airbnb: Milan Listings Downloader; Inside Airbnb: Milan Listings Explorer.
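The download step described above could be sketched in base R roughly as follows. This is only an illustration: the URL pattern and the helper names `build_listings_url()` and `read_listings()` are assumptions, not taken from 1_data_download, so copy the actual link from the Inside Airbnb downloader page.

```r
# Sketch: build a download URL and read a gzip-compressed listings file.
# ASSUMPTION: the URL pattern below is illustrative only; use the link shown
# on the Inside Airbnb "Get the Data" page.
build_listings_url <- function(snapshot_date,
                               base = "http://data.insideairbnb.com/italy/lombardy/milan") {
  paste0(base, "/", snapshot_date, "/data/listings.csv.gz")
}

# Read a .csv.gz file directly, without unpacking it to disk first.
read_listings <- function(path) {
  read.csv(gzfile(path), stringsAsFactors = FALSE)
}

# Hypothetical usage (needs an internet connection, so it is not run here):
# url <- build_listings_url("2023-06-21")
# download.file(url, destfile = "listings.csv.gz", mode = "wb")
# listings <- read_listings("listings.csv.gz")
```

Reading through `gzfile()` avoids a separate extraction step; the project itself extracts the file and stores it as a .csv, as described in the repository structure.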
- [x] I certify that the author(s) of the manuscript have legitimate access to and permission to use the data used in this manuscript.
- [x] All data are publicly available.
- [ ] Some data cannot be made publicly available.
- [ ] No data can be made publicly available.
|-- data
|   |-- dataset1
|-- gen
|   |-- analysis
|   |   |-- input
|   |-- data-preparation
|   |   |-- input
|   |-- paper
|   |   |-- input
|   |   |-- output
|-- src
|   |-- analysis
|   |   |-- 5_regression_model
|   |   |-- clean-up_an
|   |-- data-preparation
|   |   |-- 0_installing_packages
|   |   |-- 1_data_download
|   |   |-- 2_data_cleaning
|   |   |-- 3_data_exploration
|   |   |-- 4_data_preparation
|   |   |-- clean-up_dp
|   |-- shinyapp
|   |   |-- 6_shinyapp
|-- save_time_objects
|-- .gitignore
|-- README.md
|-- makefile
- Download the Data: The raw dataset can be downloaded from the following link: Inside Airbnb: Milan Listings Downloader.
Additionally, once all source files of the present project have been run, the following datasets will be automatically stored in the folders and formats described below:
- Raw Dataset: The raw dataset (systematically extracted from the zip file that can be downloaded as described above) will be stored in the following two folders and formats:
Folder | Name | Format |
---|---|---|
data/dataset1 | milan_listings | .csv |
gen/data-preparation/input | raw_data | .rds |
- Cleaned Dataset: After all data cleaning and feature engineering operations performed in the source code file 2_data_cleaning, the cleaned dataset will be stored in the following folder and format:
Folder | Name | Format |
---|---|---|
gen/data-preparation/input | clean_data | .rds |
- Regression Data: After all data preparation operations performed in the source code file 4_data_preparation the dataset used for modelling the regression will be stored in the following folder and format:
Folder | Name | Format |
---|---|---|
gen/analysis/input | regression_data | .rds |
These steps ensure that users can always inspect the dataset characteristics at each stage of the project.
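As a minimal sketch of how one of these intermediate files could be written and read back, assuming a hypothetical helper `save_stage()` and a toy data frame in place of the real listings (the folder and file names mirror the tables above):

```r
# Sketch: store one intermediate dataset as .rds and read it back.
# ASSUMPTION: `save_stage()` and `project_root` are hypothetical names used
# for illustration; they are not taken from the project's source files.
save_stage <- function(data, project_root, folder, name) {
  dir_path <- file.path(project_root, folder)
  dir.create(dir_path, recursive = TRUE, showWarnings = FALSE)
  out_path <- file.path(dir_path, paste0(name, ".rds"))
  saveRDS(data, out_path)
  out_path
}

# Toy stand-in for the cleaned listings data (values are made up)
toy_clean <- data.frame(id = 1:3, price = c(80, 120, 95))

# A temporary directory plays the role of the repository root here
root <- tempdir()
clean_path <- save_stage(toy_clean, root, "gen/data-preparation/input", "clean_data")

# Any later stage can reload exactly the same object
reloaded <- readRDS(clean_path)
```

The .rds format preserves column types and factor levels, which a plain .csv round trip would not.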
The dataset used for the present project comprises 23,142 observations, each corresponding to individual Airbnb listings in Milan, spanning 75 variables that offer diverse insights into listing characteristics, management, and performance.
It's crucial to highlight that not all listed variables will be utilized in subsequent analyses. Some will be omitted due to their lack of relevance or contribution to the analytical objectives of this project. Conversely, in the course of analysis, certain feature engineering operations were applied to create new variables. This enhances the encapsulation and representation of information in the dataset, facilitating a more comprehensive understanding of trends, patterns, and insights. Finally, due to data cleaning operations, the number of observations in the dataset may vary.
Here’s an overview of what each (original) variable represents:
Variable | Description or Motivation for Removal |
---|---|
id | A unique identifier for the listing. |
listing_url | The URL of the listing on Airbnb. |
scrape_id | The unique id of the scraping session. |
last_scraped | The date when the data for the listing was last scraped. |
source | The origin of the listing data. |
name | The name of the listing. |
description | A comprehensive description of the listing. |
neighborhood_overview | An overview of the listing's neighborhood. |
picture_url | The URL of the listing's featured picture. |
host_id | A unique identifier for the host of the listing. |
host_url | The URL of the host's Airbnb profile. |
host_name | The name of the host. |
host_since | The date when the host joined Airbnb. |
host_location | The location of the host. |
host_about | Information provided by the host about themselves. |
host_response_time | The typical amount of time the host takes to respond to messages. |
host_response_rate | The host’s response rate to messages. |
host_acceptance_rate | The rate at which the host accepts booking requests. |
host_is_superhost | Indicator of whether the host is a Superhost. |
host_thumbnail_url | The URL of the host’s thumbnail picture. |
host_picture_url | The URL of the host’s profile picture. |
host_neighbourhood | The neighborhood the host is located in. |
host_listings_count | The total number of listings the host has. |
host_total_listings_count | The total number of listings the host has across all platforms. |
host_verifications | The methods the host has used to verify their identity. |
host_has_profile_pic | Indicator of whether the host has a profile picture. |
host_identity_verified | Indicator of whether the host’s identity has been verified. |
neighbourhood | The neighborhood the listing is located in. |
neighbourhood_cleansed | The cleaned name of the neighborhood the listing is located in. |
neighbourhood_group_cleansed | The cleaned name of the neighborhood group the listing is located in. |
latitude & longitude | The geographical coordinates of the listing. |
property_type | The type of property listed. |
room_type | The type of room listed. |
accommodates | The number of people the listing can accommodate. |
bathrooms | The number of bathrooms in the listing. |
bathrooms_text | Textual description of the bathrooms. |
bedrooms | The number of bedrooms in the listing. |
beds | The number of beds in the listing. |
amenities | The amenities offered by the listing. |
price | The price of the listing per night. |
minimum_nights to maximum_nights | Various restrictions and requirements related to the minimum and maximum nights a guest can book. |
calendar_updated | When the listing’s calendar was last updated. |
has_availability | Indicator of whether the listing is available. |
availability_30 to availability_365 | The number of days the listing is available over different time spans. |
calendar_last_scraped | The date when the listing’s calendar was last scraped. |
number_of_reviews to number_of_reviews_l30d | Various measures of the number of reviews the listing has received. |
first_review & last_review | Dates of the first and last reviews received. |
review_scores_rating to review_scores_value | Various scores representing the quality of the listing as rated by guests. |
license | The license number of the listing, if applicable. |
instant_bookable | Indicator of whether the listing can be booked instantly. |
calculated_host_listings_count to calculated_host_listings_count_shared_rooms | Various measures of the number of listings the host has. |
reviews_per_month | The average number of reviews the listing receives per month. |
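To illustrate the kind of feature engineering mentioned before the table, the sketch below converts a text price into a number and derives indicator columns from the amenities field. It uses base R and toy rows only; the assumed raw formats (prices such as "$1,250.00", amenities as a JSON-style array string) and the choice of amenities are assumptions to verify against the actual file, and the project's own 2_data_cleaning file may proceed differently.

```r
# Toy rows imitating the raw `price` and `amenities` columns.
# ASSUMPTION: the formats below are illustrative; check them in the real data.
listings <- data.frame(
  id = 1:3,
  price = c("$85.00", "$1,250.00", "$40.00"),
  amenities = c('["Wifi", "Kitchen"]', '["Wifi"]', '["Kitchen", "Elevator"]'),
  stringsAsFactors = FALSE
)

# Strip the currency symbol and thousands separator, then convert to numeric
listings$price_num <- as.numeric(gsub("[$,]", "", listings$price))

# One 0/1 indicator column per amenity of interest (hypothetical selection)
for (amenity in c("Wifi", "Kitchen")) {
  col <- paste0("has_", tolower(amenity))
  quoted <- paste0('"', amenity, '"')   # match the quoted name exactly
  listings[[col]] <- as.integer(grepl(quoted, listings$amenities, fixed = TRUE))
}
```

Matching the quoted name keeps "Wifi" from also matching a longer amenity that merely contains that word as a prefix of its quoted label.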
The present project does not involve exceptionally large datasets, and the R environment is systematically cleaned by a specific code snippet at the end of each code file, making the project accessible for a standard PC commonly available in 2023. The whole set of code files was last run on an Apple MacBook Air (2020), with the following technical specifications: (i) CPU: Apple M1 8-core - 3.2 GHz; (ii) GPU: Apple M1 7-Core GPU; (iii) RAM: 8GB; (iv) SSD: 256GB; (v) Operating System: macOS 14.
- R & RStudio --> The code was developed and executed in R (R version 4.2.2), using RStudio (RStudio version 2022.12.0+353) as the integrated development environment (IDE). Both can be installed from this link: R and RStudio Installation Guide.
- R Libraries and Packages --> The source code files use the following R packages and libraries. You do not need to download or load them in advance: the source code file 0_installing_packages handles this for you.
- GENERAL PACKAGES:
library(readr)
library(tidyverse) #A "Package of Packages" for Data manipulation and visualization (includes magrittr, lubridate, purrr, tidyr, etc.).
library(dplyr) #Data frame manipulations (select, slice, etc.).
library(jsonlite) #For Amenities Columns Creation
library(moments) #Measuring the skewness.
- REGRESSION PACKAGES
library(caret) #Hyperparameters Tuning.
library(xgboost) #XGBoost Regression.
library(DALEX) #Summary of the XGBoost Regression Model ("explainer").
library(bayesforecast) #Checking Regression Assumptions.
- SHINYAPP PACKAGES
library(shiny) #For the ShinyApp
library(shinyWidgets) #For the ShinyApp
- PLOT AND FIGURES PACKAGES:
library(ggplot2) #Building fancy plots.
library(ggthemes) #Themes for ggplots (e.g. "solarized").
library(ggcorrplot) #For correlograms
library(scales) #Scaling and formatting ggplots (e.g. scale_fill_gradient()).
library(gt) #Latex tables
- WORKING DIRECTORY SETTING PACKAGES
library(here)
library(rstudioapi)
- LaTeX Distribution --> To compile the final paper into a PDF document with LaTeX styling, you need to have a LaTeX distribution installed on your computer. To install a LaTeX distribution:
- On Windows: You can use distributions like MiKTeX or TeX Live. You can download them from the following links: MiKTeX Download; TeX Live Download
- On macOS: MacTeX is a popular distribution. You can download it from the following link: MacTeX Download
- LaTeX Packages --> For the same purpose, you also need to make sure that the following LaTeX packages are installed in your LaTeX distribution. You can typically install missing packages using the package manager of your LaTeX distribution.
%----------------------------------------------------------------------------------------
% FONTS, MARGINS, AND PDF STYLING
%----------------------------------------------------------------------------------------
- babel: Language settings.
- fontenc: Font encoding.
- inputenc: Required for inputting international characters.
- mathpazo: Use the Palatino font.
- microtype: Slightly tweak font spacing for aesthetics.
- mathptmx: Times New Roman font for text.
- helvet: Arial-like font for sans-serif.
- setspace: Line spacing.
- geometry: set the margin.
- amsmath: Math equations.
- amssymb: Math symbols.
- hyperref: Hyperlinks and URLs.
- enumerate: Enumerate environment.
- enumitem: Required for list customization.
- multicol: For two columns.
%----------------------------------------------------------------------------------------
% HEADERS, FOOTERS, TITLE, ABSTRACT, BIBLIOGRAPHY, CAPTIONS AND GRAPHICS
%----------------------------------------------------------------------------------------
- fancyhdr: Header and footer customization.
- titlesec: Section titles formatting.
- titling: Required for customizing the title section.
- biblatex: to style bibliography.
- natbib: Citation style.
- appendix: Appendix formatting.
- abstract: Abstract formatting.
- caption: Captions customization.
- graphicx: Graphics.
%----------------------------------------------------------------------------------------
% TABLES
%----------------------------------------------------------------------------------------
- color
- rotating
- tabularray
- booktabs
%----------------------------------------------------------------------------------------
% OTHER PACKAGES
%----------------------------------------------------------------------------------------
- etoolbox
- footmisc
- listings
- RMarkdown --> RMarkdown was used to convert the code from RStudio into more comprehensible PDF documents, allowing for a seamless representation of the analysis flow. Refer to the RMarkdown Installation Guide for detailed instructions on how to install RMarkdown in your RStudio environment.
- make --> The build tool make was used to automate the compilation of all source code files and of the final paper PDF document. The guide to install make can be found at this page: Make Installation Guide.
- Pandoc --> Finally, to make sure that your computer is able to compile the PDF documents resulting from the RMarkdown source files, you should install Pandoc by following this guide: Pandoc Installation Guide.
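For the R packages listed above, an install-only-if-missing check could look like the sketch below. The helper names `missing_packages()` and `install_if_missing()` are hypothetical and are not taken from 0_installing_packages, which may handle installation differently.

```r
# Sketch: install only the packages that are not yet available.
# ASSUMPTION: both helper names are hypothetical, introduced for illustration.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Packages shipped with R itself are always reported as present
missing_packages(c("stats", "utils"))

# Hypothetical usage with a few packages from the list above (not run here):
# install_if_missing(c("tidyverse", "caret", "xgboost", "shiny"))
```

Separating the check from the installation makes it possible to see what would be installed before anything is downloaded.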
The whole set of code files was last run on an Apple MacBook Air (2020). On this hardware, the code took almost 5 hours to generate the whole output. Most of this time is spent generating the following R objects:
- xgb_caret.rds (4 hours)
- xgb_mod.rds (5 minutes)
- xgbcv.rds (10 minutes)
Therefore, in case you want to save almost the entire time needed to run the whole set of source code files, follow the "Save Time Instructions" in the section "Instructions to Replicators".
- A random seed is set at the beginning of each source code file, using:
set.seed(999)
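A short illustration of why the fixed seed matters: resetting the seed to the same value reproduces the same pseudo-random draws, which is what makes steps such as data splitting repeatable. The variable names below are illustrative only.

```r
# Two draws that start from the same seed are identical
set.seed(999)
first_draw <- sample(1:100, 5)

set.seed(999)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE
```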
All source code files present in this repository are described in the table below:
File Name | File Format | File Description | File Location | File Output |
---|---|---|---|---|
0_installing_packages | .R | Installs all the necessary R packages and sets the working directory to the source file location. | src/data-preparation | N/A |
1_data_download | .Rmd | Downloads the data source zip file, extracts the dataset, and loads it into R. | src/data-preparation | 1_data_download.pdf |
2_data_cleaning | .Rmd | Cleans the dataset from NAs, outliers, and useless or empty columns. | src/data-preparation | 2_data_cleaning.pdf |
3_data_exploration | .Rmd | Set of EDA operations, including correlograms and categorical variable visualizations. | src/data-preparation | 3_data_exploration.pdf |
4_data_preparation | .Rmd | Set of operations needed to prepare the dataset for the regression modeling, including computation of logarithm of the DV, one-hot encoding of factor variables, centering and scaling numeric variables, and dividing the dataset into a training and a testing dataset. | src/data-preparation | 4_data_preparation.pdf |
5_regression_model | .Rmd | Hyperparameter tuning, determining the optimal number of iterations, training the model and assessing its performance, checking regression assumptions. | src/analysis | 5_regression_model.pdf |
6_shinyapp | .R | Develops an interactive and user-friendly ShinyApp capable of predicting the price of an Airbnb listing located in Milan. The ShinyApp uses the previously trained and validated regression model to predict the price of a listing whose characteristics (number of rooms, beds, bathrooms, and accommodated people, location, type of apartment, etc.) can be defined a priori by the user. | src/shinyapp | ShinyApp Interface |
clean-up_dp | .R | Removes all irrelevant files, including .Rhistory and .RData, from the folder src/data-preparation. | src/data-preparation | N/A |
clean-up_an | .R | Removes all irrelevant files, including .Rhistory and .RData, from the folder src/analysis. | src/analysis | N/A |
final_paper | .pdf | PDF file with all results and insights gained from the analysis. | gen/paper/output | N/A |
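The preparation steps listed for 4_data_preparation (log of the dependent variable, one-hot encoding, centering and scaling, train/test split) can be sketched in base R on toy data as follows. This is not the project's code: the toy values, the use of base R functions instead of any package helpers, and the 80/20 split ratio are all assumptions made for illustration.

```r
set.seed(999)  # same seed value as the project files

# Toy data standing in for the cleaned listings (values are made up)
toy <- data.frame(
  price = c(60, 85, 120, 45, 200, 150, 95, 70, 110, 65),
  accommodates = c(2, 2, 4, 1, 6, 4, 3, 2, 4, 2),
  room_type = factor(c("Entire home/apt", "Private room", "Entire home/apt",
                       "Shared room", "Entire home/apt", "Entire home/apt",
                       "Private room", "Private room", "Entire home/apt",
                       "Private room"))
)

# 1. Log-transform the dependent variable
log_price <- log(toy$price)

# 2. One-hot encode the factor ("- 1" drops the intercept so that every
#    level gets its own 0/1 column)
dummies <- model.matrix(~ room_type - 1, data = toy)

# 3. Center and scale the numeric predictor (mean 0, standard deviation 1)
accommodates_scaled <- as.numeric(scale(toy$accommodates))

prepared <- data.frame(log_price = log_price,
                       accommodates = accommodates_scaled,
                       dummies, check.names = FALSE)

# 4. Train/test split (ASSUMPTION: 80/20, chosen only for this example)
train_idx <- sample(seq_len(nrow(prepared)), size = floor(0.8 * nrow(prepared)))
train <- prepared[train_idx, ]
test <- prepared[-train_idx, ]
```

Predictions made on the log scale can be mapped back to prices with `exp()`.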
To automatically run all source code files of this project, please follow these instructions:
- Copy the HTTPS clone URL of this GitHub repository.
- Open your command line / terminal and select a working directory where you want to store this project's repository. The following is an example of how to change the working directory (replace "C:/Users/Admin/Desktop" with the name of your selected directory):
cd "C:/Users/Admin/Desktop"
- Then, copy and paste the following command into your command line / terminal (you can also type git clone followed by the repository URL that you copied in step 1):
git clone https://github.com/course-dprep/team-project-team_9_group_project.git
- Set your working directory to the project repository using the following command (replace your_repository_path with the directory you have selected in step 2):
cd "your_repository_path/team-project-team_9_group_project"
- Type the following command on your terminal / command prompt:
make
- When make has successfully run all the code, a message such as the following will appear in your terminal / command prompt:
Listening on http://127.0.0.1:3580
- Open the link that appears in your terminal / command prompt in your browser (e.g. Google Chrome, Safari, etc.).
- The ShinyApp will be rendered and you will be able to start playing with it :)
- Note: when the command line / terminal is closed, the ShinyApp will no longer be available.
If you want to save almost all of the time needed to run the whole set of source code files, follow these instructions:
- Before running the project (i.e. before proceeding with "step 4" of the "Step by Step" guide, see above), go to the project's directory (following "step 1", "step 2", and "step 3" should have allowed you to clone the project's repository locally on your device).
- Open the folder save-time-objects.
- Copy the 3 files you find inside this folder (xgb_caret, xgb_mod, xgbcv).
- Paste these 3 files in the following folder gen/analysis/input.
- Continue to follow the instructions in the "Step by Step" guide (from "step 4" to "step 8").
The code in the 5_regression_model file does not regenerate these R objects if they are already present in that folder.
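The reuse-if-present behaviour described above can be sketched with a small caching helper. The function name `load_or_compute()` and the toy model are hypothetical; the actual logic in 5_regression_model may be written differently.

```r
# Sketch: reuse a saved R object when it exists, otherwise compute and save it.
# ASSUMPTION: `load_or_compute()` is a hypothetical helper for illustration.
load_or_compute <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))   # skip the expensive step
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Toy usage: the second call reads the file instead of recomputing
cache_file <- file.path(tempdir(), "toy_model.rds")
unlink(cache_file)          # start from a clean state

calls <- 0
slow_fit <- function() {
  calls <<- calls + 1       # count how often the "expensive" step really runs
  list(coef = c(1, 2, 3))
}

first <- load_or_compute(cache_file, slow_fit)
second <- load_or_compute(cache_file, slow_fit)
```

After both calls, `calls` equals 1, showing that the expensive step ran only once.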
An alternative route is to run (or knit) all .R and .Rmd files in order, following the numbers in the file names. Note: with this alternative route, the final_paper.pdf document will not be generated automatically.
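A rough sketch of this manual route is given below. It assumes the numbered naming scheme shown in this README; `ordered_sources()` and `run_source()` are hypothetical helpers, and the individual files may expect a specific working directory, so treat it as a starting point rather than a tested runner.

```r
# Sketch: list the numbered source files in execution order.
# ASSUMPTION: files are named "<number>_<name>.R" or "<number>_<name>.Rmd".
ordered_sources <- function(src_dir) {
  files <- list.files(src_dir, pattern = "^[0-9]+_.*\\.(R|Rmd)$",
                      recursive = TRUE, full.names = TRUE)
  prefix <- as.integer(sub("_.*$", "", basename(files)))
  files[order(prefix)]      # numeric order, so 10 comes after 2
}

# Knit RMarkdown files and source plain R scripts
run_source <- function(path) {
  if (grepl("\\.Rmd$", path)) {
    rmarkdown::render(path)
  } else {
    source(path)
  }
}

# Hypothetical usage from the repository root (not run here):
# for (f in ordered_sources("src")) run_source(f)
```

The clean-up scripts have no numeric prefix, so this pattern leaves them out; they would need to be run separately.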