diff --git a/.gitignore b/.gitignore index f95b0f1..a109ca8 100644 --- a/.gitignore +++ b/.gitignore @@ -4,4 +4,11 @@ .RData **gen/ **.Rout -*.csv \ No newline at end of file +*.csv +**pdf +.Rproj.user +**datafiles1.txt +**datafiles.txt +**.DS_Store +**.RDataTmp +**.RDataTmp1 \ No newline at end of file diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..d5528d4 --- /dev/null +++ b/Makefile @@ -0,0 +1,7 @@ +all: data-preparation analysis + +data-preparation: + make -C src/data-preparation + +analysis: + make -C src/analysis \ No newline at end of file diff --git a/README.md b/README.md index 7c81417..70d21ea 100644 --- a/README.md +++ b/README.md @@ -1,23 +1,98 @@ -# Example of reproducible research workflow +# Weekday vs. weekend: is there still a difference in Airbnb prices? -This is a basic example repository using Gnu make for a reproducible research workflow, as described in detail here: [tilburgsciencehub.com](http://tilburgsciencehub.com/). +airbnb-678x381-1 -The main aim of this to have a basic structure, which can be easily adjusted to use in an actual project. In this example project, the following is done: -1. Download and prepare data -2. Run some analysis -3. Present results in a final pdf generated using LaTeX +## Motivation +Short-term weekday stays are becoming increasingly popular in the U.S. (Chipkin, 2022). Demand for Tuesday night stays grew 5% from 2019 to 2021; Wednesdays came in a close second, followed by Mondays and Thursdays. In the past, Airbnb hosts were quick to lower their prices for weekday stays, whereas they might instead have been able to raise them. Airdna (2022) claims that now is an ideal time for Airbnb hosts to optimize their pricing strategy, especially for weekday stays. 
-## Dependencies -- R -- R packages: `install.packages("stargazer")` -- [Gnu Make](https://tilburgsciencehub.com/get/make) -- [TeX distribution](https://tilburgsciencehub.com/get/latex/?utm_campaign=referral-short) -- For the `makefile` to work, R, Gnu make and the TeX distribution (specifically `pdflatex`) need to be made available in the system path -- Detailed installation instructions can be found here: [tilburgsciencehub.com](http://tilburgsciencehub.com/) +In this research, prices of short-term stays on weekdays and weekends will be compared. From the top 25 most popular Airbnb cities in the U.S. (Airdna, 2019), the following cities will be analyzed: Portland, San Francisco, Denver, Los Angeles and New York. These cities are spread across the U.S., so analyzing data from these 5 cities should give a good representation of the U.S. as a whole. The room type (private room, entire home/apartment, shared room or hotel room) may also have an impact on this trend. +For Europe, no sources were found that confirm or deny that the popularity of weekday stays has an impact on Airbnb pricing. For that reason, the top 5 Airbnb cities in Europe will also be analyzed: Munich, Milan, Paris, London and Dublin (Airbnb: These Are Europe’s Most Profitable Cities, n.d.). Finally, the U.S. and Europe will be compared to identify the differences between them. The general question for this study project is as follows: -## Notes -- `make clean` removes all unncessary temporary files. -- Tested under Linux Mint (should work in any linux distro, as well as on Windows and Mac) -- IMPORTANT: In `makefile`, when using `\` to split code into multiple lines, no space should follow `\`. Otherwise Gnu make aborts with error 193. -- Many possible improvements remain. Comments and contributions are welcome! +**“*To what extent does the day of the week (weekday vs. weekend) impact pricing of Airbnb? 
And does this significantly differ per room type, and does this significantly differ between the cities (top 5 cities U.S. vs. top 5 cities Europe)?*”** + + +## Repository overview +```bash +├── README.md +├── gen +│ └── analysis +│ └── output +└── src + ├── analysis + └── data-preparation +``` + +## Required software / programs +To run the project you must have installed the following programs: +- [R and R-studio](https://tilburgsciencehub.com/building-blocks/configure-your-computer/statistics-and-computation/r/) +- [Make](https://tilburgsciencehub.com/building-blocks/configure-your-computer/automation-and-workflows/make/) +- [Git Bash](https://gitforwindows.org/) (Windows users) or Terminal (Mac users) + +## Required packages +To run the entire project, the following packages need to be installed before running the makefile. +- install.packages("tidyverse") +- install.packages("data.table") +- install.packages("afex") +- install.packages("lmerTest") +- install.packages("postHoc") +- install.packages("car") +- install.packages("effectsize") +- install.packages("emmeans") + +## How to run the project: +1) Clone the project to your local computer by:\ + a) Copying the code url\ + b) Opening a terminal/command prompt\ + c) Typing: git clone (insert: code url) +2) Change to the directory where the clone is located --> type: cd What-happens-to-AirBnB-pricing-on-weekdays-vs-weekends/ +3) When in the root directory --> type: make -n + +It should show: +- make -C src/data-preparation +- make -C src/analysis +4) Type: make +5) The entire project should start running from the terminal/command prompt + +Sidenotes: + +* Make has to be installed in order for it to work. +* R must be runnable from the terminal/command prompt. +* It can take some time for the whole project to run. +* Make sure you are in the correct directory. 
+ +## Research method +To answer the research question, multiple Airbnb datasets from [Inside Airbnb](http://insideairbnb.com/get-the-data/) are combined into one dataset. The dataset contains data from 10 cities in total, 5 from the U.S. and 5 from Europe. This dataset is cleaned and prepared for analysis, because much of the information was not needed to answer the research question. For more information about this read: [/src/data-preparation/README_data_preparation.md](https://github.com/course-dprep/What-happens-to-AirBnB-pricing-on-weekdays-vs-weekends/blob/master/src/data-preparation/README_data_preparation.md) + +**Conceptual model:** + +![image](https://user-images.githubusercontent.com/112823109/195831134-55df6bd7-c7eb-4388-b0e6-b1bc8b94fa46.png) + +**Variables of conceptual model:** +```bash +1. wDay: computed variable of weekdays (Monday, Tuesday, Wednesday, Thursday, Sunday) vs. weekend (Friday, Saturday) +2. Room_type: Private room, entire home/ apartment, shared room or hotel +3. City: Top 5 most popular Airbnb cities in the U.S. and in Europe separately +4. Price: this is the price of the room type on a random day during the week or during the weekend +``` + +## Conclusion +Based on the results, the following conclusions can be drawn for the hypothesized relation. There is no significant difference in price between weekdays and weekend days. The average price does differ between weekdays and weekend days for cities in Europe, but this difference is very small. However, there are two interaction effects: between weekdays vs. weekend days and room type on price, and between weekdays vs. weekend days and city on price. + +Despite the conclusion of the hypothesis, it is critical to keep in mind that the effect size was very small in all statistical tests. This means that these results should be interpreted with caution. 
+ +For more detailed information about the findings of the analyses, read: [/gen/analysis/output/README_analysis_conclusion.md](https://github.com/course-dprep/What-happens-to-AirBnB-pricing-on-weekdays-vs-weekends/blob/master/gen/analysis/output/README_analysis_conclusion.md) + +### Authors +This is the repository for the course [Data Preparation and Workflow Management](https://dprep.hannesdatta.com/) at Tilburg University as part of the Master's program Marketing Analytics, used for the team project of group 2. + +- Bo de Ruijter, b.deruijter@tilburguniversity.edu +- Pepijn de Vries, p.j.devries@tilburguniversity.edu +- Amber Pullens, a.pullens@tilburguniversity.edu +- Anouk Lamers, a.j.f.lamers@tilburguniversity.edu +- Caroline Bloemendaal, c.a.bloemendaal@tilburguniversity.edu + +### Resources +- *5 Airbnb Guest Trends to Watch in 2022.* (n.d.). Retrieved October 4, 2022, from https://www.airdna.co/blog/5-airbnb-guest-trends-to-watch +- *Weekday US Hotel Occupancy Hits Pandemic-Era High.* (2022, June 20). Retrieved October 4, 2022, from https://www.businesstravelexecutive.com/news/weekday-us-hotel-occupancy-hits-pandemic-era-high +- *Airbnb: These are Europe’s most profitable cities.* (n.d.). TravelDailyNews International. Retrieved October 11, 2022, from https://www.traveldailynews.com/post/airbnb-these-are-europes-most-profitable-cities diff --git a/data/dataset1/readme.txt b/data/dataset1/readme.txt deleted file mode 100644 index 2234244..0000000 --- a/data/dataset1/readme.txt +++ /dev/null @@ -1,36 +0,0 @@ -========================================================== - D A T A S E T D E S C R I P T I O N -========================================================== - -Name of the dataset: dataset1 - ----------------------------------------------------------- - -1. Motivation of data collection (why was the data collected?) - -Example data for rudimentary example of working with build tools. - -2. Composition of dataset (what's in the data?) - -Simulated data. 
- -3. Collection process (how was the data collected?) - -Data is downloaded from web. See src/data-preparation/download_data.R. - -4. Preprocessing/cleaning/labeling (how was the data cleaned, if at all?) - -No preprocessing. All done in data-preparation stage. - -5. Uses (how is the dataset intended to be used?) - -Used only in example workflow. - -6. Distribution (how will the dataset be made available to others?) - -Raw data available for download for others. - -7. Maintenance (will the dataset be maintained? how? by whom?) - -None. - diff --git a/data/dataset2/readme.txt b/data/dataset2/readme.txt deleted file mode 100644 index b800ed2..0000000 --- a/data/dataset2/readme.txt +++ /dev/null @@ -1,36 +0,0 @@ -========================================================== - D A T A S E T D E S C R I P T I O N -========================================================== - -Name of the dataset: dataset2 - ----------------------------------------------------------- - -1. Motivation of data collection (why was the data collected?) - -Example data for rudimentary example of working with build tools. - -2. Composition of dataset (what's in the data?) - -Simulated data. - -3. Collection process (how was the data collected?) - -Data is downloaded from web. See src/data-preparation/download_data.R. - -4. Preprocessing/cleaning/labeling (how was the data cleaned, if at all?) - -No preprocessing. All done in data-preparation stage. - -5. Uses (how is the dataset intended to be used?) - -Used only in example workflow. - -6. Distribution (how will the dataset be made available to others?) - -Raw data available for download for others. - -7. Maintenance (will the dataset be maintained? how? by whom?) - -None. 
- diff --git a/gen/analysis/audit/.gitkeep b/gen/analysis/audit/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/gen/analysis/input/.gitkeep b/gen/analysis/input/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/gen/analysis/output/README_analysis_conclusion.md b/gen/analysis/output/README_analysis_conclusion.md new file mode 100644 index 0000000..b123732 --- /dev/null +++ b/gen/analysis/output/README_analysis_conclusion.md @@ -0,0 +1,48 @@ + +# **Analysis & interpretation** + +## **Checking the assumptions ANOVA** +Homogeneity of variance, normality of the distribution, and independence of observations are three assumptions that need to be verified in order to determine whether an ANOVA analysis can be conducted. A random sample of 5000 observations was generated for this evaluation. + +**Homogeneity of variance**\ +*City*\ +For the interaction effect between wDay and city, it can be concluded that the variances are not equal, since Levene’s test gives a p-value lower than 0.05. For the direct effect of city on price, Levene’s test also gives a p-value below 0.05. + + +*Room type*\ +For the interaction effect between wDay and room_type, it can likewise be concluded that the variances are not equal, since Levene’s test gives a p-value lower than 0.05. For the direct effect of room_type on price, Levene’s test also gives a p-value below 0.05. + +As a result, the homogeneity assumption is violated. However, this is not a problem for conducting and interpreting the ANOVA analyses since a large sample size is used. + +**Normality of the distribution**\ +From the Shapiro-Wilk normality test we can conclude that price is not normally distributed in the sample, since the p-value is smaller than 0.05. As a result, the normality assumption is violated; however, this is not a problem for conducting and interpreting the ANOVA analyses since a large sample size is used. 
+ +**Independence of observations**\ +When the sample is chosen at random, the experiment is set up properly and therefore the independence of observations can be achieved. The function ‘sample_n’ is used to collect 5000 random observations in a new data frame. Therefore, the ANOVA analyses can be conducted. + +## **ANOVA Analyses** +Several ANOVA analyses were conducted to address the research question *“to what extent does the day of the week (weekday vs. weekend) impact pricing of Airbnb? And does this significantly differ per room type, and does this significantly differ between the cities (top 5 cities U.S. vs. top 5 cities Europe)?”* + +In this section, short descriptions of the findings are given. + +- **ANOVA price and wDay**\ +From the ANOVA it can be concluded that there is no significant relationship between the variable wDay and price (p = 0.811) (anova_wDay_summary.txt). This means that there is no significant difference in price between weekdays and weekend days. +- **ANOVA price and room_type**\ +From the ANOVA it can be concluded that there is a significant relationship between the variable room_type and price (p < 0.001) (anova_room_type_summary.txt). This means that price differs significantly between the room types. To gauge the size of the effect, eta squared was computed. The eta squared is very low, so it can be concluded that the effect is very small. +- **ANOVA price and city**\ +From the ANOVA it can be concluded that there is a significant relationship between the variable city and price (p < 0.001) (anova_city_summary.txt). This means that price differs significantly between the cities. To gauge the size of the effect, eta squared was computed. The eta squared is 0.02, which means that there is a small to medium effect. 
+- **ANOVA with interaction room_type*wDay**\ +From the ANOVA with the interaction effect between room_type and wDay on price, the conclusion is that there is a significant relationship between this interaction variable and the price, since the p-value is very low (p < 0.001) (mod_roomtype_wDay_interaction_results.txt). This leads to the conclusion that the effect of weekdays vs. weekend days on price depends on the room type. However, this effect is not very big, since the eta squared is very low. To get more insight into the differences between room types, a Tukey test was performed. From the results it can be concluded that the price for shared and private rooms is much lower. Hotel rooms and entire homes/apartments have the highest prices. +- **ANOVA with interaction city*wDay**\ +From the ANOVA with the interaction between city and wDay, the conclusion is that there is a significant relationship between this interaction variable and the price (mod_city_wDay_interaction_results.txt). This leads to the conclusion that the effect of wDay on the price differs significantly between cities. However, this effect is very small since the eta squared is 0.02. To get more insight, a Tukey test was performed. It can be concluded from this test that prices are highest in the U.S. city San Francisco and the European city Milan. +- **Difference in price wDay in U.S.**\ +The average price of Airbnbs during the week for cities in the United States is 285. This average price does not differ from weekend days. So it can be concluded that the price for cities in the United States does not differ between weekend days and weekdays. +- **Difference in price wDay Europe**\ +The average price of Airbnbs during the week for cities in Europe is 186. This average price is slightly higher than the average price for weekend days. The average price for Airbnbs on weekend days in Europe is 175. 
So there is a small difference between the price on the weekend and during the week. +- **Differences between weekends and weekdays visualized**\ +![plot_eu_cities](https://user-images.githubusercontent.com/111459511/196128650-7cb88d6b-fdf4-42c5-9bf9-1c4b41a71ca4.png) +![plot_us_cities](https://user-images.githubusercontent.com/111459511/196128706-0f1932de-9004-4f6d-8d69-722a23f89212.png) + + + + diff --git a/gen/analysis/temp/.gitkeep b/gen/analysis/temp/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/gen/data-preparation/audit/.gitkeep b/gen/data-preparation/audit/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/gen/data-preparation/input/.gitkeep b/gen/data-preparation/input/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/gen/data-preparation/output/.gitkeep b/gen/data-preparation/output/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/gen/data-preparation/temp/.gitkeep b/gen/data-preparation/temp/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/gen/paper/audit/.gitkeep b/gen/paper/audit/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/gen/paper/input/.gitkeep b/gen/paper/input/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/gen/paper/output/.gitkeep b/gen/paper/output/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/gen/paper/temp/.gitkeep b/gen/paper/temp/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/makefile b/makefile deleted file mode 100644 index 9caf52a..0000000 --- a/makefile +++ /dev/null @@ -1,51 +0,0 @@ -# Notes: -# - If run on unix system, use rm instead of del command in clean -# - Careful with spaces! 
If use \ to split to multiple lines, cannot have a space after \ - -# OVERALL BUILD RULES -all: data_cleaned results paper -paper: gen/paper/output/paper.pdf -data_cleaned: gen/data-preparation/output/data_cleaned.RData -results: gen/analysis/output/model_results.RData -.PHONY: clean - -# INDIVIDUAL RECIPES - -# Generate paper/text -gen/paper/output/paper.pdf: gen/paper/output/table1.tex \ - src/paper/paper.tex - pdflatex -interaction=batchmode -output-directory='gen/paper/output/' 'src/paper/paper.tex' - pdflatex -interaction=batchmode -output-directory='gen/paper/output/' 'src/paper/paper.tex' - pdflatex -output-directory='gen/paper/output/' 'src/paper/paper.tex' -# Note: runs pdflatex multiple times to have correct cross-references - -# Generate tables -gen/paper/output/table1.tex: gen/analysis/output/model_results.RData \ - src/paper/tables.R - Rscript src/paper/tables.R - -# Run analysis -gen/analysis/output/model_results.RData: gen/data-preparation/output/data_cleaned.RData \ - src/analysis/analyze.R - Rscript src/analysis/update_input.R - Rscript src/analysis/analyze.R - -# Clean data -gen/data-preparation/output/data_cleaned.RData: data/dataset1/dataset1.csv \ - data/dataset2/dataset2.csv \ - src/data-preparation/merge_data.R \ - src/data-preparation/clean_data.R - Rscript src/data-preparation/update_input.R - Rscript src/data-preparation/merge_data.R - Rscript src/data-preparation/clean_data.R - -# Download data -data/dataset1/dataset1.csv data/dataset2/dataset2.csv: src/data-preparation/download_data.R - Rscript src/data-preparation/download_data.R - -# Clean-up: Deletes temporary files -# Note: Using R to delete files keeps platform-independence. 
-# --vanilla option prevents from storing .RData output -clean: - Rscript --vanilla src/clean-up.R - diff --git a/src/analysis/Makefile b/src/analysis/Makefile new file mode 100644 index 0000000..419cfa4 --- /dev/null +++ b/src/analysis/Makefile @@ -0,0 +1,6 @@ +AOUTPUT = ../../gen/analysis/output + +all: $(AOUTPUT)/plot_eu_cities.png + +$(AOUTPUT)/plot_eu_cities.png: analyze.R ../data-preparation/cleaned_dataset.csv + R --vanilla < analyze.R \ No newline at end of file diff --git a/src/analysis/analyze.R b/src/analysis/analyze.R index 776cd15..eac996f 100644 --- a/src/analysis/analyze.R +++ b/src/analysis/analyze.R @@ -1,11 +1,125 @@ -# load -load("./gen/analysis/input/data_cleaned.RData") +# Load the R-packages +library(readr) +library(dplyr) +library(stringr) +library(tidyr) +library(data.table) +library(ggplot2) +library(afex) +library(lmerTest) +library(postHoc) +library(car) +library(effectsize) +library(emmeans) -# Estimate model 1 -m1 <- lm(V1 ~ V3 + V4,df_cleaned) +# Import the cleaned data +cleaned_dataset <- read_csv("cleaned_dataset.csv") -# Estimate model 2 -m2 <- lm(V1 ~ V3 + V4 + V5 , df_cleaned) +data_airbnb_ANOVA <- sample_n(cleaned_dataset, 5000) -# Save results -save(m1,m2,file="./gen/analysis/output/model_results.RData") \ No newline at end of file +# Homoscedasticity +## city +leveneTest(data_airbnb_ANOVA$price ~ interaction(data_airbnb_ANOVA$city, data_airbnb_ANOVA$wDay), center=mean) +leveneTest(data_airbnb_ANOVA$price ~ interaction(data_airbnb_ANOVA$united_states, data_airbnb_ANOVA$wDay), center=mean) +leveneTest(data_airbnb_ANOVA$price ~ interaction(data_airbnb_ANOVA$europe, data_airbnb_ANOVA$wDay), center=mean) +leveneTest(price ~ city, data_airbnb_ANOVA, center=mean) + +## roomtype +leveneTest(data_airbnb_ANOVA$price ~ interaction(data_airbnb_ANOVA$room_type, data_airbnb_ANOVA$wDay), center=mean) +leveneTest(price ~ room_type, data_airbnb_ANOVA, center=mean) + +# Normality +shapiro.test(data_airbnb_ANOVA$price) + +# One-way ANOVA's with 
wDay, room_type and city as independent variable and price as dependent variable +anova_wDay <- aov(price ~ wDay, data_airbnb_ANOVA) +anova_wDay_summary <- summary(anova_wDay) + +# Save anova_wDay +capture.output(anova_wDay_summary, file = "../../gen/analysis/output/anova_wDay_summary.txt") + +anova_room_type <- aov(price ~ room_type, data_airbnb_ANOVA) +anova_room_type_summary <- summary(anova_room_type) + +# Save anova_room_type +capture.output(anova_room_type_summary, file = "../../gen/analysis/output/anova_room_type_summary.txt") + +anova_city <- aov(price ~ city, data_airbnb_ANOVA) +anova_city_summary <- summary(anova_city) + +# Save anova_city +capture.output(anova_city_summary, file = "../../gen/analysis/output/anova_city_summary.txt") + +## room_type moderator +mod_room_type_wDay <- aov(data_airbnb_ANOVA$price ~ interaction(data_airbnb_ANOVA$room_type, data_airbnb_ANOVA$wDay)) +mod_room_type_wDay_summary <- summary(mod_room_type_wDay) + +# Save output +capture.output(mod_room_type_wDay_summary, file = "../../gen/analysis/output/mod_roomtype_wDay_interaction_results.txt") + +# Effect size for the ANOVA's +eta_squared(anova_wDay, ci=0.95, partial = TRUE) + +eta_squared(anova_room_type, ci=0.95, partial = TRUE) + +eta_squared(anova_city, ci=0.95, partial = TRUE) + +# Moderation effect of city and room_type +## city +mod_city_wDay <- aov(data_airbnb_ANOVA$price ~ interaction(data_airbnb_ANOVA$city, data_airbnb_ANOVA$wDay)) +mod_city_wDay_summary <- summary(mod_city_wDay) + +# Save output +capture.output(mod_city_wDay_summary, file = "../../gen/analysis/output/mod_city_wDay_interaction_results.txt") + +# Tukey tests for moderation effect +TukeyHSD(mod_room_type_wDay) +TukeyHSD(mod_city_wDay) + +# Effect size of the ANOVAs with moderation effect +eta_squared(mod_room_type_wDay, ci=0.95, partial = TRUE) +eta_squared(mod_city_wDay, ci=0.95, partial = TRUE) + +# Difference in average price between weekdays and weekend days in United States +data_airbnb_ANOVA %>% + 
filter(united_states == 'TRUE') %>% + group_by(wDay) %>% + summarize(mean_price = mean(price)) + +# Difference in average price between weekdays and weekend days in Europe +data_airbnb_ANOVA %>% + filter(europe == 'TRUE') %>% + group_by(wDay) %>% + summarize(mean_price = mean(price)) + +# Difference between average price in weekend and during the week for cities in United States +data_airbnb_ANOVA %>% + filter(united_states == 'TRUE') %>% + group_by(city, wDay) %>% + summarize(mean_price = mean(price)) + +# Difference between average price in weekend and during the week for cities in Europe +data_airbnb_ANOVA %>% + filter(europe == 'TRUE') %>% + group_by(city, wDay) %>% + summarize(mean_price = mean(price)) + +data_airbnb_ANOVA_uscities <- cleaned_dataset %>% filter(united_states == TRUE) +data_airbnb_ANOVA_uscities$wDay <- as.numeric(data_airbnb_ANOVA_uscities$wDay) +dt_price_uscities <- as.data.table(data_airbnb_ANOVA_uscities) +plot_price_uscities <- dt_price_uscities[, .(mean_price = mean(price)), + by = .(wDay, city)] + +data_airbnb_ANOVA_eucities <- cleaned_dataset %>% filter(united_states == FALSE) +data_airbnb_ANOVA_eucities$wDay <- as.numeric(data_airbnb_ANOVA_eucities$wDay) +dt_price_eucities <- as.data.table(data_airbnb_ANOVA_eucities) +plot_price_eucities <- dt_price_eucities[, .(mean_price = mean(price)), + by = .(wDay, city)] + +#Barplot United States +ggplot(plot_price_uscities, aes(x = wDay, y =mean_price)) + geom_bar(stat = "identity") + facet_wrap(~ city) +ggsave(filename = "../../gen/analysis/output/plot_us_cities.png", width = 15, height = 6, dpi = 100, units = "cm") + +#Barplot Europe +ggplot(plot_price_eucities, aes(x = wDay, y =mean_price)) + geom_bar(stat = "identity") + facet_wrap(~ city) +ggsave(filename = "../../gen/analysis/output/plot_eu_cities.png", width = 15, height = 6, dpi = 100, units = "cm") diff --git a/src/analysis/update_input.R b/src/analysis/update_input.R deleted file mode 100644 index e638bc3..0000000 --- 
a/src/analysis/update_input.R +++ /dev/null @@ -1,4 +0,0 @@ -# Copy output from data-preparation into input folder -# This step really depends no how files are shared across the different stages (e.g. if whole pipeline -# is on a single machine, could directly access output folder from prepaaration stage) -file.copy("./gen/data-preparation/output/data_cleaned.RData","./gen/analysis/input/data_cleaned.RData") diff --git a/src/clean-up.R b/src/clean-up.R deleted file mode 100644 index 1c54635..0000000 --- a/src/clean-up.R +++ /dev/null @@ -1,16 +0,0 @@ -# Deletes files in all subdirectories with the endings specified here -fileEndings <- c('*.log','*.aux','*.Rout','*.Rhistory','*.fls','*.fdb_latexmk') -for (fi in fileEndings) { - files <- list.files(getwd(),fi,include.dirs=F,recursive=T,full.names=T,all.files=T) - file.remove(files) -} - -# Delete all files in temp directories -# (does note delete hidden files starting with . (e.g. .gitkeep is not deleted)) -unlink(paste(getwd(),'/gen/analysis/temp/*',sep=''),recursive=T,force=T) -unlink(paste(getwd(),'/gen/data-preparation/temp/*',sep=''),recursive=T,force=T) -unlink(paste(getwd(),'/gen/paper/temp/*',sep=''),recursive=T,force=T) - -# Delete temporary (hidden) R files -file.remove('.RData') -file.remove('.Rhistory') \ No newline at end of file diff --git a/src/data-preparation/README_data_preparation.md b/src/data-preparation/README_data_preparation.md new file mode 100644 index 0000000..c803902 --- /dev/null +++ b/src/data-preparation/README_data_preparation.md @@ -0,0 +1,62 @@ +# **Data exploration & preparation** + +## **1. Data exploration** +### **1.1 Explore raw data using summary statistics** + +**Statistics: observations per region** +The original dataset (raw data) contains 6,338,415 observations divided over two regions: the United States and Europe. + +image + +\ +\ +The original dataset contains 3,839,056 observations in Europe. The observations of Europe are spread over 5 different cities. 
+image + +image + +\ +\ +The original dataset contains 2,499,359 observations in the United States. The observations of the United States are spread over 5 different cities. +image + +image + +### **1.2 Detect the origin of missing values** +Missing values are checked with `colSums(is.na(cleaned_dataset))`. The output indicated that there were no missing values (NAs). This was expected, because Airbnb data is generated automatically. People who rent out their home or apartment on Airbnb do not have to enter any information themselves. + +## **2. Data preparation** +### **2.1 Create new variables as reference to short term stays** + +Because listings differ in their minimum stay duration, a new variable named "short_term_stays" is created. This variable flags listings with a minimum stay of 1, 2, 3 or 4 nights. + +### **2.2 Delete unnecessary columns** +As this research focuses on the main question *"To what extent does the day of the week (weekday vs. weekend) impact pricing of Airbnb? And does this significantly differ per room type, and does this significantly differ between the cities (top 5 cities U.S. vs. top 5 cities Europe)?"*, some variables in the original dataset are not needed to come to an answer. For that reason only the following variables are kept in the dataset to maintain a clear overview of the variables needed for this research: + +- **weekdag**: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday +- **wDay**: computed variable of weekdays (Monday, Tuesday, Wednesday, Thursday, Sunday) vs. weekend (Friday, Saturday) +- **short_term_stays**: 1, 2, 3 or 4 nights +- **room_type**: Private room, entire home/ apartment, shared room or hotel +- **city**: city in which the Airbnb is listed → Top 5 most popular Airbnb cities in the U.S. 
and top 5 most popular Airbnb cities in Europe +- **united_states**: Portland, San-Francisco, Denver, Los Angeles, New-York +- **europe**: Munich, Milan, Paris, London, Dublin +- **price**: this is the price of the room type on a random day during the week or during the weekend +- **date**: the date is used to check which price the host asks on a certain date + +### **2.3 Change the variable "price" to numeric** +The data exploration revealed that the variable "price" in this data is imported as a character. For further analysis, the variable "price" must be converted to a numeric value. Before this could be done, the dollar sign and the comma used in prices greater than a thousand had to be removed. + +### **2.4 Detect outliers in the data** +From the statistics (boxplots and ggplots) used to detect outliers, it was observed that there are a few cases in which outliers seem to exist. + + +**Outliers within the variable "minimum_nights"**\ +From the boxplot it is observed that there are 26 extreme outliers, which have a minimum stay of 9999 nights. All these outliers are from one listing ID. This is remarkable because the mean including this outlier is 33.7. By filtering on “minimum_nights” below 5, it is observed that 3,791,936 observations have the possibility of a short term stay. + +**Outliers within the variable "price"**\ +The boxplot shows that there are some cases for which the price is €0.00. There are 1525 outliers which have a price of €0.00 per night. The data indicates that several of these observations share the same id but relate to different days of the week. + +Only 0.02% of the observations are outliers, which is very small in perspective and will therefore not have any impact on the results. + +## **3. After cleaning** +The cleaned dataset (cleaned_dataset) contains approximately 6 million observations and 10 variables to work with. The cleaned dataset contains information about 10 different cities. 
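Review note on README_data_preparation.md: the preparation steps described above (2.1 short-term flag, the wDay classification, 2.3 price conversion) can be illustrated on a tiny example. The following is only a minimal sketch in base R under stated assumptions: the data frame `toy` and its values are made up for illustration, and `gsub()` stands in for whatever parsing the project script actually uses. It is not the repository's implementation.

```r
# Hypothetical toy rows; dates, prices and minimum_nights are invented for illustration.
toy <- data.frame(
  date = as.Date(c("2022-10-03", "2022-10-07", "2022-10-08", "2022-10-09")),
  price = c("$85.00", "$1,250.00", "$99.50", "$120.00"),
  minimum_nights = c(2, 1, 30, 4),
  stringsAsFactors = FALSE
)

# Step 2.3: remove the dollar sign and the thousands comma, then convert to numeric.
toy$price <- as.numeric(gsub("[$,]", "", toy$price))

# wDay: ISO day number (1 = Monday ... 7 = Sunday).
# Friday (5) and Saturday (6) count as weekend, all other days as weekday.
toy$day_num <- as.integer(format(toy$date, "%u"))
toy$wDay <- ifelse(toy$day_num %in% c(5, 6), "weekend", "weekday")

# Step 2.1: flag rows whose minimum stay is 1, 2, 3 or 4 nights.
toy$short_term_stays <- toy$minimum_nights %in% 1:4

print(toy)
```

With these toy dates (a Monday, Friday, Saturday and Sunday), only the Friday and Saturday rows are labelled as weekend, which matches the wDay definition in the variable list above.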
diff --git a/src/data-preparation/clean_data.R b/src/data-preparation/clean_data.R deleted file mode 100644 index 4eadde3..0000000 --- a/src/data-preparation/clean_data.R +++ /dev/null @@ -1,11 +0,0 @@ -# Load merged data -load("./gen/data-preparation/temp/data_merged.RData") - -# Drop observations with V1 <= -0.9 -df_cleaned <- df_merged[df_merged$V1 > -0.9,] - -# Remove V1 -df_cleaned <- df_cleaned[,c(1,2,4:7)] - -# Save cleaned data -save(df_cleaned,file="./gen/data-preparation/output/data_cleaned.RData") diff --git a/src/data-preparation/cleaning_file.r b/src/data-preparation/cleaning_file.r new file mode 100644 index 0000000..684240e --- /dev/null +++ b/src/data-preparation/cleaning_file.r @@ -0,0 +1,47 @@ +library(tidyverse) + +calender_data <- read.csv("calender_data.csv") +listing_data <- read.csv("listing_data.csv") + +merged_data <- calender_data %>% + left_join(listing_data, by = c("listing_id" = "id")) + +cleaned_dataset <- na.omit(merged_data) + +cleaned_dataset$date <- as.Date(cleaned_dataset$date) + +cleaned_dataset$day_num <- format(cleaned_dataset$date,"%u") +cleaned_dataset$day_num <- as.numeric(cleaned_dataset$day_num) + +cleaned_dataset$weekdag <- weekdays(cleaned_dataset$date) + + +weekdays1 <- c(1, 2, 3, 4, 7) +cleaned_dataset$wDay <- factor(((cleaned_dataset$day_num) %in% weekdays1), levels=c(FALSE, TRUE), labels=c('weekend', 'weekday')) + + +# We still need to change the price variable into a variable that stores the numeric value of price without the dollar sign +cleaned_dataset$price <- parse_number(cleaned_dataset$price) + +## Make dummy variable of variable wDay +cleaned_dataset$wDay <- ifelse(cleaned_dataset$wDay == "weekday", 1,0) + +## Create new column for short-term stays +short_term_stays <- c('1', '2', '3', '4') +cleaned_dataset$short_term_stays <- factor(cleaned_dataset$minimum_nights %in% short_term_stays) + +## Create new column for cities in U.S. 
and Europe +united_states <- c('denver', 'portland', 'san-francisco', 'los-angeles', 'new-york-city') +cleaned_dataset$united_states <- factor(cleaned_dataset$city %in% united_states) + +europe <- c('munich', 'london', 'paris', 'milan', 'dublin') +cleaned_dataset$europe <-factor(cleaned_dataset$city %in% europe) + +# Delete columns that are not needed for analyses +cleaned_dataset <- cleaned_dataset %>% + dplyr::select(5,7,9,11,12,13,14,15,16,17) + +write.csv(cleaned_dataset, "cleaned_dataset.csv") + +file.copy(from="cleaned_dataset.csv", to='../../src/analysis') +# File copy needs to be copied towards the gen/analysis/input \ No newline at end of file diff --git a/src/data-preparation/compiling_file.r b/src/data-preparation/compiling_file.r new file mode 100644 index 0000000..ffbf1d2 --- /dev/null +++ b/src/data-preparation/compiling_file.r @@ -0,0 +1,64 @@ +library(tidyverse) +urls_calender = c("http://data.insideairbnb.com/united-states/co/denver/2022-09-26/data/calendar.csv.gz", + "http://data.insideairbnb.com/ireland/leinster/dublin/2022-09-11/data/calendar.csv.gz", + "http://data.insideairbnb.com/united-kingdom/england/london/2022-09-10/data/calendar.csv.gz", + "http://data.insideairbnb.com/united-states/ca/los-angeles/2022-09-09/data/calendar.csv.gz", + "http://data.insideairbnb.com/italy/lombardy/milan/2022-09-14/data/calendar.csv.gz", + "http://data.insideairbnb.com/germany/bv/munich/2022-06-21/data/calendar.csv.gz", + "http://data.insideairbnb.com/united-states/ny/new-york-city/2022-09-07/data/calendar.csv.gz", + "http://data.insideairbnb.com/france/ile-de-france/paris/2022-06-06/data/calendar.csv.gz", + "http://data.insideairbnb.com/united-states/or/portland/2022-09-16/data/calendar.csv.gz", + "http://data.insideairbnb.com/united-states/ca/san-francisco/2022-09-07/data/calendar.csv.gz") +urls_listing = c("http://data.insideairbnb.com/united-states/co/denver/2022-09-26/data/listings.csv.gz", + 
"http://data.insideairbnb.com/ireland/leinster/dublin/2022-09-11/data/listings.csv.gz", + "http://data.insideairbnb.com/united-kingdom/england/london/2022-09-10/data/listings.csv.gz", + "http://data.insideairbnb.com/united-states/ca/los-angeles/2022-09-09/data/listings.csv.gz", + "http://data.insideairbnb.com/italy/lombardy/milan/2022-09-14/data/listings.csv.gz", + "http://data.insideairbnb.com/germany/bv/munich/2022-06-21/data/listings.csv.gz", + "http://data.insideairbnb.com/united-states/ny/new-york-city/2022-09-07/data/listings.csv.gz", + "http://data.insideairbnb.com/france/ile-de-france/paris/2022-06-06/data/listings.csv.gz", + "http://data.insideairbnb.com/united-states/or/portland/2022-09-16/data/listings.csv.gz", + "http://data.insideairbnb.com/united-states/ca/san-francisco/2022-09-07/data/listings.csv.gz") + +calender_data <- lapply(urls_calender, function(url) { + ds = read_csv(url, n_max = Inf) + city_name = strsplit(url, '/')[[1]][6] + ds = ds %>% mutate(city = city_name) + ds +}) + +# Try random sample: +# calender_data <- lapply(urls_calender, function(url) { +#ds = sampleCSV(url, 5000) +#city_name = strsplit(url, '/')[[1]][6] +#ds = ds %>% mutate(city = city_name) +#ds +#}) + + +calender_data1 <- calender_data[1:3] %>% bind_rows() +sample_calender_data <- sample_n(calender_data1, nrow(calender_data1)/15) +rm(calender_data1) +calender_data2<- calender_data[4:7] %>% bind_rows() +sample_calender_data2 <- sample_n(calender_data2, nrow(calender_data2)/15) +rm(calender_data2) +calender_data3 <- calender_data[8:10] %>% bind_rows() +sample_calender_data3 <- sample_n(calender_data3, nrow(calender_data3)/15) +rm(calender_data3) +rm(calender_data) + +calender_data <- bind_rows(sample_calender_data, sample_calender_data2, sample_calender_data3) +write.csv(calender_data, "calender_data.csv") +rm(sample_calender_data, sample_calender_data2, sample_calender_data3) + +listing_data <- lapply(urls_listing, function(url) { + ds = read_csv(url, col_select = 
c("id","room_type"), n_max = Inf) + ds +}) + +listing_data <- listing_data %>% bind_rows() +write.csv(listing_data, "listing_data.csv") + +sink('../../data/datafiles1.txt') +cat('done!') +sink() \ No newline at end of file diff --git a/src/data-preparation/download_data.R b/src/data-preparation/download_data.R deleted file mode 100644 index 9fbe43f..0000000 --- a/src/data-preparation/download_data.R +++ /dev/null @@ -1,8 +0,0 @@ -# Download dataset 1 -# dir.create('./data/dataset1') # Uncomment if need to create directory with R -download.file('https://rgreminger.github.io/files/dataset1.csv','./data/dataset1/dataset1.csv') - -# Download dataset 2 -# dir.create('./data/dataset2') # Uncomment if need to create directory with R -download.file('https://rgreminger.github.io/files/dataset2.csv','./data/dataset2/dataset2.csv') - diff --git a/src/data-preparation/download_file.r b/src/data-preparation/download_file.r new file mode 100644 index 0000000..23918e5 --- /dev/null +++ b/src/data-preparation/download_file.r @@ -0,0 +1,46 @@ +# Load packages +library(tidyverse) +#Create data folder +dir.create("../../data") + +# Input +urls_calender = c("http://data.insideairbnb.com/united-states/co/denver/2022-09-26/data/calendar.csv.gz", + "http://data.insideairbnb.com/ireland/leinster/dublin/2022-09-11/data/calendar.csv.gz", + "http://data.insideairbnb.com/united-kingdom/england/london/2022-09-10/data/calendar.csv.gz", + "http://data.insideairbnb.com/united-states/ca/los-angeles/2022-09-09/data/calendar.csv.gz", + "http://data.insideairbnb.com/italy/lombardy/milan/2022-09-14/data/calendar.csv.gz", + "http://data.insideairbnb.com/germany/bv/munich/2022-06-21/data/calendar.csv.gz", + "http://data.insideairbnb.com/united-states/ny/new-york-city/2022-09-07/data/calendar.csv.gz", + "http://data.insideairbnb.com/france/ile-de-france/paris/2022-06-06/data/calendar.csv.gz", + "http://data.insideairbnb.com/united-states/or/portland/2022-09-16/data/calendar.csv.gz", + 
"http://data.insideairbnb.com/united-states/ca/san-francisco/2022-09-07/data/calendar.csv.gz") +urls_listing = c("http://data.insideairbnb.com/united-states/co/denver/2022-09-26/data/listings.csv.gz", + "http://data.insideairbnb.com/ireland/leinster/dublin/2022-09-11/data/listings.csv.gz", + "http://data.insideairbnb.com/united-kingdom/england/london/2022-09-10/data/listings.csv.gz", + "http://data.insideairbnb.com/united-states/ca/los-angeles/2022-09-09/data/listings.csv.gz", + "http://data.insideairbnb.com/italy/lombardy/milan/2022-09-14/data/listings.csv.gz", + "http://data.insideairbnb.com/germany/bv/munich/2022-06-21/data/listings.csv.gz", + "http://data.insideairbnb.com/united-states/ny/new-york-city/2022-09-07/data/listings.csv.gz", + "http://data.insideairbnb.com/france/ile-de-france/paris/2022-06-06/data/listings.csv.gz", + "http://data.insideairbnb.com/united-states/or/portland/2022-09-16/data/listings.csv.gz", + "http://data.insideairbnb.com/united-states/ca/san-francisco/2022-09-07/data/listings.csv.gz") + +# Transformation and output + +for (url in urls_calender) { + filename = paste(gsub('[^a-zA-Z]', '', url), '.csv') + filename = gsub('httpdatainsideairbnbcom', '', filename) + download.file(url, destfile = paste0('../../data/', filename)) # download file +} +for (url in urls_listing) { + filename = paste(gsub('[^a-zA-Z]', '', url), '.csv') + filename = gsub('httpdatainsideairbnbcom', '', filename) + download.file(url, destfile = paste0('../../data/', filename)) # download file +} + + + + +sink('../../data/datafiles.txt') +cat('done!') +sink() diff --git a/src/data-preparation/makefile b/src/data-preparation/makefile new file mode 100644 index 0000000..3dd77c6 --- /dev/null +++ b/src/data-preparation/makefile @@ -0,0 +1,11 @@ +all: ../../src/cleaned_dataset.csv + + +../../data/datafiles.txt: download_file.r + R --vanilla < download_file.r + +../../data/datafiles1.txt: compiling_file.r ../../data/datafiles.txt + R --vanilla < compiling_file.r + 
+../../src/cleaned_dataset.csv: cleaning_file.r ../../data/datafiles1.txt + R --vanilla < cleaning_file.r \ No newline at end of file diff --git a/src/data-preparation/merge_data.R b/src/data-preparation/merge_data.R deleted file mode 100644 index 636a7c4..0000000 --- a/src/data-preparation/merge_data.R +++ /dev/null @@ -1,9 +0,0 @@ -# Load datasets into R -df1 <- read.csv("./gen/data-preparation/input/dataset1.csv") -df2 <- read.csv("./gen/data-preparation/input/dataset2.csv") - -# Merge on id -df_merged <- merge(df1,df2,by="id") - -# Save merged data -save(df_merged,file="./gen/data-preparation/temp/data_merged.RData") \ No newline at end of file diff --git a/src/data-preparation/update_input.R b/src/data-preparation/update_input.R deleted file mode 100644 index 2f21f1c..0000000 --- a/src/data-preparation/update_input.R +++ /dev/null @@ -1,5 +0,0 @@ -# Copy the raw data into input folder -# This step really depends no how files are shared across the different stages (e.g. if whole pipeline -# is on a single machine, could directly access data from data directory) -file.copy("./data/dataset1/dataset1.csv","./gen/data-preparation/input/dataset1.csv") -file.copy("./data/dataset2/dataset2.csv","./gen/data-preparation/input/dataset2.csv") diff --git a/src/paper/paper.tex b/src/paper/paper.tex deleted file mode 100644 index 8fafebe..0000000 --- a/src/paper/paper.tex +++ /dev/null @@ -1,6 +0,0 @@ -\documentclass{article} -\usepackage{import} % required to use relative path -\begin{document} -This is a test document to show a table. Table \ref{tab:example} is an example of how to display model results from R. 
-\input{gen/paper/output/table1.tex} -\end{document} \ No newline at end of file diff --git a/src/paper/tables.R b/src/paper/tables.R deleted file mode 100644 index d25d034..0000000 --- a/src/paper/tables.R +++ /dev/null @@ -1,10 +0,0 @@ -# Load results -load("./gen/analysis/output/model_results.RData") - -# Load in additional package to export to latex table -require(stargazer) - -# Export to latex table (omits f-stat since messes up table) -stargazer(m1,m2,out="./gen/paper/output/table1.tex", - title = "Example results", label = "tab:example", - omit.stat=c("f"))