This project develops a machine learning model to predict hotel prices from factors such as star rating, review rating, and the amenities a hotel offers. Using data collected from https://www.trivago.com/, it explores the relationships and correlations among these factors in order to forecast hotel room prices accurately.
Tool | Description |
---|---|
Selenium | Automated web scraping tool used to extract data from the Trivago website by locating elements using ID, class name, and XPath, and extracting text and attribute values. |
PyCharm | IDE used for developing Python scripts for data extraction, cleaning, and analysis. |
Microsoft Azure | Cloud platform utilized to establish a data pipeline for effectively managing and securely storing scraped data in Azure Blob Storage. |
Postman (REST API) | Tool used to create and test REST API endpoints, facilitating practical, real-time integration and application of the machine learning model. |
- Hotel_Data: The starting point of the pipeline. It represents the input dataset of hotel information loaded from scrapped_HotelsDataset.csv.
- Clean Missing Data (clean_missing_data1): Rows with a missing value in the Hotel Rating column are removed from the dataset. This ensures that every row used in later steps has a hotel rating.
- Summarize Data: The data is summarized with descriptive statistics such as mean, median, and mode. This gives an overview of the dataset's characteristics before further processing.
- Clean Missing Data (clean_missing_data_2): Null values in the Review Rating column are replaced with the median of the column.
- Split Data: Splits the dataset into two parts, 70% for training the model and 30% for testing it, so that the model's performance can be validated on unseen data.
- Poisson Regression: Sets up the model to be used for the analysis. Poisson regression is suitable for modeling count data and is often used in scenarios such as predicting the number of bookings or events.
- Train Model: The model is trained on the training dataset. This step adjusts the model parameters to minimize prediction error.
- Select Columns in Dataset: Selects the features (columns) that are relevant for model training, which focuses the model on the important data points.
- Score Model: After training, the model is applied to the test dataset to generate a predicted price for each row, returned in the "Scored Labels" column.
- Evaluate Model: Finally, this step measures the overall performance of the model by comparing the scored predictions with the actual prices in the test dataset, using regression metrics such as MAE, RMSE, and R². This is essential for understanding how well the model is likely to perform in real-world scenarios.
- Selected Features: Retained columns up to "Review Rating" with significant correlations.
- Correlations: Price, Hotel Rating, Pool, Hotel bar, Spa, Restaurant, Parking, Free WiFi, A/C, Review Rating.
- Dropped Features: Removed columns with low or no correlation: WiFi in lobby, Pets, Hotel Name.
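Correlation-based feature selection of this kind can be sketched as follows. The code computes the Pearson coefficient of each feature against the price and keeps the features whose absolute correlation reaches a threshold. The threshold of 0.3, the function names, and the four sample rows are assumptions for illustration; the source does not state the cutoff that was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.3):
    """Return the features whose |r| with the target meets the threshold."""
    scores = {
        name: pearson(values, columns[target])
        for name, values in columns.items()
        if name != target
    }
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores


# Invented values, for illustration only.
columns = {
    "Price": [2500, 4479, 6100, 9800],
    "Hotel Rating": [2, 3, 4, 5],
    "Pets": [1, 0, 0, 1],
}

kept, scores = select_features(columns, target="Price")
```

In this toy data the hotel rating tracks the price closely and is kept, while the pets flag falls below the threshold and is dropped.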
- Lower values of MAE and RMSE indicate better model performance.
- Higher values of R² indicate a better fit of the model to the data.
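For reference, the three metrics can be computed from actual and predicted prices as shown below. This is a small sketch using the standard definitions; the function name and the sample numbers are assumptions, not values from the project.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE, and R² for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Invented values, for illustration only.
metrics = regression_metrics(actual=[100, 200, 300], predicted=[110, 190, 300])
```

MAE and RMSE are in the same unit as the price, so they read directly as a typical error size, whereas R² is unitless and reaches 1.0 only for perfect predictions.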
- Cleaning the Data: Removing or imputing missing values and correcting inconsistencies in the data.
- Feature Selection: Selecting the most relevant features using techniques such as Pearson correlation analysis, and dropping irrelevant features.
- Model Selection: Experimenting with different algorithms (e.g., linear regression, boosted decision tree, decision tree, and Poisson regression) to find the best-performing one.
- Poisson regression showed the best performance, with a reported accuracy of 62%.
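To make the chosen model concrete, here is a minimal single-feature Poisson regression fitted with Newton-Raphson steps on the log-likelihood. It models log(E[y]) = b0 + b1 * x, so predictions are always positive. This is a teaching sketch, independent of whichever implementation the project used; the function names and the noiseless synthetic data are assumptions.

```python
import math


def fit_poisson(xs, ys, iterations=25):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson on the Poisson log-likelihood."""
    b0 = math.log(sum(ys) / len(ys))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(y - mu for y, mu in zip(ys, mus))
        g1 = sum((y - mu) * x for x, y, mu in zip(xs, ys, mus))
        # Negative Hessian (Fisher information), a symmetric 2x2 matrix.
        h00 = sum(mus)
        h01 = sum(mu * x for x, mu in zip(xs, mus))
        h11 = sum(mu * x * x for x, mu in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step: solve H * delta = g with Cramer's rule.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def predict(b0, b1, x):
    """Predicted mean for feature value x."""
    return math.exp(b0 + b1 * x)


# Synthetic, noiseless data generated from known coefficients.
true_b0, true_b1 = 0.5, 0.3
xs = [0, 1, 2, 3, 4, 5]
ys = [math.exp(true_b0 + true_b1 * x) for x in xs]

b0, b1 = fit_poisson(xs, ys)
```

Because the data are generated without noise, the fit recovers the generating coefficients, which makes the example easy to check. Real data would need several features and a library implementation.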
The JSON input sent via Postman includes the hotel name, hotel rating (3.0), review rating (7.1), and available amenities (WiFi, restaurant, etc.). The actual price specified for this hotel is 4479.
The JSON output received from the REST API contains the same hotel details and amenities, plus an additional "Scored Labels" field with the model-predicted price of 4473.08, which can be compared with the actual price.
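The comparison between the actual and predicted price can be sketched as below. Only the numeric values (3.0, 7.1, 4479, 4473.08) and the "Scored Labels" field name come from the text. The other field names, the 0/1 encoding of amenities, and the placeholder hotel name are assumptions, since the exact request schema is not shown here.

```python
import json

# Hypothetical request body; field names other than "Scored Labels" are assumed.
request_payload = {
    "Hotel Name": "Example Hotel",  # placeholder, the real name is not given
    "Hotel Rating": 3.0,
    "Review Rating": 7.1,
    "Free WiFi": 1,
    "Restaurant": 1,
    "Price": 4479,
}

# Simulated response: the same fields plus the model's prediction.
response_text = json.dumps({**request_payload, "Scored Labels": 4473.08})


def compare_prediction(response_json):
    """Return the absolute and percentage error of the scored price."""
    record = json.loads(response_json)
    actual = record["Price"]
    predicted = record["Scored Labels"]
    abs_error = abs(actual - predicted)
    pct_error = abs_error / actual * 100
    return abs_error, pct_error


abs_error, pct_error = compare_prediction(response_text)
```

For the values quoted above, the absolute error is 5.92, which is about 0.13% of the actual price.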