Link to Streamlit app:
- The dataset contains real estate sales data for Milwaukee (a city in Wisconsin, USA) from 2002 to 2022
- The dataset is obtained from Milwaukee's Open Data Portal
- Refer to the link below to download the dataset
- Link:
Field Name | Description | Type |
PropertyID | A unique identifier for each property. | int64 |
PropType | The type of property (e.g., Commercial or Residential). | Object |
taxkey | The tax key associated with the property. | Object |
Address | The address of the property. | Object |
CondoProject | Information about whether the property is part of a condominium project (NaN indicates missing data). | Object |
District | The district number for the property. | Object |
nbhd | The neighborhood number for the property. | Object |
Style | The architectural style of the property. | Object |
Extwall | The type of exterior wall material used. | Object |
Stories | The number of stories in the building. | float64 |
Year_Built | The year the property was built. | int64 |
Rooms | The number of rooms in the property. | Object |
FinishedSqft | The total square footage of finished space in the property. | int64 |
Units | The number of units in the property (e.g., apartments in a multifamily building). | Object |
Bdrms | The number of bedrooms in the property. | Object |
Fbath | The number of full bathrooms in the property. | Object |
Hbath | The number of half bathrooms in the property. | Object |
Lotsize | The size of the lot associated with the property. | int64 |
Sale_date | The date when the property was sold. | datetime |
Sale_price | The sale price of the property. | int64 |
- Train a regression model that predicts the sale price of a real estate property from its features.
- Target variable: Sale_price (int64).
- The source provides a separate dataset for each year from 2002 to 2022, so the files must be concatenated into one.
- Refer to the file concat.ipynb for more information on the data merging process.
- Changed the Sale_date format to YYYY-MM in the 2019-2022 datasets to match the 2002-2018 datasets
- Removed missing values (NaN values)
- Dropped insignificant features
- Standardized the data types before merging into a single file
- Skipped data cleaning in this step because it was already performed in concat.ipynb
- Created new features from Sale_date: year_sold and month_sold
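The merging and feature-creation steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of concat.ipynb: the two small frames are hypothetical stand-ins for yearly files, and the YYYY-MM-DD format of the newer files is an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly files (not real portal data).
df_2018 = pd.DataFrame({
    "Sale_date": ["2018-03", "2018-11"],        # older files: YYYY-MM
    "Sale_price": [120000, 95000],
})
df_2019 = pd.DataFrame({
    "Sale_date": ["2019-07-15", "2019-12-01"],  # assumed format of newer files
    "Sale_price": [150000, None],               # one missing value
})

# Align the newer file with the older YYYY-MM format.
df_2019["Sale_date"] = pd.to_datetime(df_2019["Sale_date"]).dt.strftime("%Y-%m")

# Concatenate the yearly frames and drop rows with missing values.
sales = pd.concat([df_2018, df_2019], ignore_index=True).dropna()

# Derive year_sold and month_sold from the aligned date column.
parsed = pd.to_datetime(sales["Sale_date"], format="%Y-%m")
sales["year_sold"] = parsed.dt.year
sales["month_sold"] = parsed.dt.month

print(sales)
```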
- Finished square footage (Fin_sqft)
- Lot size
- Year Sold
- Year Built
- Number of Rooms
- Number of Bedrooms
- Number of Stories
- Number of full bathrooms (Fbath)
- District Type
- Style Type
- Extwall type
- Month
- Address
- Nbhd
- PropType (due to a large class imbalance)
- Units
- Dropped insignificant features: ['Address', 'PropType', 'Nbhd', 'month_sold', 'Hbath', 'Units']
- Converted categorical features to the correct datatype (object) before one-hot encoding
- Performed one-hot encoding on the categorical features: ['District', 'Style', 'Extwall', 'Nr_of_rms', 'Bdrms', 'Fbath']
Stories | Year_Built | Fin_sqft | Lotsize | year_sold | Sale_price | District_1 | District_2 | District_3 | District_4 | ... | Bdrms_32 | Fbath_0 | Fbath_1 | Fbath_2 | Fbath_3 | Fbath_4 | Fbath_5 | Fbath_6 | Fbath_7 | Fbath_10 |
2.0 | 1913 | 3476 | 5040 | 2002 | 42000 | False | False | False | False | ... | False | False | True | False | False | False | False | False | False | False |
2.0 | 1897 | 1992 | 2880 | 2002 | 145000 | False | False | True | False | ... | False | False | False | True | False | False | False | False | False | False |
2.0 | 1907 | 2339 | 3185 | 2002 | 30000 | False | False | False | True | ... | False | False | True | False | False | False | False | False | False | False |
2.0 | 1890 | 2329 | 5781 | 2002 | 66500 | False | False | False | True | ... | False | False | True | False | False | False | False | False | False | False |
2.5 | 1891 | 7450 | 15600 | 2002 | 150500 | False | False | False | True | ... | False | False | False | False | False | False | False | True | False | False |
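The drop, cast, and encode steps that produce a table like the one above could look something like this. It is a sketch under assumptions: the three-row frame is made up for illustration, and `pd.get_dummies` is assumed to be the encoder, since the generated column names (`District_1`, `Fbath_0`, ...) follow its naming pattern.

```python
import pandas as pd

# Hypothetical miniature of the merged data; column names follow the lists above.
df = pd.DataFrame({
    "Address": ["1 Main St", "2 Oak Ave", "3 Elm Rd"],
    "PropType": ["Residential"] * 3,
    "Nbhd": [10, 20, 10],
    "month_sold": [1, 6, 9],
    "Hbath": [0, 1, 0],
    "Units": [1, 1, 2],
    "District": [3, 4, 4],
    "Style": ["Cape Cod", "Ranch", "Ranch"],
    "Extwall": ["Brick", "Frame", "Brick"],
    "Nr_of_rms": [6, 5, 8],
    "Bdrms": [3, 2, 4],
    "Fbath": [1, 1, 2],
    "Stories": [2.0, 1.0, 2.5],
    "Sale_price": [42000, 145000, 30000],
})

drop_cols = ["Address", "PropType", "Nbhd", "month_sold", "Hbath", "Units"]
cat_cols = ["District", "Style", "Extwall", "Nr_of_rms", "Bdrms", "Fbath"]

# Drop the insignificant features.
df = df.drop(columns=drop_cols)

# Cast the categorical features to object so numeric codes are treated as labels.
df = df.astype({col: object for col in cat_cols})

# One-hot encode: each category value becomes its own indicator column.
encoded = pd.get_dummies(df, columns=cat_cols)

print(encoded.columns.tolist())
```

Numeric columns such as Stories and Sale_price pass through unchanged; only the columns listed in `cat_cols` are expanded.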
- Split the dataset using train_test_split (80% training, 20% test)
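As a rough illustration of the 80/20 split, the snippet below applies scikit-learn's `train_test_split` to a made-up 10-row frame. The data and the `random_state` value are assumptions; the source does not say which seed, if any, was used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical encoded frame with 10 rows (random values, not real sales).
rng = np.random.default_rng(0)
encoded = pd.DataFrame({
    "Fin_sqft": rng.integers(800, 4000, size=10),
    "Lotsize": rng.integers(2000, 16000, size=10),
    "Sale_price": rng.integers(30000, 300000, size=10),
})

# Separate the features from the target variable.
X = encoded.drop(columns="Sale_price")
y = encoded["Sale_price"]

# test_size=0.2 holds out 20% of the rows; random_state (assumed) makes it repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With 10 rows, this leaves 8 rows for training and 2 for testing.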
Metric | Value |
MSE | 1,939,965,795 |
MAE | 28,216 |
R2 | 0.790 |
Metric | Value |
MSE | 2,047,237,885 |
MAE | 28,953 |
R2 | 0.778 |
Metric | Value |
MSE | 2,960,230,826 |
MAE | 34,375 |
R2 | 0.679 |
- In the preliminary model fitting, XGBoost performed best of the three models, with an R^2 of 0.790, so it was selected for fine-tuning.
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid to search
param_grid = {
    "learning_rate": [0.05, 0.10, 0.15],
    "max_depth": [3, 4, 5, 6, 8],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2],
    "colsample_bytree": [0.3, 0.4],
}

# Initialize GridSearchCV (xgboost is assumed to be the base XGBoost regressor from the preliminary fit)
grid_search = GridSearchCV(estimator=xgboost, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5, verbose=0, n_jobs=-1)

# Perform grid search
grid_search.fit(X_train, y_train)

# Get the best parameters and best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_model.predict(X_test)
Metric | Value |
MSE | 1,785,590,381 |
MAE | 27,957 |
R2 | 0.806 |
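The source does not show how the MSE, MAE, and R2 values were computed. One plausible way, assuming scikit-learn's metric functions, is sketched below; `regression_report` is a hypothetical helper and the arrays are toy values, not the project's predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return the three metrics used in the tables above as a dict."""
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
    }


# Toy values for illustration only; every prediction is off by 10,000.
y_true = np.array([100000, 150000, 200000, 250000])
y_hat = np.array([110000, 140000, 210000, 240000])

report = regression_report(y_true, y_hat)
print(report)
```

In the project itself, the same helper could be called with `y_test` and `y_pred`.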
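- Tuning improved the model slightly over the base XGBoost model: R2 rose from 0.790 to 0.806, MSE fell from 1,939,965,795 to 1,785,590,381, and MAE fell from 28,216 to 27,957.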
- Revisit feature engineering
- Normalize variables
- Apply regularization
- Expand the hyperparameter grid further to improve performance
- The app takes feature values entered by the end user and predicts the sale price
- Link to the app:
- Feel free to test the app! Let me know if you encounter any issues.
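One detail an app like this has to handle is turning a single user entry into the same one-hot-encoded columns the model was trained on. The app's actual code is not shown here, so the following is only a sketch: `encode_user_input`, `train_columns`, and the sample input are hypothetical.

```python
import pandas as pd


def encode_user_input(raw, cat_cols, train_columns):
    """Encode one user entry and align it to the training column layout."""
    row = pd.DataFrame([raw])
    row = pd.get_dummies(row, columns=cat_cols)
    # Add indicator columns the single row cannot produce, filled with False,
    # and put every column in the order used during training.
    return row.reindex(columns=train_columns, fill_value=False)


# Hypothetical subset of the training columns.
train_columns = ["Stories", "Fin_sqft", "District_3", "District_4", "Fbath_1", "Fbath_2"]

# Hypothetical values a user might enter in the app.
user_input = {"Stories": 2.0, "Fin_sqft": 1992, "District": 3, "Fbath": 2}

features = encode_user_input(user_input, ["District", "Fbath"], train_columns)
print(features)
# The aligned row could then be passed to the tuned model, e.g. best_model.predict(features).
```

Reindexing against the training columns matters because a single row only generates indicator columns for the categories it contains, while the model expects the full set.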