Given weather data, the task is to predict whether it will rain tomorrow. The target variable RainTomorrow
is binary: Yes or No.
Library/Tool | Link | Icon |
---|---|---|
Scikit-learn | Scikit-learn | |
Seaborn | Seaborn | |
Logistic Regression | Logistic Regression | ⚙️ |
XGBoost | XGBoost | |
CatBoost | CatBoost | |
Plotly Express | Plotly Express | |
The goal of this project is to develop a machine learning model that predicts whether it will rain tomorrow based on historical weather data. This involves analyzing various meteorological factors such as temperature, humidity, wind speed, and pressure to determine their influence on rainfall.
The dataset was obtained from Kaggle (link here). It contains weather data for Australia collected from 2008 to 2017. It includes a large number of weather-related variables; the screenshot shows only a few of them.
Several variables had missing values. Those with a very high percentage of missingness were dropped.
I calculated the percentage of missing values for each variable and found that some, such as Evaporation and Sunshine, have a very high percentage of missingness.
The screenshot below shows the percentage of missingness per variable. Of greatest concern were the underlined variables with high missingness: they need to be handled carefully before modelling, since they can introduce bias into the data.
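The per-variable missingness described above can be computed in one line with pandas. This is a minimal sketch on a tiny made-up frame; the column names follow the variables mentioned in this README, but the values are illustrative only and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real weather data (values are illustrative).
df = pd.DataFrame({
    "MinTemp": [12.1, 14.3, np.nan, 9.8],
    "Evaporation": [np.nan, np.nan, 4.2, np.nan],
    "Sunshine": [np.nan, 7.5, np.nan, np.nan],
    "RainTomorrow": ["No", "Yes", "No", "No"],
})

# Share of missing values per column, expressed as a percentage, highest first.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the most problematic variables at the top, which makes it easy to spot candidates for dropping.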
The high-missingness variables were dropped after checking their correlation with the target, RainTomorrow. Since all of them had a negative correlation with the target, I opted to delete them. The screenshot shows the correlation of these variables with the target.
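A rough sketch of that check is shown here, assuming the target is encoded as 0/1 before computing correlations. The list `high_missing` and the sample values are assumptions made for illustration; the README does not state the exact threshold or the full list of dropped columns.

```python
import numpy as np
import pandas as pd

# Made-up sample; only the column names come from the README.
df = pd.DataFrame({
    "Evaporation": [5.0, np.nan, 2.0, 6.5, np.nan, 1.5],
    "Sunshine": [9.0, 2.0, np.nan, 10.5, 1.0, np.nan],
    "Humidity3pm": [30.0, 80.0, 75.0, 25.0, 90.0, 85.0],
    "RainTomorrow": ["No", "Yes", "Yes", "No", "Yes", "Yes"],
})

# Encode the binary target numerically so a correlation can be computed.
target = df["RainTomorrow"].map({"No": 0, "Yes": 1})

# Hypothetical list of high-missingness columns (assumed for this example).
high_missing = ["Evaporation", "Sunshine"]

# Pearson correlation of each candidate column with the target (NaNs are skipped pairwise).
corr_with_target = df[high_missing].corrwith(target)
print(corr_with_target)

# Drop the candidate columns from the modelling frame.
df_reduced = df.drop(columns=high_missing)
```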
The remaining variables were imputed with the mean. I chose mean imputation because the remaining percentage of missingness was not too heavy; other imputation techniques, such as KNN and the iterative imputer, were also considered.
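Mean imputation of numeric columns might look like the sketch below, using scikit-learn's `SimpleImputer`. The frame and its values are made up for illustration, and the project itself may have used a different call (for example pandas `fillna`).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric columns with a few gaps.
df = pd.DataFrame({
    "MinTemp": [10.0, np.nan, 14.0],
    "Humidity3pm": [40.0, 60.0, np.nan],
})

# Replace each missing value with the mean of its own column.
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)
print(df_imputed)
```

The alternatives mentioned above are also available in `sklearn.impute` as `KNNImputer` and `IterativeImputer`; the latter is experimental and requires `from sklearn.experimental import enable_iterative_imputer` before it can be imported.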
Since we were aiming for classification modelling, we had to understand the distribution of the data to know whether it suits parametric classification models such as Logistic Regression. The screenshot shows that most of our data is not heavily skewed, so there is no need to transform the variables to achieve normality.
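Besides inspecting the plots, skewness can be quantified per column. The snippet below is a small sketch on made-up values; the cut-off of 1 for "heavily skewed" is a common rule of thumb assumed here, not something stated in this project.

```python
import pandas as pd

# Made-up values: one roughly symmetric column, one strongly right-skewed column.
df = pd.DataFrame({
    "Temp9am": [14.0, 16.0, 18.0, 20.0, 22.0],
    "Rainfall": [0.0, 0.0, 0.2, 1.0, 35.0],
})

# Sample skewness per numeric column (0 means symmetric).
skewness = df.skew()

# Flag columns whose absolute skewness exceeds the assumed rule-of-thumb threshold.
heavily_skewed = skewness[skewness.abs() > 1].index.tolist()
print(skewness)
print(heavily_skewed)
```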
I carried out an independent-samples Student's t-test. My aim was to find out whether there is a statistically significant difference in the means of the numerical variables between the two levels of RainTomorrow: Yes, meaning it will rain tomorrow, and No, meaning it will not. The screenshot below shows the output of the t-test. Most of the variables were statistically significant, so we then calculated the effect size. Although they are significant, their effects are small.
The numeric feature shaded yellow has the largest effect size, meaning it has the strongest relationship with the target variable.
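The test and effect size for a single feature could be sketched as follows. The README does not say which effect size measure was used, so Cohen's d with a pooled standard deviation is an assumption here, as are the helper name `cohens_d` and the sample values.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up sample for one numeric feature, split by the target.
df = pd.DataFrame({
    "Humidity3pm": [35.0, 42.0, 38.0, 45.0, 78.0, 85.0, 90.0, 81.0],
    "RainTomorrow": ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"],
})

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation (hypothetical helper)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rain = df.loc[df["RainTomorrow"] == "Yes", "Humidity3pm"]
dry = df.loc[df["RainTomorrow"] == "No", "Humidity3pm"]

# Independent-samples t-test comparing the two group means.
t_stat, p_value = stats.ttest_ind(rain, dry)

# Standardised mean difference between the groups.
d = cohens_d(rain, dry)
print(t_stat, p_value, d)
```

With a large dataset, almost any difference becomes statistically significant, which is why the effect size is worth reporting alongside the p-value.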
The first interaction we examined was the distribution of the numerical variables against our target variable, RainTomorrow. From the histograms, we identified patterns in key weather-related features that have a significant impact on predicting rainfall for the next day. These insights reveal:
- Humidity above 80% is a strong signal for rain tomorrow.
- High Cloud Cover at 9am and 3pm aligns with rainy days.
- Lower Temperatures increase the chance of rain (MinTemp and Temp9am).
- Most days have low rainfall (near 0), but extreme rainfall days align with "Yes" for rain. This skewed data could make predicting rainfall trickier unless handled properly.
- Wind-related features (moderate-to-high speeds) also correlate with rain events, making them useful predictors.
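A pattern such as the humidity signal in the list above can be checked numerically by comparing the share of rainy next days on either side of the threshold. This is a sketch on made-up values; the column name `Humidity3pm` is an assumption, since the list does not say which humidity reading was used.

```python
import numpy as np
import pandas as pd

# Made-up sample (illustrative values only).
df = pd.DataFrame({
    "Humidity3pm": [45.0, 60.0, 82.0, 88.0, 91.0, 70.0, 85.0, 55.0],
    "RainTomorrow": ["No", "No", "Yes", "Yes", "Yes", "No", "No", "Yes"],
})

# Label each day by whether humidity exceeds the 80% threshold.
humidity_band = np.where(df["Humidity3pm"] > 80, "above 80%", "80% or below")

# Proportion of days followed by rain within each band.
rain_rate = df["RainTomorrow"].eq("Yes").groupby(humidity_band).mean()
print(rain_rate)
```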
I computed the proportion of the target variable across the Location variable and found that rainfall patterns vary significantly across locations.
- Locations such as Cairns, Darwin, and MountGambier have a significantly higher proportion of rainy days (more orange bars). Meanwhile, places like AliceSprings, Mildura, and Richmond show fewer rainy days (dominantly blue bars).
- Some locations, like NorfolkIsland and Portland, show more balance between rainy and non-rainy days, indicating moderate rainfall frequency.
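The per-location proportions behind a chart like this can be derived with a row-normalised cross-tabulation. The sketch below uses made-up rows for two of the locations named above; the real proportions will differ.

```python
import pandas as pd

# Made-up rows for two locations (illustrative only).
df = pd.DataFrame({
    "Location": ["Darwin"] * 4 + ["AliceSprings"] * 4,
    "RainTomorrow": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes"],
})

# Each row sums to 1: the share of "No" and "Yes" days within a location.
rain_share = pd.crosstab(df["Location"], df["RainTomorrow"], normalize="index")
print(rain_share)
```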
I wanted to examine the interaction between rainfall and the pressure variables Pressure9am and Pressure3pm.
- As rainfall increases, pressure values drop.
- Pressure is an important predictor of weather patterns. Low pressure is typically associated with unstable weather, leading to rain.
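One simple way to quantify this relationship is the correlation between rainfall and each pressure reading. The values below are made up to illustrate the direction of the effect, and using Pearson correlation is an assumption; the original analysis was based on a plot.

```python
import pandas as pd

# Made-up sample in which pressure falls as rainfall rises.
df = pd.DataFrame({
    "Rainfall": [0.0, 0.2, 1.5, 8.0, 20.0, 35.0],
    "Pressure9am": [1022.0, 1020.0, 1017.0, 1012.0, 1008.0, 1003.0],
    "Pressure3pm": [1020.0, 1018.0, 1015.0, 1010.0, 1005.0, 1001.0],
})

# Pearson correlation of each pressure column with rainfall.
pressure_corr = df[["Pressure9am", "Pressure3pm"]].corrwith(df["Rainfall"])
print(pressure_corr)
```

A negative value for both columns is consistent with the observation that pressure drops as rainfall increases.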
Exploring the relationship between different years and the fluctuations in maximum and minimum temperatures provides valuable insights into climatic trends and patterns:
- Both Maximum Temperature (orange) and Minimum Temperature (blue) display a clear seasonal cycle over the 9-year period (2008-2017).
- Peaks occur annually, likely during summer (hot months).
- Troughs occur annually, likely during winter (cold months).
- Maximum temperatures occasionally exceed 45°C, indicating extreme heat events.
- Minimum temperatures drop below -5°C, suggesting occasional extreme cold events.
- The gap between maximum and minimum temperatures is wider during hotter months (peaks), especially from 2012 onward.
- There appears to be clustering of extreme maximum temperatures in certain years (e.g., 2011, 2013, and 2016).
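Yearly extremes like those listed above can be summarised by grouping on the year of each observation. This is a minimal sketch with made-up dates and temperatures; the column names `Date`, `MinTemp`, and `MaxTemp` are assumed to match the dataset.

```python
import pandas as pd

# Made-up observations across two years (illustrative values only).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2011-01-15", "2011-07-15", "2012-01-15", "2012-07-15"]),
    "MinTemp": [18.0, -2.0, 20.0, -5.5],
    "MaxTemp": [44.0, 15.0, 46.5, 13.0],
})

# Hottest maximum and coldest minimum recorded in each year.
yearly = df.groupby(df["Date"].dt.year).agg(
    hottest=("MaxTemp", "max"),
    coldest=("MinTemp", "min"),
)

# Spread between the yearly extremes.
yearly["spread"] = yearly["hottest"] - yearly["coldest"]
print(yearly)
```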