A collaborative project examining the likelihood of Covid-19 infection among vaccinated and unvaccinated people in the United States.
Topic: What is the likelihood of being infected by Covid-19? How is infection affected by factors such as vaccination rates, gender, and ethnicity?
Purpose: Analyzing infection trends during and after the Covid-19 pandemic helps us understand how prevalent infection is within the American population.
Data Source: We gathered data from reliable organizations such as Johns Hopkins University and the Centers for Disease Control and Prevention (CDC), which publish their findings as CSV files.
Questions to be answered: Are certain populations more likely to be infected than others? How do these factors affect one another? What other factors should be considered when identifying risks of infection?
Communication Protocols: Our group name is Endless Knot. Members exchange information on Slack and keep notes in Google Docs. Group meetings are held virtually on Zoom. We collaborate on code through GitHub, using our repository (CovidInfectionAnalysis), branches, commits, and pull requests.
• Data Wrangling: We checked data quality by sorting and filtering with Python. We located null values, handled missing data with dropna(), removed outliers, dropped several unnecessary columns, and converted strings to numbers.
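The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the sample DataFrame and its column names (state, cases, notes) are hypothetical stand-ins for the raw CSV files, and outlier removal is left out.

```python
import pandas as pd

# Hypothetical sample standing in for one of the raw CSV files;
# the column names and values are illustrative, not the project's schema.
raw = pd.DataFrame({
    "state": ["CA", "TX", "NY", None],
    "cases": ["1,200", "950", None, "400"],
    "notes": ["a", "b", "c", "d"],  # a column the analysis does not need
})

# Count null values per column to gauge data quality.
null_counts = raw.isnull().sum()

# Drop the unneeded column, then drop rows with any missing value.
clean = raw.drop(columns=["notes"]).dropna().copy()

# Convert numeric strings such as "1,200" to integers.
clean["cases"] = pd.to_numeric(
    clean["cases"].str.replace(",", "", regex=False)
).astype(int)

# Sort by case count, highest first.
clean = clean.sort_values("cases", ascending=False).reset_index(drop=True)
print(clean)
```

Dropping rows with dropna() is the simplest option; whether it is appropriate depends on how much data is lost in the process.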
We ended up working with three datasets from the following sources: Covid Data Tracker (CDC), Covid19 cases by State (Johns Hopkins University), and Genderscilab.
Using VLOOKUP, we converted state abbreviations to full names so that the datasets shared a common key. We identified the primary key, mapped the relationships in an Entity Relationship Diagram (ERD), and then merged and analyzed the datasets using SQL and pandas in Jupyter Notebook.
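For readers who prefer to stay in pandas, the abbreviation lookup and the merge might look like the sketch below. The lookup dictionary, the column names, and all numeric values are placeholders invented for illustration; they are not taken from the project's datasets.

```python
import pandas as pd

# Hypothetical lookup table playing the role of VLOOKUP; only three states shown.
STATE_NAMES = {"CA": "California", "TX": "Texas", "NY": "New York"}

# Placeholder tables: one keyed by abbreviation, one keyed by full name.
vaccinations = pd.DataFrame({
    "state_abbr": ["CA", "TX", "NY"],
    "pct_fully_vaccinated": [1.0, 2.0, 3.0],  # placeholder values
})
cases = pd.DataFrame({
    "state": ["California", "Texas", "New York"],
    "total_cases": [100, 200, 300],  # placeholder values
})

# Map abbreviations to full names so both tables share the same key.
vaccinations["state"] = vaccinations["state_abbr"].map(STATE_NAMES)

# Merge on the shared key; validate raises if the key is not unique on either side.
merged = cases.merge(vaccinations, on="state", how="inner", validate="one_to_one")
print(merged)
```

The validate argument is a cheap way to confirm that the chosen column really behaves like a primary key in both tables.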
• Preparing data for machine learning • Importing libraries: pandas, NumPy, seaborn, matplotlib, scikit-learn (sklearn.datasets, train_test_split, r2_score, mean_squared_error), and statsmodels (statsmodels.tsa.arima.model)
• Read the dataset • Activate the ML environment (mlev) in Jupyter Notebook
• Convert strings to numbers using pd.get_dummies • Split the data into training and testing sets • Scale the data using StandardScaler(): fit the scaler on the training data, then apply X_train_scaled = X_scaler.transform(X_train) and X_test_scaled = X_scaler.transform(X_test)
• We tried different machine learning algorithms. Since our data is labeled, we used supervised learning. We focused on regression models because we are predicting a continuous value.
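The preparation steps listed above could be chained together as in this sketch. The DataFrame, its columns (region, pct_vaccinated, cases), and the split settings are assumptions made for the example; only the X_scaler, X_train_scaled, and X_test_scaled names come from the notes above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data; columns and values are illustrative only.
df = pd.DataFrame({
    "region": ["west", "south", "west", "east", "south", "east", "west", "east"],
    "pct_vaccinated": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "cases": [10, 20, 30, 40, 50, 60, 70, 80],
})

# One-hot encode the string column so every feature is numeric.
X = pd.get_dummies(df.drop(columns=["cases"]), dtype=float)
y = df["cases"]

# Hold out 25% of the rows for testing (an assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Fit the scaler on the training data only, then transform both sets.
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
```

Fitting the scaler on the training set alone keeps information from the test set out of the model.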
We used several models:
- Ordinary Least Squares (OLS)
- Linear regression
- Support Vector Machine (SVM)
- ARIMA for time series

- OLS: the model can predict an output value with an acceptable error margin, based on a set of known input parameters.
- Linear regression: the coefficient of determination (R²) was 0.57037.
- SVM: a Support Vector Machine is a supervised learning model used for classification and regression problems. It can handle both linear and non-linear problems and works well for many practical problems.
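As a rough illustration of how the linear regression and SVM models can be compared on the same split, here is a sketch on synthetic data. The data, the RBF kernel choice, and the random seeds are assumptions, so the scores it prints are unrelated to the 0.57037 reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: a linear signal plus a little noise (not project data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features using statistics from the training set only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit each model and record R² and mean squared error on the test set.
models = {"linear": LinearRegression(), "svr": SVR(kernel="rbf")}
scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    scores[name] = (r2_score(y_test, pred), mean_squared_error(y_test, pred))

for name, (r2, mse) in scores.items():
    print(f"{name}: R2={r2:.3f}, MSE={mse:.4f}")
```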
ARIMA is a class of statistical models for analyzing and forecasting time series data. ARIMA stands for AutoRegressive Integrated Moving Average. It is a generalization of the simpler AutoRegressive Moving Average (ARMA) model and adds the notion of integration.
The model summary lists the fitted coefficient values as well as the skill of the fit on the in-sample observations. The model used is ARIMA(5, 1, 0): a lag order of 5 for the autoregression, first-order differencing, and no moving-average term.
Next, we plot the density of the residual errors. The distribution suggests the errors are Gaussian but may not be centered on zero: the residuals have a non-zero mean, which indicates a bias in the prediction.
A line plot shows the expected values (blue) compared to the rolling forecast predictions (red). We can see that the predicted values show some trend and are on the correct scale.
Plotly, an interactive visualization library, was used to visualize the Covid factors in this project. The two factors we wanted to showcase through maps were infections by gender and the total percentage of vaccinations. Two maps were created to show the states with the highest total infections among men and among women. California had the highest rate of infections for both men and women. Looking at the maps, men were more likely than women to get infected in Texas. In comparison, both genders are about equally likely to get infected in the states with the highest number of cases.
Another map visualizes the number of fully vaccinated people in each state. This lets us see which states have the most vaccinations and which have the fewest. California, Oregon, and Washington have a high percentage of vaccinations, while North Carolina has the lowest percentage.