Exoplanet-Habitability-Prediction

Uses the NASA Exoplanet Archive and PHL data to predict whether an exoplanet is habitable

Introduction

This project used data available in the NASA exoplanet and planetary systems to predict whether an exoplanet is habitable. This data is raw so we have used data-cleaning techniques and approximations to get it into a usable form. We have initially used planet masses, radius, stellar flux, surface temperature, orbital period, and the distances of each planet from its star to check whether it is in the Goldilocks zone. These parameters helped in determining whether the planet lies inside the Goldilocks zone or not. But we have also collected other available data to check whether those parameters could improve the accuracy of the model. Using different models on our cleaned data, we have performed a comparative study for performance and accuracy and have presented our conclusions in the form of a graph.

The main processes that we are covering as part of this project are as follows:

Data cleaning and transformation
Analysis and data visualization
Applying Principal Component Analysis (PCA) for feature reduction
Modeling different classification algorithms
Result comparison and reporting

Feature Selection

Our dataset contains features that are missing 40% of its values. Missing values can be handled by replacing them with the mean value of the column data. However, in the detection of exoplanets, this method doesn’t give great results for Mass and Radius, which are two main funda- mental properties. Since only for a relatively small fraction of exoplanets are they both available, we are computing the missing mass and radius value by using the following formulas of the Mass – Radius relation of Chen & Kipping (2017). After cleaning and preprocessing our dataset, we have 4970 data points. 57 are the observations of habitable planets, and 4913 belong to non-habitable planets.

Preliminary analysis of the data:

The number of stars is only related to the number of planets and with the rest of the features, it has a negative correlation (and vice versa for the number of planets).
The number of moons is an independent feature in our data set as it has no relation to any of the other features.
The orbital period has a strong correlation with stellar effective temperature.
Planet mass is showing a strong correlation with luminosity and temperature. This means the greater the mass, the hotter and brighter the exoplanet.
Planets having large radii have high luminosity.

PCA Loading Vector analysis of the data:

Number of Stars and Radius are highly correlated to each other.
Luminosity and Surface Gravity are inversely correlated with each other.

After looking at our PCA results, we are going forward with the following features: Number of Planets, Orbital Period, Planet Radius, Planet Mass, Stellar Effective Temperature, and Stellar Surface Gravity.

Model Building

For the classification task, we decided to go ahead with four classifiers:

K Nearest Neighbour(KNN)
Logistic Regression
Multi-Layer Perceptron Classifier (MLP)
Support Vector Machines (SVM)

Since our data is highly imbalanced (there are very few potentially habitable planets) we have used sampling techniques to resolve the class imbalance. In sampling, either data from the class which has larger values is removed - under-sampling or additional samples are added to the class with smaller values - oversampling. We have used different sampling techniques in our project to achieve the best results. The sampling techniques used are:

SMOTE
ADASYN
SMOTE- Tomek
ClusterCentroids

The data sets used include:

All the original features
The features we selected
Features in transformed space - Ur including all PCs
Features in transformed space - Ur including top 6 PCs
Features in transformed space - U including all PCs
Features in transformed space - U including top 6 PCs

Because we have used multiple classifiers with multiple sampling techniques and different data sets we have 96 rows of results with F1 score, precision, recall, AUC, and confusion matrices, all of which are attached along with our code and report in an excel file. But the most important metric is recall value here as we want to correctly get all potentially habitable exoplanets even if at the cost of false positives.

Conclusion

Comparing the datasets, models, and samplers, we can see that the ’U - All PCs’, ’PHL - All Features’, and ’Ur - All PCs’ datasets provide the most information, the KNN model consistently get high F1 and recall scores, and the ADASYN, SMOTE, and SMOTETomek resampling methods generate the best distribution for the models. In contrast, the Logistic Regression model performs the worst with 80% recall compared to all datasets and samplers providing good recall scores throughout. The best model in our test set turns out to be the KNN model trained on ADASYN sampled U - All PCs dataset with 0.98 F1-score, 96% precision, 100% recall, and 0.98 AUC. The model also trains very quickly as it only takes 0.28 seconds for the entire run. This is the model that we propose as the solution to predicting the habitability of an exoplanet given its features in the U matrix format.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
LICENSE		LICENSE
README.md		README.md
Report.pdf		Report.pdf
model_building.ipynb		model_building.ipynb
preprocessing.mlx		preprocessing.mlx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Exoplanet-Habitability-Prediction

Introduction

Feature Selection

Model Building

Conclusion

About

Releases

Packages

Languages

License

SauravSJK/Exoplanet-Habitability-Prediction

Folders and files

Latest commit

History

Repository files navigation

Exoplanet-Habitability-Prediction

Introduction

Feature Selection

Model Building

Conclusion

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages