Author: GAURAB KUNDU
This project is the Applied Data Science Capstone project of the IBM Data Science Professional Certificate.
In this project, we predict whether the Falcon 9 first stage will land successfully. SpaceX advertises Falcon 9 rocket launches on its website at a cost of 62 million dollars; other providers cost upward of 165 million dollars each. Much of the savings is because SpaceX can reuse the first stage. Therefore, if we can determine whether the first stage will land, we can determine the cost of a launch. This information can be used if an alternate company wants to bid against SpaceX for a rocket launch.
This project utilized various data collection methods, including web scraping and API calls, to analyze SpaceX launch data. Through Exploratory Data Analysis (EDA) and machine learning, the goal was to develop a classification model that predicts whether the Falcon 9 first stage will successfully land.
The results showed that decision trees were the best model for predicting the success of the landings.
In the era of commercial space travel, reducing costs is crucial. SpaceX advertises its Falcon 9 rocket launches at a fraction of the cost of other providers, largely due to the reuse of the first stage. By predicting the success of these landings, companies can optimize their bids and potentially compete with SpaceX.
The project analyzes SpaceX’s launch data to explore how reusable rockets minimize launch costs and how these insights can be applied by other companies.
The key problem is predicting the landing success of the Falcon 9 first stage based on specific flight features.
- Can we predict the success of the Falcon 9 first-stage landing?
- What features influence the success of these landings?
The project utilizes IBM Cloud Pak for Data, requiring only a stable internet connection.
- IBM Watson Studio
- For local execution:
- Programming Language: Python
- IDE: Jupyter Notebook
- Libraries/Packages: pandas, NumPy, SciPy, scikit-learn, Matplotlib, BeautifulSoup
The data set used in the project is the SpaceX Launch data.
The data set contains information about different flights including date, launch site, booster version and more.
The data is collected by two main approaches:
- The SpaceX API
- Web scraping
We collect launch data from the SpaceX API. First, we request launch data using a GET request (requests.get) and create a pandas dataframe from the response. After that, we make several sub-requests to get more detailed and consistent information about the IDs stored in the dataframe.
With the help of some helper functions, we save the responses into a dictionary, and then we transform it into a dataframe, which is our data set.
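The collection flow described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the endpoint URL is assumed to be the public SpaceX v4 "past launches" route, and the helper names (`fetch_json`, `launches_to_frame`, `collect_rocket_names`) as well as the `rocket` and `name` fields are assumptions made for the example.

```python
import pandas as pd
import requests

# Assumed endpoint: the public SpaceX REST API (v4) route for past launches.
SPACEX_PAST_LAUNCHES_URL = "https://api.spacexdata.com/v4/launches/past"


def fetch_json(url, timeout=30):
    """GET a URL and return the decoded JSON body, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()


def launches_to_frame(launches):
    """Flatten a list of launch records (dicts) into a pandas DataFrame."""
    return pd.json_normalize(launches)


def collect_rocket_names(frame, lookup):
    """Resolve each rocket ID in the frame to a name.

    `lookup` is any callable that takes an ID and returns a dict with a
    "name" key (for example, a wrapper around fetch_json). It is called
    once per unique ID, which keeps the number of sub-requests small.
    """
    names = {rocket_id: lookup(rocket_id)["name"]
             for rocket_id in frame["rocket"].unique()}
    return frame["rocket"].map(names)
```

In a live run, `launches_to_frame(fetch_json(SPACEX_PAST_LAUNCHES_URL))` would produce the initial dataframe, and a lookup function wrapping `fetch_json` would play the role of the sub-requests.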
To see the code and step by step process of Data Collection using SpaceX API CLICK HERE
We perform web scraping to collect Falcon 9 historical launch records from a Wikipedia page. First, we send an HTTP GET request (using requests.get) for the Falcon 9 launch HTML page and receive it as an HTTP response. Then we create a BeautifulSoup object from the HTML response, extract the column names from it, and use them as dictionary keys.
We parse the HTML tables and fill the dictionary keys with launch records from table rows, and finally we transform it into a dataframe.
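As a rough sketch of the parsing step, the snippet below turns one HTML table into a dataframe by using the header cells as dictionary keys. The function name `table_to_frame` is hypothetical, and the logic assumes a simple table whose data rows contain only `<td>` cells; the real Wikipedia tables are messier and need extra cleaning.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_frame(html):
    """Parse the first HTML table into a DataFrame, using <th> cells as column names."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")

    # Column names come from the header cells and become the dictionary keys.
    columns = [th.get_text(strip=True) for th in table.find_all("th")]
    records = {name: [] for name in columns}

    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != len(columns):
            continue  # skip the header row and any malformed rows
        for name, cell in zip(columns, cells):
            records[name].append(cell.get_text(strip=True))

    return pd.DataFrame(records)
```

The HTML passed in would typically be the `.text` of the response returned by `requests.get`.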
To see the code and step by step process of Data Collection with Web Scraping CLICK HERE
Exploratory data analysis is an important step while preprocessing data. It helps find patterns in the data and determine what the label for training supervised models should be.
This process was done in the following order:
- Identify the data types of the columns.
- Determine the number of values for each attribute.
- Calculate the percentage of missing values.
- To determine the label, apply zero/one encoding to the “Outcome” column to classify each landing as either 1 (success) or 0 (failure).
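The wrangling steps above might look like this in pandas. The rows are made up for illustration, and the rule that an outcome starting with "True" counts as a successful landing is an assumption about how the “Outcome” values are encoded, not something stated in this write-up.

```python
import pandas as pd

# Hypothetical sample rows; the real data set has many more rows and columns.
df = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC 40", "CCAFS SLC 40", "KSC LC 39A", "VAFB SLC 4E"],
    "PayloadMass": [500.0, None, 3000.0, 9600.0],
    "Outcome": ["True ASDS", "False ASDS", "True RTLS", "None None"],
})

# 1. Data types of the columns.
column_types = df.dtypes

# 2. Number of occurrences of each value for an attribute.
site_counts = df["LaunchSite"].value_counts()

# 3. Percentage of missing values per column.
missing_pct = df.isnull().mean() * 100

# 4. Label: 1 for a successful landing, 0 otherwise.
#    Assumption: outcomes that start with "True" are successes.
df["Class"] = df["Outcome"].str.startswith("True").astype(int)
```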
To see the code and step by step process of Data Wrangling CLICK HERE
In order to better understand the datasets, we ran the following SQL queries:
- Display the names of the unique launch sites in the space mission.
- Display 5 records where launch sites begin with the string 'CCA'.
- Display the total payload mass carried by boosters launched by NASA (CRS).
- Display the average payload mass carried by booster version F9 v1.1.
- List the date when the first successful landing outcome on a ground pad was achieved.
- List the names of the boosters which succeeded on a drone ship and have a payload mass greater than 4000 but less than 6000.
- List the total number of successful and failed mission outcomes.
- List the names of the booster versions which have carried the maximum payload mass. Use a subquery.
- List the failed landing outcomes on a drone ship, their booster versions, and launch site names for the year 2015.
- Rank the count of landing outcomes (such as Failure (drone ship) or Success (ground pad)) between the dates 2010-06-04 and 2017-03-20, in descending order.
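To give a feel for these queries, here is a small sketch using Python's built-in `sqlite3` module against an in-memory table. The table name, column names, and rows are all hypothetical stand-ins rather than the project's actual schema or data; only three of the queries are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE launches ("
    "launch_site TEXT, booster_version TEXT, payload_mass_kg REAL)"
)
# Illustrative rows only; booster names are placeholders.
conn.executemany(
    "INSERT INTO launches VALUES (?, ?, ?)",
    [
        ("CCAFS LC-40", "booster-A", 0),
        ("CCAFS LC-40", "booster-B", 2500),
        ("KSC LC-39A", "booster-C", 15600),
        ("VAFB SLC-4E", "booster-D", 15600),
    ],
)

# Names of the unique launch sites.
unique_sites = [row[0] for row in conn.execute(
    "SELECT DISTINCT launch_site FROM launches ORDER BY launch_site")]

# Up to 5 records where the launch site begins with 'CCA'.
cca_rows = conn.execute(
    "SELECT * FROM launches WHERE launch_site LIKE 'CCA%' LIMIT 5").fetchall()

# Booster versions that carried the maximum payload mass, using a subquery.
max_payload_boosters = [row[0] for row in conn.execute(
    "SELECT booster_version FROM launches "
    "WHERE payload_mass_kg = (SELECT MAX(payload_mass_kg) FROM launches) "
    "ORDER BY booster_version")]
```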
To see the code and step by step process of EDA With SQL CLICK HERE
To understand the relations between different features, we visualize the data with scatter plots, bar charts, and line charts. This helps find hidden patterns in the data and gain insights about the dataset. The following plots were made:
- Payload mass against the flight number.
- Launch site against the flight number.
- Launch site against the payload mass.
- Orbit type against class success rate.
- Flight number against orbit type.
- Orbit type against the payload mass.
- Launch success yearly trend.
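The quantities behind the bar chart and the line chart are simple group averages of the label. A minimal sketch, with made-up rows and assumed column names (`Orbit`, `Year`, `Class`):

```python
import pandas as pd

# Hypothetical rows; Class is 1 for a successful landing and 0 otherwise.
df = pd.DataFrame({
    "Orbit": ["GTO", "GTO", "LEO", "LEO", "SSO"],
    "Year": [2014, 2019, 2014, 2019, 2019],
    "Class": [0, 1, 0, 1, 1],
})

# Mean of a 0/1 label within each group is that group's success rate.
success_by_orbit = df.groupby("Orbit")["Class"].mean()
success_by_year = df.groupby("Year")["Class"].mean()
```

Each resulting series can then be passed to a plotting library to draw the bar chart (per orbit) or the line chart (per year).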
To see the code and step by step process of EDA with Data Visualization CLICK HERE
Here, we complete the interactive visual analytics using Folium.
First, we create a Folium map object with an initial center location around NASA Johnson Space Center in Houston, Texas.
We add a circle and a marker to the map for each launch site in the dataset (a folium circle and a folium marker). With the launch sites marked, we can see which ones are close to the equator or to a coastline.
To mark successful and failed launches, we create a marker on the map for each launch record in the dataset: a green marker indicates a successful launch and a red one indicates a failure.
To explore the proximities of the launch sites, we calculate the distance between a launch site and its proximities and then draw a polyline between them.
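One common way to compute the distance between two latitude/longitude points is the haversine formula. The sketch below is an assumption about how the distance could be calculated, not necessarily the exact method used in the notebook; the Earth radius value is the usual mean-radius approximation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

The two coordinate pairs used for the distance can then serve as the endpoints of the line drawn on the map, for example with `folium.PolyLine`.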
To see the code and step by step process of Build an Interactive Map with Folium CLICK HERE
Now that we have finished the exploratory analysis, the next step is to determine the training labels and build a predictor using machine learning algorithms. Using the ‘Class’ column as the label, the first thing to do is normalize the data. We then split the normalized data into training and test sets. The training data is further divided to provide validation data, a second set used during training.
For the model development phase, we use the following algorithms:
- Logistic regression
- Support vector machine
- Decision trees
- K nearest neighbor
We build a grid search object for each of the algorithms and fit it to find the best parameters of the model (hyperparameter tuning), then we choose the most accurate model.
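The tuning loop can be sketched with scikit-learn as below. This is an illustration under several assumptions: the data is synthetic (`make_classification`), the parameter grids are small placeholders rather than the grids used in the project, and the scaler is placed inside a pipeline so that it is refit on each cross-validation fold, which differs slightly from normalizing the whole data set before splitting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the launch features (X) and the 'Class' label (y).
X, y = make_classification(n_samples=90, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Placeholder grids; keys follow make_pipeline's lowercase step names.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000),
               {"logisticregression__C": [0.01, 0.1, 1]}),
    "svm": (SVC(),
            {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}),
    "tree": (DecisionTreeClassifier(random_state=0),
             {"decisiontreeclassifier__max_depth": [2, 4, 6]}),
    "knn": (KNeighborsClassifier(),
            {"kneighborsclassifier__n_neighbors": [3, 5, 7]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Cross-validation on the training data provides the validation scores.
    search = GridSearchCV(make_pipeline(StandardScaler(), estimator), grid, cv=5)
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,
        "test_accuracy": search.score(X_test, y_test),
    }

# Pick the model with the highest held-out (test) accuracy.
best_name = max(results, key=lambda n: results[n]["test_accuracy"])
```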
To see the code and step by step process of Predictive Analysis (Classification) CLICK HERE
The success rate increased noticeably from 2013 onward.
Launch site and the orbit type are the features with the largest effect on the outcome.
KNN and SVM models have a validation set accuracy of 83% and an out of sample accuracy of 77%.
Scatter plot of Flight Number vs. Launch Site
Flights numbered 0 to 20 and 40 to 90 launched mostly from site CCAFS SLC 40.
Flights numbered 21 to 39 launched mostly from site KSC LC 39A.
Scatter plot of Payload vs. Launch Site
Most data points fall in the payload mass range of 0 to 8000 kg.
When the payload is around 15000 kg, the landing looks more likely to succeed.
Show a bar chart for the success rate of each orbit type
Orbit types ES-L1, GEO, HEO, and SSO have the highest success rate, which is 1, meaning these launches always succeeded.
Orbit type GTO has the lowest success rate, which is 0.5.
Show a scatter plot of Flight number vs. Orbit type
Orbit types LEO, ISS, PO, and GTO have most of their data points in the flight number range 0 to 60.
Show a scatter plot of payload vs. orbit type
Orbit type VLEO, which has a high success rate, also carries heavy payloads.
There is a possibility that the heavier the payload, the higher the probability of success.
Show a line chart of yearly average success rate
2019 is the year that has the highest success rate.
2010, 2012, and 2014 are the years with the lowest success rate.
Names of the unique launch sites
There are 4 unique launch sites, so the data contains 4 distinct site names.
5 records where launch sites begin with 'CCA'
All 5 records shown for launch sites beginning with CCA have a successful mission outcome.
4 of these launches were sponsored by NASA.
Calculate the total payload carried by boosters from NASA
The total payload carried by boosters from NASA is 45.596 kg.
Calculate the average payload mass carried by booster version F9 v1.1
The average payload mass shown is 2.928 kg, for 90 payload mass records with a total payload of 45.596 kg.
Find the dates of the first successful landing outcome on ground pad
The first successful landing outcome on a ground pad was on 2015-12-22.
List the names of boosters which have successfully landed on drone ship and had payload mass greater than 4000 but less than 6000
There are 4 booster versions that successfully landed on a drone ship and had a payload mass between 4000 and 6000.
Calculate the total number of successful and failure mission outcomes
There are 100 successful mission outcomes and 1 failure in flight.
List the names of the booster which have carried the maximum payload mass
There are 12 booster versions that carried the maximum payload mass.
List the failed landing outcomes in drone ship, their booster versions, and launch site names for the year 2015
The failed drone ship landings in 2015 all occurred at the CCAFS LC 40 launch site.
2 booster versions were involved in these failed drone ship landings.
Rank the count of landing outcomes (such as Failure (drone ship) or Success (ground pad)) between the dates 2010-06-04 and 2017-03-20, in descending order
There are 9 landing outcomes of Success (ground pad) and 5 landing outcomes of Failure (drone ship).
The markers on this map show the launch site locations.
A green marker represents a successful landing outcome, while a red one represents failure.
The blue line represents the distance between the launch site and the closest coastline.
To see the code for making the Dashboard using Plotly Dash CLICK HERE
Showing the screenshot of launch success count for all sites, in a pie chart.
Showing the screenshot of option
Showing the screenshot of the pie chart for the launch site with the highest launch success ratio.
KSC LC-39A has the largest share of successful launches, at 41.7%.
These two graphs represent the confusion matrices for the SVM and KNN models.
These confusion matrices show the largest true positive and true negative values, as well as the fewest false positives and false negatives.
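For readers unfamiliar with the layout, a confusion matrix for a binary label can be computed with scikit-learn as in the sketch below. The label vectors are hypothetical and are not the project's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = landed, 0 = did not land).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]

# With labels=[0, 1], rows are true classes and columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()
```

Here one failed landing is predicted as a success, so it appears as a false positive.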
Not all the data is important: the collected data may contain irrelevant columns, and it is normal to drop them. Visualizing data is a good way of determining which features have the strongest effect. SQL queries provide wider scope to explore datasets in comparison with traditional EDA. SVM and KNN models are the most reliable since they have the highest out-of-sample accuracy and F1-score.