Ayanwoye Gideon Ayandele – [email protected]
- Project Description
- Web Scraping (`ycombinator_scraper.ipynb` / `ycombinator_scraper.py`)
- Data Wrangling and Exploration (`EDA_ycombinator.ipynb`)
- Analysis Summary
- Details of Charts
- References
The motivation for this project is to build a very basic end-to-end data engineering project: collecting (scraping), wrangling, cleaning, and analysing/visualizing information about the companies listed on https://ycombinator.com/companies.
The project's main objectives were:
- Perform web scraping
- Do data wrangling (gathering, assessing and cleaning) on the crawled data.
- Store, analyze, and visualize the wrangled data.
- Report on:
  - data wrangling efforts
  - data analysis and visualizations
The project was divided into two parts:
- Web Scraping (`ycombinator_scraper.ipynb` / `ycombinator_scraper.py`)
- Data Wrangling and Exploration (`EDA_ycombinator.ipynb`)
The dependencies and third-party libraries for the scraper include:
- Selenium
- BeautifulSoup
- requests
- numpy
- pandas
- Using the Selenium library (the listing page is dynamic), I scraped the following data for all 1000 companies listed on https://ycombinator.com/companies:
  - the company name
  - the company's ycombinator page URL
  - the company location
  - the company's short description (Description head)
- I then visited each scraped ycombinator page URL using the requests library, since those pages are static, and collected further information (the company's description, year founded, team size, company website URL, social media URLs, and management details) as it appears for each company.
- At the end, I created a CSV file in the following format:
Company_Name | Company_Page_URL | Company_Location | Description_Head | Website | Description | Founded | Team_Size | Linkedin_Profile | Twitter_Profile | Facebook_Profile | Crunchbase_Profile | Active_Founder1 | Active_Founder2 | Active_Founder3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Airbnb | https://www.ycombinator.com/companies/airbnb | San Francisco, CA, US, | Book accommodations around the world. | http://airbnb.com | Founded in August of 2008 and based in San Fra... | 2008 | 5000 | https://www.linkedin.com/company/airbnb/ | https://twitter.com/Airbnb | https://www.facebook.com/airbnb/ | https://www.crunchbase.com/organization/airbnb | Nathan Blecharczyk\nNone\nhttps://twitter.com/... | Brian Chesky\nNone\nhttps://twitter.com/bchesky\n | Joe Gebbia\nNone\nhttps://twitter.com/jgebbia\n, |
- The scraper runs for approximately 1.5 minutes with multithreading and approximately 7 minutes without it.
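The multithreaded step can be sketched roughly as follows. This is a minimal illustration using the standard library's `ThreadPoolExecutor`, not necessarily how `ycombinator_scraper.py` does it. The names `fetch_all` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real request-and-parse function so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch_page, max_workers=16):
    """Call fetch_page on every URL concurrently; results keep the input order.

    fetch_page is a hypothetical callable (url -> record). In a real run it
    would wrap an HTTP request plus HTML parsing.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as `urls`.
        return list(pool.map(fetch_page, urls))


def fake_fetch(url):
    # Placeholder for a network call: derive a slug from the URL instead.
    return {"Company_Page_URL": url, "slug": url.rstrip("/").rsplit("/", 1)[-1]}


urls = [
    "https://www.ycombinator.com/companies/airbnb",
    "https://www.ycombinator.com/companies/example-co",  # made-up URL
]
records = fetch_all(urls, fake_fetch, max_workers=4)
print([r["slug"] for r in records])
```

Threads suit this kind of work because each request spends most of its time waiting on the network, so many requests can overlap.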
The dependencies and third-party libraries for the EDA include:
- numpy
- pandas
- matplotlib
- seaborn
The data assessment and cleaning can be summarised as follows:
- Three company names (Nash, Atlas and Streak) each appeared twice, but the entries sharing a name differed in their other characteristics, so the duplicates were left in place.
- Missing data were represented with NaN and were neither imputed nor removed, since they correspond to characteristics that do not apply to the particular company.
- A new variable, `Country_Of_Origin`, was extracted from the `Company_Location` column, and another variable, `Number_Of_Founders`, was derived from `Active_Founder1` through `Active_Founder6`.
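One way the two derived variables could be computed is sketched below in plain Python. The helper names (`is_missing`, `country_of_origin`, `number_of_founders`) are hypothetical. The sketch also assumes the country is the last non-empty comma-separated part of `Company_Location`, which matches the Airbnb sample row but may not hold for every record.

```python
import math


def is_missing(value):
    """True for None, float NaN (how pandas marks gaps), and blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def country_of_origin(location):
    """Return the last non-empty comma-separated part of a location string."""
    if is_missing(location):
        return None
    parts = [p.strip() for p in location.split(",") if p.strip()]
    return parts[-1] if parts else None


def number_of_founders(row, max_founders=6):
    """Count the non-missing Active_Founder1..Active_Founder6 fields."""
    keys = [f"Active_Founder{i}" for i in range(1, max_founders + 1)]
    return sum(not is_missing(row.get(k)) for k in keys)


row = {
    "Company_Location": "San Francisco, CA, US,",
    "Active_Founder1": "Nathan Blecharczyk",
    "Active_Founder2": "Brian Chesky",
    "Active_Founder3": "Joe Gebbia",
    "Active_Founder4": float("nan"),
}
print(country_of_origin(row["Company_Location"]))  # US
print(number_of_founders(row))  # 3
```

With pandas, the same helpers could be applied as `df["Company_Location"].apply(country_of_origin)` and `df.apply(number_of_founders, axis=1)`.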
Using both univariate and bivariate analysis:
- The most represented country is the USA, which accounts for 654 of the 1000 companies. It is followed by India, Canada, the UK, Nigeria and Indonesia.
- The more recently a company was founded, the more likely it is to be funded/listed by ycombinator.
- The team size distribution is highly right-skewed, with a tail so long that the plot was very difficult to read. I resorted to bins of width 100 and set the plot's x-axis limit to 3000. Most team sizes are between 2 and 4.
- The most common number of founders is 2, followed by 1 and 3.
- There is no interesting relationship between country of origin and team size, number of founders and year founded.
- There is a weak, negative linear correlation between Number_Of_Founder and team size.
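The binning used for the team size plot (bins of width 100, x-axis capped at 3000) can be illustrated with a small standard-library sketch. The function `bin_counts` and the sample values are made up for illustration and are not taken from the notebook.

```python
from collections import Counter


def bin_counts(sizes, bin_width=100, upper_limit=3000):
    """Count values per fixed-width bin, skipping values outside [0, upper_limit)."""
    counts = Counter()
    for size in sizes:
        # Skip missing values (None or NaN) and anything outside the plotted range.
        if size is None or size != size or size < 0 or size >= upper_limit:
            continue
        start = int(size // bin_width) * bin_width  # left edge of the bin
        counts[start] += 1
    return dict(sorted(counts.items()))


team_sizes = [2, 3, 4, 4, 25, 150, 180, 5000]  # made-up values
print(bin_counts(team_sizes))  # {0: 5, 100: 2}
```

In matplotlib, a comparable plot could be drawn with `plt.hist(team_sizes, bins=range(0, 3100, 100))` followed by `plt.xlim(0, 3000)`. The exact plotting call used in `EDA_ycombinator.ipynb` may differ.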
- Most represented country (Country_Of_Origin) on ycombinator: the USA is the most represented country, accounting for 654 of the 1000 companies. It is followed by India, Canada, the UK, Nigeria and Indonesia.
- The distribution of the year founded: the more recently a company was founded, the more likely it is to be funded/listed by ycombinator.
- The distribution of team size: the distribution is highly right-skewed, with a tail so long that the plot was very difficult to read. I resorted to bins of width 100 and set the plot's x-axis limit to 3000. Most team sizes are between 2 and 4.
- The distribution of Number_Of_Founder: the most common number of founders is 2, followed by 1 and 3.

There is no interesting relationship between country of origin and team size, number of founders and year founded. There is also a weak, negative linear correlation between Number_Of_Founder and team size.
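For readers unfamiliar with the term, a linear correlation of this kind is usually measured with Pearson's r. The sketch below computes it by hand on made-up numbers. The function name `pearson_r` and the sample data are illustrative assumptions, not values from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


founders = [1, 2, 2, 3, 4]        # made-up Number_Of_Founder values
team_size = [40, 30, 35, 10, 12]  # made-up team sizes
r = pearson_r(founders, team_size)
print(round(r, 3))
```

A value near 0 indicates a weak linear relationship, and a negative sign means one variable tends to fall as the other rises. With pandas, `df["Number_Of_Founder"].corr(df["Team_Size"])` would give the same statistic, assuming both columns are numeric.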