Zacharia Schmitz
Joshua Click
October - November 2023
- Decide Source & Scope
- Acquire Job Postings
- Data Cleaning
- Text Preprocessing
- Feature Extraction
- Model Training
- Deliverable / Dashboard (Presenting Model-Validated Findings)
Deliver insights to potential data job applicants.
We developed a dynamic tool that empowers aspiring data analysts to navigate the job market with precision and confidence.
Our project uses 33,000 job postings that were scraped from Google and then processed with machine learning models to validate the results. We're not just sifting through text; we're decoding the jargon of job descriptions to help future data analysts on the hunt for their dream roles. We provide them with a competitive edge by revealing the skills, qualifications, and trends that matter most.
- Data scientists and engineers have a more technical skillset than analysts
- Data scientists and engineers typically earn more than analysts
- Data analyst roles are often not clearly defined and can include more advanced skills like machine learning
We originally intended to pull all of the data ourselves by web-scraping LinkedIn or another job resource.
We were able to get a LinkedIn scraper working, but after reading further, LinkedIn prohibits scraping and has been known to send cease-and-desist letters.
A meaningful analysis requires a fairly large dataset, and because most job postings do not include pay information, the need for volume is even greater.
- You'll need an API key from Google's developer site.
- You can run 100 search queries per day for free.
- At $5 per 1,000 additional queries, you can get up to 10,000.
- The downside of the API is that people often complain its results aren't true to what the actual searches return.
- We also could not confirm whether the API can be used for job-posting searches.
- While Google does not officially allow it, scraping the search engine results page (SERP) is another option.
- Google has very sophisticated detection technology when it comes to scraping its pages.
- Scraping at a rate above 8 keyword requests per hour risks detection.
- Pushing above 10 per hour will often get you blocked.
- Using multiple IPs raises the rate (100 IPs ≈ 1,000 requests per hour).
- There is also an open-source search engine scraper written in PHP that can manage proxies and other detection-avoidance methods.
- Many services offer to do the web scraping for you.
- The scraper used for the Kaggle dataset was SerpAPI (a usage sketch follows this list).
- SerpAPI's cost ranges from $50 for 5,000 searches per month up to $250 for 30,000 searches per month.
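Since the Kaggle dataset was built with SerpAPI, here is a minimal sketch of what such a pull might look like using SerpAPI's Python client (the google-search-results package). The query, location, and result keys below are assumptions based on SerpAPI's Google Jobs engine and should be checked against their current docs.

from serpapi import GoogleSearch  # pip install google-search-results

# Assumed parameters for SerpAPI's Google Jobs engine; the query and location
# are placeholders, and the API key must come from a SerpAPI account.
params = {
    "engine": "google_jobs",
    "q": "data analyst",
    "location": "United States",
    "api_key": "YOUR_SERPAPI_KEY",
}

search = GoogleSearch(params)
results = search.get_dict()

# Each result should carry fields similar to the dataset columns described below
for job in results.get("jobs_results", []):
    print(job.get("title"), "|", job.get("company_name"), "|", job.get("location"))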
| Field Name | Description |
|---|---|
| Unnamed: 0 | Appears to be an auto-incremented identifier. |
| index | Another identifier, possibly redundant with "Unnamed: 0". |
| title | Job title. |
| company_name | Name of the company offering the job. |
| location | Location of the job. |
| via | Source/platform where the job was posted. |
| description | Detailed description of the job. |
| extensions | Additional information about the job (e.g., job type, benefits). |
| job_id | A unique identifier for the job, possibly encoded. |
| thumbnail | URL to a thumbnail image associated with the job/company. |
| url | URL for the job posting. |
| company_description | Description of the company. |
| company_rating | Company's rating. |
| rating_count | Number of ratings the company received. |
| job_type | Type of the job (e.g., full-time, part-time). |
| benefits | List of benefits provided by the company. |
| posted | When the job was posted. |
| deadline | Application deadline for the job. |
| employment_type | Employment type (e.g., full-time, contract). |
| commute_time | Information on commute time, if available. |
| salary_pay | Salary payment value, if available. |
| salary_rate | Salary rate (e.g., per hour, per year), if available. |
| salary_avg | Average salary for the job, if available. |
| salary_min | Minimum salary for the job, if available. |
| salary_max | Maximum salary for the job, if available. |
| salary_hourly | Hourly salary, if available. |
| salary_yearly | Yearly salary, if available. |
| salary_standardized | Standardized salary information, if available. |
| description_tokens | List of skills extracted from the job description. |
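As a quick orientation on the fields above, here is a minimal sketch (assuming the CSV is saved as support_files/jobs.csv, as described in the reproduction steps below) for loading the raw postings and checking how sparse the salary columns are:

import pandas as pd

# Load the raw scraped postings; column names match the field table above
jobs = pd.read_csv("support_files/jobs.csv")

print(jobs.shape)  # number of postings and columns
# Share of missing values in the salary fields (most postings omit pay)
print(jobs[["salary_pay", "salary_rate", "salary_standardized"]].isna().mean())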
- Drop Columns
- Check for Duplicates
- Handling Missing Data
- Work From Home
- Feature Engineering - Standardizing Salary (several of these steps are sketched after this list)
- Standardize Location Column
- Date Formatting
- Standardize Job Title
- Job Description NLP Processing
- Creation of `description_cleaned`
- Define Keywords for `description_tokenized`
- Schedule Types
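Below is a hypothetical sketch of the kinds of transformations these steps describe. The column choices, the 40-hour/52-week annualization, the "anywhere" work-from-home flag, and the keyword list are illustrative assumptions, not the project's exact parameters.

import re
import pandas as pd

df = pd.read_csv("support_files/jobs.csv")

# Drop columns that carry no analytical value and remove duplicate postings
df = df.drop(columns=["Unnamed: 0", "index", "thumbnail"], errors="ignore")
df = df.drop_duplicates(subset="job_id")

# Flag work-from-home postings (Google Jobs lists remote roles as "Anywhere")
df["work_from_home"] = df["location"].str.contains("anywhere", case=False, na=False)

# Standardize salary to a yearly figure; hourly rates annualized at 40 hrs/week * 52 weeks
df["salary_standardized"] = df["salary_yearly"].fillna(df["salary_hourly"] * 40 * 52)

# Tokenize descriptions against a fixed keyword list (illustrative subset)
keywords = ["python", "sql", "tableau", "excel", "aws", "spark"]

def extract_tokens(text):
    if not isinstance(text, str):
        return []
    words = set(re.findall(r"[a-z+#]+", text.lower()))
    return [kw for kw in keywords if kw in words]

df["description_tokens"] = df["description"].apply(extract_tokens)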
- What companies have the most job postings? (a sample aggregation is sketched after this list)
- What is the location spread for our dataset?
- Within the Google Jobs search, which site has the most postings?
- What words are most common in data job descriptions?
- What are the overall top skills to learn for data jobs?
- Do a majority of employers allow work from home, or do they want you in the workplace?
- What skills are most prevalent in our postings for programming languages, machine learning methods, and tools?
- What time of year do we see most data jobs being posted?
- What are the most desirable skills?
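Continuing from the cleaning sketch above, a couple of these questions could be answered with simple aggregations; the column names are the same illustrative ones used earlier.

from collections import Counter
import matplotlib.pyplot as plt

# Which companies have the most job postings?
top_companies = df["company_name"].value_counts().head(10)
top_companies.plot(kind="barh", title="Companies with the most postings")
plt.tight_layout()
plt.show()

# What are the overall top skills? (description_tokens holds a list of skills per posting)
skill_counts = Counter(token for tokens in df["description_tokens"] for token in tokens)
print(skill_counts.most_common(10))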
Utilize GridSearchCV to find the best parameters for TF-IDF and LogisticRegression (the search-and-fit call is sketched below the parameter grid).
Unbalanced Dataset
# Imports needed to run the search
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Create a pipeline: TF-IDF features feeding a logistic regression classifier
pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
        ("logreg", LogisticRegression(max_iter=1000, random_state=321)),
    ]
)

# Parameter grid for GridSearchCV
# Note: penalty="l1" requires a solver such as "liblinear" or "saga";
# the default "lbfgs" solver only supports "l2".
param_grid = {
    "logreg__C": [5, 10, 20],
    "logreg__penalty": ["l1", "l2"],
    "tfidf__max_df": [250, 500, 750, 1000],
    "tfidf__max_features": [500, 750, 1000],
    "tfidf__min_df": [50, 100, 150],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}
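The pipeline and grid above are only the setup; a minimal sketch of the search-and-fit step might look like the following. The feature and target column names (`description_cleaned`, `title_cleaned`) are assumptions based on the preprocessing steps, and 2-fold cross-validation matches the reported scores.

from sklearn.model_selection import GridSearchCV, train_test_split

# Assumed feature/target split: cleaned job descriptions vs. standardized titles
X_train, X_test, y_train, y_test = train_test_split(
    df["description_cleaned"],
    df["title_cleaned"],
    test_size=0.2,
    random_state=321,
    stratify=df["title_cleaned"],
)

# 2-fold cross-validation to match the reported scores; n_jobs=-1 parallelizes the search
grid = GridSearchCV(pipeline, param_grid, cv=2, n_jobs=-1)
grid.fit(X_train, y_train)

print(grid.best_params_)
print("Mean CV accuracy on the training folds:", grid.best_score_)
print("Held-out test accuracy:", grid.score(X_test, y_test))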
Balanced Dataset

The same pipeline and parameter grid as above were reused, this time fit on the balanced (down-sampled) dataset.
Unbalanced Dataset (92% baseline)

# Best parameters found
tfidf = TfidfVectorizer(max_df=1000, max_features=1000, min_df=100, ngram_range=(1, 2))
logit = LogisticRegression(C=5, penalty="l2", random_state=321, max_iter=1000)

Train set (mean of 2 cross-validations): 96%
Test set (mean of 2 cross-validations): 90%

Balanced Dataset (33% baseline)

# Best parameters found
tfidf = TfidfVectorizer(max_df=1000, max_features=750, min_df=50, ngram_range=(1, 2))
logit = LogisticRegression(C=5, penalty="l2", random_state=321, max_iter=1000)

Train set (mean of 2 cross-validations): 97%
Test set (mean of 2 cross-validations): 87%
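One way the reported train/test means could be reproduced, using the `tfidf` and `logit` objects defined just above; the data split names are the same assumed ones as in the grid-search sketch.

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Rebuild the winning configuration as a pipeline
best_pipe = Pipeline([("tfidf", tfidf), ("logreg", logit)])

# Mean accuracy with 2 cross-validations on each split
print("train:", cross_val_score(best_pipe, X_train, y_train, cv=2).mean())
print("test:", cross_val_score(best_pipe, X_test, y_test, cv=2).mean())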
- Clone this repo
- Download the CSV into /support_files/
  - Name it "jobs.csv" for the unprepped .csv (prep takes about 10 minutes)
  - Name it "prepped_jobs.csv" for the prepped .csv
- Run the notebook
- We down-sampled our dataset in order to demonstrate an accurate model (this step is sketched after this item)
  - Another option, with more time, would be to collect more Data Scientist and Engineer positions
  - Data Analyst will always have more representation than the other two, simply because there are more Analyst positions
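A minimal sketch of the down-sampling step, assuming a standardized title column named `title_cleaned`:

# Balance the three titles by down-sampling each class to the size of the smallest one
min_count = df["title_cleaned"].value_counts().min()
balanced = df.groupby("title_cleaned").sample(n=min_count, random_state=321)

print(balanced["title_cleaned"].value_counts())  # all three classes are now equal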
- Our model is currently only used to show that the skills in data job postings differ between the three titles
- Further data validation could be performed by including trigrams and 4-grams, but that is very computationally expensive
- We decided to use only the 'Full-Time' positions to deal with outliers
  - Freelance jobs will often pay much more, but don't guarantee employment or include benefits
  - Freelance jobs are also not very applicable to entry-level applicants
- Trying to categorize by `sector` proved too inaccurate given the nature of the descriptions in the job posts
  - Being able to accurately categorize by sector would add value, but would take too much time for this scope
- The dataset had very few postings for engineer/scientist roles because the original search term was "Data Analyst"
  - Scraping for all 3 search terms would add insight for the under-represented categories
- `location` in the dataset was primarily from one geographic area and did not include positions from the entire U.S.
  - Although the search was for the entire United States, it appears to have been limited to a specific region
  - If this was due to IP address, coverage could be diversified by using a proxy
- `date_posted` provided insight that certain fiscal quarters have increased hiring
- We were able to distinguish the skills for each `title` represented in the dataset
  - This was validated by using a classification model to predict the `title`
- `salary` was only present in 18% of the job postings, which reflects a known issue for job searchers: many postings omit salary
- Presenting data with an interactive graph can allow users to answer their own questions
- Rather than scrolling through many static graphs, the findings could also be summed up in one interactive graph
Validation:
- Set up a validation framework to periodically test the model on new job scrapings from Google and ensure its predictions remain accurate over time (a sketch follows this list)
- If the model suddenly becomes inaccurate, this could represent a shift in the desired skills over time
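A sketch of what such a periodic check might look like, assuming the fitted pipeline is saved with joblib and a fresh scrape is available; the file names and drift threshold are hypothetical.

import joblib
import pandas as pd

# Hypothetical artifact and fresh-scrape file names
model = joblib.load("support_files/title_classifier.joblib")
new_jobs = pd.read_csv("support_files/new_scrape.csv")

accuracy = model.score(new_jobs["description_cleaned"], new_jobs["title_cleaned"])
print(f"Accuracy on newly scraped postings: {accuracy:.1%}")

# A sharp drop below the ~90% held-out accuracy could signal shifting skill demands
if accuracy < 0.80:
    print("Possible drift detected - consider retraining on recent postings")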
Continuously add data for continued insights:
- Expand to the entire United States, rather than the limited geographic region
- Continue scraping posts to potentially identify upward and downward trends in the desirability of certain skills