Job Skill Matcher: (CODE CHALLENGE)

This is a reposiory for a code challenge. The task to address are as follows:

Given a skill return a list of related job titles
Given a skill return a list of related skills

Assumptions:

Skills are mostly NOUN or forms of NOUN.
Skills can be identified as Entity in a given text (We use spacy for NER)
Given a description, the most common skill/skills (or prominent concepts) can be extracted using word embedding (eucledian) space.

Approach:

Identify POS tags for all the words and keep only NOUN and forms of NOUN

Identify Named Entity and keep only a set of entities by this rule:

 a) Select only if belonging to a predefined set.(predefined set selected based on visual inspection.)
                     
 b) If single word, its POS be NOUN or its other form
 c) if multiword, atleast one word must be a NOUN or other forms

Keep a counter of all the neighboring words using word embedding similarity of selected candidates of step 2
Select top k most frequent neighboring words signifying the common concept within the text (or found entity)

Evaluate a similarity score between all the top k most frequent words of step 5 and candidate entity based on following:

 a) if single word, directly evaluate the similarity with top k and add them
 b) if multiword, evaluate similarity for each word with top k and add them followed by division by number of words in the candidate word.

Sort the candidate list of step 2 based on score of step 5.
Associate a weight to each node based on predefined set(visual inspection)
Add the title as a node to a Graph
Add the sorted skills identified in step 6 as nodes.
Join edges between title and edges based on step 7
Update the edge weight if new weight evaluated for any other doc is more than current weight

Execute

To execute

python methods.py --filepath XXX/Challenge/data/job_descriptions.json/all_en_descriptions.json --modelpath XXX/Challenge/models/fasttext_model.bin --t skill --name ASP.NET --neighbor title --n 5 --graphpath /media/druv022/Data2/Challenge/data/graph_temp.pkl

Pros:

a) Simple yet effective.
b) Can use any pre-trained embedding (supported in the form of [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html))

Cons:

a) require speedup (dependency on spacy execution on CPU)
b) quality of data; requires better preprocessing
c) doesn't handle titles/skills that it has not seen; requires character based approach to start with.

Future ideas:

a) Use latest NER models (with ELMO/BERT embeddings)
b) Use graph based embeddings (node2vec for a start)
c) Learn weights of the edges based on the data rather than using predefined heuristics

As part of the coding challenge, another task was asked to write a proposal for a web crawler for job. Please find the proposal here: link

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
data		data
models		models
temp		temp
Proposal.pdf		Proposal.pdf
README.md		README.md
graph.py		graph.py
methods.py		methods.py
package-list.txt		package-list.txt
preprocess.py		preprocess.py
read.py		read.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Job Skill Matcher: (CODE CHALLENGE)

Assumptions:

Approach:

Execute

Pros:

Cons:

Future ideas:

About

Releases

Packages

Languages

druv022/JobSkillMatcher

Folders and files

Latest commit

History

Repository files navigation

Job Skill Matcher: (CODE CHALLENGE)

Assumptions:

Approach:

Execute

Pros:

Cons:

Future ideas:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages