Document similarity to find jobs that are similar to each other + force-directed graph in D3 for visualization.
Objectives
The goal of this post is to find jobs that are similar to each other,
using public data only, focusing mainly on the tasks required for each job.
Data
We used data available on the INSEE (the French government agency for statistics) website.
The ROME data for each type of job is available on that website.
We can find the following information:
CODE ROME: an ID (one letter followed by four digits) used in France to classify jobs.
Libellé ROME: the name of the job.
Tasks related to ROME: we found another dataset that links tasks to the related job (by CODE ROME).
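As a rough illustration of how the two datasets fit together, here is a minimal sketch that checks the CODE ROME format and groups task descriptions into one text document per job. The function name `build_documents`, the `(code, task)` row layout, and the sample codes and tasks are all made up for the example; the real files have their own columns.

```python
import re
from collections import defaultdict

# Format described above: one uppercase letter followed by four digits.
ROME_CODE = re.compile(r"[A-Z]\d{4}")


def build_documents(task_rows):
    """Group task descriptions by ROME code into one text document per job."""
    tasks_by_code = defaultdict(list)
    for code, task in task_rows:
        if not ROME_CODE.fullmatch(code):
            raise ValueError(f"Unexpected ROME code: {code!r}")
        tasks_by_code[code].append(task)
    # One document per job: all of its task descriptions joined together.
    return {code: " ".join(tasks) for code, tasks in tasks_by_code.items()}


# Hypothetical rows; codes and tasks are placeholders, not real ROME entries.
rows = [
    ("X0001", "Monitor the patient"),
    ("X0001", "Give medication"),
    ("X0002", "Repair the engine"),
]
documents = build_documents(rows)
```

Each value of `documents` is then the text that represents a job in the similarity computation.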
Method
Document similarity is computed mainly from the descriptions of the tasks related to each job.
Therefore, when two jobs require similar skill sets and tasks, they will be close.
The main steps are the following:
Stop-word removal using a dictionary
Stemming of words
Creation of a TF-IDF matrix
Computation of a cosine similarity between documents (using TF-IDF matrix as input)
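To make the four steps above concrete, here is a small pure-Python sketch of the same chain on toy English sentences. It is not the code used for this post: the stop-word list, the `naive_stem` suffix stripper, and the smoothed IDF formula are assumptions chosen to keep the example short, and a real run would use a proper French stop-word dictionary and stemmer.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real run would use a full dictionary.
STOP_WORDS = {"the", "a", "of", "to", "and", "for"}


def naive_stem(word):
    """Toy stemmer (assumption): strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into words, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF (assumption): log((1 + N) / (1 + df)) + 1
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up task descriptions for three jobs.
docs = [
    "Monitoring the patients and giving medication",
    "Giving medication to patients",
    "Repairing engines and changing tires",
]
vectors = tfidf_vectors(docs)
similarity = [[cosine(u, v) for v in vectors] for u in vectors]
```

In this toy example the first two descriptions share most of their stemmed terms, so they score much higher with each other than with the third, which shares none.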
For the final visualization, I reduced the number of jobs shown (health sector only), but for a complete application
I should probably introduce a new widget to let users filter on what they want to see.
Here we define functions, custom transformers, and estimators.
Everything is wrapped in custom transformers so that it can be plugged into a scikit-learn Pipeline.
I think it's a really good way to keep code clean and to reuse snippets of code across many projects.
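For readers who have not written one before, a custom transformer in a scikit-learn Pipeline can look roughly like the sketch below. The `StopWordRemover` class, its stop-word list, and the sample sentences are hypothetical and are not the transformers used in this post; only `BaseEstimator`, `TransformerMixin`, `Pipeline`, `TfidfVectorizer`, and `cosine_similarity` come from scikit-learn.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import Pipeline


class StopWordRemover(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: lowercases text and drops stop words."""

    def __init__(self, stop_words=()):
        self.stop_words = stop_words

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        stop = set(self.stop_words)
        return [
            " ".join(w for w in re.findall(r"\w+", doc.lower()) if w not in stop)
            for doc in X
        ]


pipeline = Pipeline([
    ("stop_words", StopWordRemover(stop_words=("the", "to", "and"))),
    ("tfidf", TfidfVectorizer()),
])

# Made-up task descriptions for three jobs.
docs = [
    "Monitor the patient and give medication",
    "Give medication to the patient",
    "Repair the engine and change tires",
]
tfidf_matrix = pipeline.fit_transform(docs)
similarity = cosine_similarity(tfidf_matrix)
```

Because each step follows the `fit`/`transform` contract, a stemming step could be added as one more transformer without touching the rest of the pipeline.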
Again, this is just a sample of jobs for demonstration purposes. We kept only jobs related to the health sector
for the visualization.
Final Result
And here is the final result! Jobs are filtered so that we can represent them in a force-directed graph.
The strength of the similarity is given by the size of the edge.
We only show a link if the proximity is above a fixed threshold.
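One way to go from a similarity matrix to something a D3 force-directed graph can read is sketched below. The `similarity_to_graph` helper, the `nodes`/`links` layout with `source`, `target`, and `value` keys, the threshold of 0.5, and the sample numbers are assumptions for illustration, not the exact export used here.

```python
import json


def similarity_to_graph(labels, similarity, threshold):
    """Keep only pairs whose similarity is above the threshold."""
    nodes = [{"id": label} for label in labels]
    links = [
        {"source": labels[i], "target": labels[j], "value": similarity[i][j]}
        for i in range(len(labels))
        for j in range(i + 1, len(labels))  # each pair once, no self-links
        if similarity[i][j] > threshold
    ]
    return {"nodes": nodes, "links": links}


# Made-up labels and similarity values.
labels = ["Job A", "Job B", "Job C"]
similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
graph = similarity_to_graph(labels, similarity, threshold=0.5)
graph_json = json.dumps(graph, indent=2)
```

On the D3 side, the `value` field can then drive the edge size, so stronger similarities appear as thicker links.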
Result for Health-related jobs
Please scroll to see the visualization in full.