Objectives

The goal of this post is to find jobs that are similar to each other, using public data only, focusing mainly on the tasks required for each job.

Data

We used data available on the INSEE website (the French national institute for statistics), where data about the ROME classification can be found for each type of job. It contains the following information (an illustrative peek at one of the files follows the list):

  • CODE ROME: an ID (one letter followed by four digits) used in France to classify jobs.
  • Libellé ROME: simply the name of the job.
  • Tasks related to the ROME code: we found another dataset linking each task to the related job (by CODE ROME).
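
As a quick illustration, here is how one of these files can be peeked at with pandas. The file name, separator and column names below are taken from the loading code later in this post; the exact layout of the download may differ, so treat this as a sketch rather than the definitive loading step.

import pandas as pd

# Illustrative peek at the ROME file used further below in create_data_set()
# (file name and columns as they appear in that code; adjust if your download differs)
jobs = pd.read_csv("rome-code-rome-definitions.csv", sep=";")
print(jobs[["Code ROME", "ROME Libellé"]].drop_duplicates().head())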

Method

Document similarity is computed mainly from the descriptions of the tasks related to each job. Therefore, when two jobs require similar skill sets and tasks, they end up close to each other.
The main steps are the following:

  1. Stop word removal using a dictionary
  2. Stemming of words
  3. Creation of a TF-IDF matrix
  4. Computation of cosine similarity between documents, using the TF-IDF matrix as input (a minimal sketch of steps 3 and 4 follows this list)
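
To make steps 3 and 4 concrete, here is a minimal, self-contained sketch on toy documents. It is not the pipeline built later in this post; it only illustrates that, since TfidfVectorizer L2-normalises each row by default, the product X * X.T directly yields the cosine similarities between documents.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents standing in for the job/task descriptions
docs = ["soigner des patients", "soigner des animaux", "reparer des machines"]

X = TfidfVectorizer().fit_transform(docs)  # step 3: TF-IDF matrix
sim = (X * X.T).toarray()                  # step 4: cosine similarity matrix

# Rows of X are L2-normalised, so X * X.T matches sklearn's cosine_similarity
assert np.allclose(sim, cosine_similarity(X))
print(np.round(sim, 2))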

For the final visualization, I reduced the number of jobs displayed (health sector only), but for a complete application I should probably add a widget letting the user filter what they want to see.

Library & constant definition
    
# -*- coding: utf-8 -*-
import pandas as pd
from nltk.stem import SnowballStemmer
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
import networkx as nx
from networkx.readwrite import json_graph
import json
import scipy
import numpy as np

PATH = "U:/D3.Js/Metiers_competences/"
PATH_DATA = PATH + "Data/"
PATH_PLOT = PATH + 'Plot/'
RANDOM_STATE = 42
THRESHOLD = 0.2  # minimum cosine similarity required to draw an edge in the graph

# French stop-word list (one word per line) and French stemmer
stop_words =\
    list(pd.read_csv(PATH_DATA + "stop_word.txt", header=None).iloc[:, 0])
stemmer = SnowballStemmer("french", ignore_stopwords=False)

    
  

Here we define the helper function as well as the custom transformers and estimators.

Function & class definition
    

def create_data_set(url_def, url_job, grand_domaine="INDUSTRIE"):
    """Load the two ROME files, keep one 'Grand Domaine' and merge them on code_rome."""
    definitions = pd.read_csv(url_def, sep=";")
    definitions = definitions.loc[definitions.loc[:, "Grand Domaine"] == grand_domaine, :]
    definitions = definitions.sort_values(by="Libellé")
    definitions = definitions.loc[:, ["Libellé", "code_rome"]].drop_duplicates()

    jobs = pd.read_csv(url_job, sep=";").rename(columns={"Code ROME": "code_rome"})
    jobs = jobs.loc[:, ["code_rome", "ROME Libellé"]].drop_duplicates()

    data = definitions.merge(jobs, how="left", on="code_rome")
    data.columns = ["jobname", "code_rome", "definition"]
    return data

class CleanText(BaseEstimator, TransformerMixin):
    """ Apply stemming in text column
    
    Parameters
    ----
    text_col: list of columns to apply stop_word deletion
    stemmer: Stelmer t ouse (from nltk)
    
    Attributes
    ----
    Return pandas dataframe with transformed text column:
        - stop word are deleted
        - stemming is applied
        - every word are converted to lowercase
    
    """
    def __init__(self, text_col=None, stemmer=None, stop_word=None):
        self.stemmer = stemmer
        self.text_col = text_col
        self.stop_word = stop_word
    def __clean_text__(self, x):
        return((' '.join(self.stemmer.stem(w.lower()) for w in x.split() 
            if w.lower() not in self.stop_word)))
    def fit(self, df, y=None, **fit_params):
        return self
    def transform(self, df, **transform_params):
        for col in self.text_col:
            df.loc[:, col] = df.loc[:, col].map(self.__clean_text__)
        return df
    
    
class CosineSimilarity(BaseEstimator, ClassifierMixin):
    """Estimator computing cosine similarity. Input is the TF-IDF matrix.

    TfidfVectorizer L2-normalises each row by default, so X * X.T directly
    gives the cosine similarity between documents.
    """
    def __init__(self, otherParam=None):
        self.otherParam = otherParam
    def fit(self, X, y=None):
        return self
    def predict(self, X, y=None):
        return X * X.T
        
class keeped_variable(BaseEstimator, TransformerMixin):
    """Select the columns to keep in the dataset.

    Also handy to check that training and test sets contain the same variables.
    """
    def __init__(self, variable_to_keep):
        self.variable_to_keep = variable_to_keep
    def transform(self, X):
        return X[self.variable_to_keep]
    def fit(self, X, y=None):
        return self
    
class PandasToSeries(BaseEstimator, TransformerMixin):
    """Extract a single column of a pandas.DataFrame as a pandas.Series."""
    def __init__(self, col_tokeep):
        self.col_tokeep = col_tokeep
    def fit(self, df, y=None):
        return self
    def transform(self, df, **transform_params):
        return df.loc[:, self.col_tokeep]
    
  

We can see that every step is wrapped in a custom transformer so that it can be plugged into a scikit-learn Pipeline. I think it's a really good way to keep the code clean and to reuse snippets of code across projects.
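
As an illustration of that reuse, here is a hedged sketch of how the same transformers could be dropped into a different pipeline, for instance with a plain bag-of-words representation instead of TF-IDF. This alternative pipeline is not used in this post; it only shows the plug-and-play aspect of the custom transformers defined above.

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical alternative pipeline reusing CleanText, keeped_variable and PandasToSeries
bow_pipe = Pipeline([
        ("clean_text", CleanText(["definition"], stemmer, stop_words)),
        ("select_variable", keeped_variable(["definition"])),
        ("to_series", PandasToSeries("definition")),
        ("bag_of_words", CountVectorizer())
        ])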

Again, this is just a sample of jobs for demonstration purposes: we kept only jobs related to the health sector for the visualization.

Main code
    

if __name__ == "__main__":
    ############################
    #    DOCUMENT SIMILARITY
    ############################
    data = create_data_set(PATH_DATA + "pole-emploi-rome-arborescence-principale.csv",
                           PATH_DATA + "rome-code-rome-definitions.csv",
                           "SANTE")
    # Concatenate all definitions of a given job into one document per job
    data = data.groupby('jobname')['definition'].apply(lambda x: ' '.join(x))\
               .reset_index()
    clean_text = CleanText(["definition"], stemmer, stop_words)
    
    pipe = Pipeline([
            ("clean_text", clean_text),
            ("select_variable", keeped_variable(["definition"])),
            ("to_numpy", PandasToSeries("definition")),
            ("tf_idf", TfidfVectorizer(max_df=0.9, min_df=0.1)),
            ("cosine", CosineSimilarity())
            ])
    pipe.fit(data)
    similarity = pipe.predict(data)  # sparse matrix of pairwise cosine similarities
    
    ############################
    #    CREATE GRAPH
    ############################
    # COO format makes it easy to iterate over (row, col, value) triplets
    cx = scipy.sparse.coo_matrix(similarity)

    G = nx.Graph()

    # One node per job, labelled with the job name
    for job_number, job in enumerate(data.jobname):
        G.add_node(job_number)
        G.nodes[job_number]['label'] = job

    # One edge per pair of distinct jobs whose similarity exceeds THRESHOLD
    for i, j, v in zip(cx.row, cx.col, cx.data):
        if v > THRESHOLD:
            if i != j:
                G.add_edge(int(i), int(j))
                G[int(i)][int(j)]['strength'] = v

    pos = nx.spring_layout(G)  # not exported: the D3 force simulation positions nodes client-side
        
    graph = json_graph.node_link_data(G)
    with open(PATH_PLOT + 'graph.json', 'w') as f:
        json.dump(graph, f, indent=4)
    
  

Final Result

And here is the final result! You can see that jobs are filtered so that we can represent them in a force-directed graph. The strength of the similarity is given by the size of the edge. We only draw a link if the proximity is above a fixed threshold. (A quick way to inspect the exported graph.json is sketched just after.)
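
Before wiring graph.json to the D3 page, the exported node-link JSON can be read back with networkx to check that labels and edge strengths survived the round trip. This is only an illustrative sanity check, reusing the PATH_PLOT constant defined earlier; it is not part of the visualization code itself.

import json
from networkx.readwrite import json_graph

# Reload the exported graph (PATH_PLOT as defined in the constants above)
with open(PATH_PLOT + 'graph.json') as f:
    G_check = json_graph.node_link_graph(json.load(f))

print(G_check.nodes(data=True))  # e.g. (0, {'label': <job name>})
print(G_check.edges(data=True))  # e.g. (0, 3, {'strength': <cosine similarity>})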

Result for Health-related jobs

Please scroll to see the visualization in full.