Making Amazon Hiring AI Unbiased

Why make an algo?

The news of gender bias in Amazon’s hiring algorithm is all over the internet, and it has opened a new thread on the interpretability of machine learning models. Let me give you the backdrop. Amazon has a headcount of at least 575,700. If the average tenure of an employee is 3 years, they need to hire (191,900 + increase in headcount) people every year. If 1 hire is made out of every 5 candidates interviewed and 1 candidate is interviewed out of every 3 resumes, they need to check 191,900 × 3 × 5 = 2,878,500 resumes every year even if the headcount stays the same. These ratios, 3 and 5, will vary across profiles, as delivery staff are easier to hire than engineers, but let us not complicate the calculation unnecessarily. The point is to gauge how big this number can be and how much effort and resources go into it.
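The back-of-the-envelope estimate above can be reproduced in a few lines of Python. The variable names are mine, and the ratios are the same rough assumptions used in the text, not Amazon's real figures.

```python
# Rough estimate of resumes screened per year, using the assumptions in the text.
headcount = 575_700            # employees
avg_tenure_years = 3           # assumed average tenure
interviews_per_hire = 5        # assumed: 1 hire per 5 interviews
resumes_per_interview = 3      # assumed: 1 interview per 3 resumes

# Replacement hiring only; headcount growth is ignored here.
hires_per_year = headcount // avg_tenure_years
resumes_per_year = hires_per_year * interviews_per_hire * resumes_per_interview

print(hires_per_year, resumes_per_year)
```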

If you have ever done interviewing, you will agree how boring resume filtering can be, especially if it’s your job. It’s repetitive, pattern-based work, something AI is very good at. Hence it makes complete sense for an innovative giant like Amazon to study its own hiring practices and replicate them with algorithms. Since resumes and job descriptions are text data, we need to leverage NLP (Natural Language Processing).

If I had to build the algorithm myself, I would use this pipeline, and it’s probably what Amazon did too.

  • Pre-process resume text
  • Vectorise the text with TF-IDF or BM25
  • Train a supervised classifier for very repetitive entry-level profiles in logistics and engineering. We can also do it for non-entry-level profiles if there is enough data. The classifier can be anything such as Naive Bayes, Random Forest or a deep learning sequence model, and the two classes are selected and not-selected
  • Predict the probability of new resumes being selected
  • Keep only resumes whose predicted probability exceeds a cutoff such as 0.8
  • Select top x profiles by probability for the interview where x depends on the number of candidates we want to hire and past conversion ratios
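The last two steps of the pipeline, the cutoff filter and the top-x selection, can be sketched as below. This is a minimal illustration that assumes a classifier has already produced a selection probability for each resume. The function name `shortlist`, the resume ids and the scores are all made up for the example.

```python
def shortlist(scored_resumes, cutoff=0.8, top_x=2):
    """Keep resumes scoring strictly above the cutoff, then take the best top_x.

    scored_resumes: list of (resume_id, predicted_probability) pairs.
    """
    above_cutoff = [(rid, p) for rid, p in scored_resumes if p > cutoff]
    above_cutoff.sort(key=lambda item: item[1], reverse=True)
    return above_cutoff[:top_x]


# Hypothetical classifier output for five resumes.
scores = [("r1", 0.91), ("r2", 0.55), ("r3", 0.85), ("r4", 0.97), ("r5", 0.80)]
picked = shortlist(scores, cutoff=0.8, top_x=2)
print(picked)
```

In practice top_x would be derived from the number of open positions and the past interview-to-hire ratio, rather than hard-coded.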

Another approach is to do a similarity match between resumes and the job description (JD) with Lucene/Elasticsearch and select the top k results above a cutoff similarity score. The top results only ensure a match with the JD, not how fit the candidates are for the role, so this approach is not very appropriate.
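To make the similarity approach concrete, here is a small stand-in that uses plain bag-of-words cosine similarity in pure Python. It is not Lucene's or Elasticsearch's actual scoring, and the helper names, the sample JD and the resumes are invented for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    # Lowercase word tokens; a deliberately simple tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between the raw term-count vectors of two texts.
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_matches(jd, resumes, k=2, min_score=0.2):
    # Score every resume against the JD, drop weak matches, keep the best k.
    scored = [(rid, cosine(jd, text)) for rid, text in resumes.items()]
    kept = [(rid, s) for rid, s in scored if s >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:k]


jd = "python developer with sql experience"
resumes = {
    "a": "python developer sql experience",
    "b": "forklift operator warehouse",
    "c": "java developer",
}
matches = top_k_matches(jd, resumes)
print([rid for rid, _ in matches])
```

Note how resume "c" makes the list just by sharing the word "developer" with the JD, which is exactly the weakness described above: a textual match says little about fitness for the role.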

The problem

Now let’s delve into what the news was about: their new recruiting engine did not like women. Top U.S. tech companies have yet to close the gender gap in hiring, a disparity most pronounced among technical staff such as software developers where men far outnumber women. Amazon’s experimental recruiting engine followed the same pattern, learning to penalize resumes including the word “women’s” until the company discovered the problem*.

People’s reactions to the news so far:

1) The immediate response was that AI is flawed.
2) AI will only be as biased as the data. Hence AI has revealed that Amazon’s recruiters may be biased towards men.
3) Amazon is a company brave enough to disclose a flaw in its model. Most companies wouldn’t do this.

Solution to the problem

Now I want to discuss how to make the algorithm unbiased. The problem is that words which appear mostly in women’s resumes get lower importance because they are seen less often in selected resumes. Amazon’s system penalized resumes that included the word “women’s,” as in “women’s chess club captain,” and it downgraded graduates of two all-women’s colleges*. There can be similar issues with words related to ethnicity.

Since gender and ethnicity words are not an indicator of a person’s skills, we can map these words to a common token like AAA. Both “Men’s chess club captain” and “Women’s chess club captain” are then mapped to “AAA’s chess club captain”. So if “AAA’s chess club captain” appears among selected candidates, men’s and women’s resumes will be given equal importance for these words. Also, it’s not just about the single words men or women. During vectorisation we also create bi-gram and tri-gram features, which will be “AAA’s chess” and “AAA’s chess club” in this case. Earlier these would have differed because they contained the words men and women.
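A minimal sketch of this mapping, using only Python's `re` module, could look like the following. The word list here is a tiny hypothetical one for illustration, and the names `neutralise` and `ngrams` are my own.

```python
import re

# Hypothetical, deliberately tiny word list; a real one needs careful curation.
GENDER_TERMS = ["women", "woman", "men", "man", "female", "male"]
PATTERN = re.compile(r"\b(" + "|".join(GENDER_TERMS) + r")\b", re.IGNORECASE)


def neutralise(text, token="AAA"):
    # Replace every whole-word gender term with the common token.
    return PATTERN.sub(token, text)


def ngrams(text, n):
    # Word n-grams built from whitespace-separated words.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


w = neutralise("Women's chess club captain")
m = neutralise("Men's chess club captain")
print(w, "|", m)
print(ngrams(w, 2))
print(ngrams(w, 3))
```

The word boundaries in the pattern matter: without them, "men" would also match inside unrelated words such as "management".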

So all we need is a bias-removal text pre-processing step before vectorisation, where we map gender/ethnicity words to a common token. A list of such words can be collected from HR’s observations or from a list (not all words in this list are useful). In my opinion, this exercise does not prove AI is flawed; it highlights the common knowledge that AI is only as good as its data, and if the data is dirty, it requires cleaning.


It is sad to see they solved the bias but discarded the project, as the article mentions: “Amazon edited the programs to make them neutral to these particular terms. But that was no guarantee that the machines would not devise other ways of sorting candidates that could prove discriminatory.” Like all research, AI is iterative in nature. Amazon spent considerable time building the algorithm, and now that the flaw has been discovered and rectified, it has led to a better algorithm. Only by going through these cycles of improvement can we hope to achieve a near-perfect unbiased algorithm. I am not sure why Amazon shut it down.

The article also mentions resumes containing words like ‘executed’ and ‘captured’ being scored unusually high. Taming the algorithm requires an in-depth understanding of both the vectorisation and the classification algorithm. TF-IDF/BM25 can cause havoc when it sees a highly unusual word in a resume: a rare word has a high IDF value, so its TF-IDF value can turn out large. The classifier can also give these unusual words a very high weight, leading to strange results. Such words have to be found through text exploration, model feature importances and algorithms for interpreting trained ML models. Once discovered, they can be removed from the vectorisation step manually, by some rule, or simply by setting a high minimum document frequency. This reduces the number of features (words) and helps cure overfitting. But it can also remove good features and lower the model’s accuracy, which is a concern for the data scientist.
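The effect of rare words on IDF, and the minimum document frequency fix, can be seen on a toy corpus. The four "resumes" below are invented, and the IDF formula is one common smoothed variant, ln((1 + N) / (1 + df)) + 1. Other libraries and BM25 use different formulas, so treat the numbers as illustrative only.

```python
import math
from collections import Counter

# Toy corpus: "executed" appears in only one document.
docs = [
    "managed python project",
    "managed java project",
    "managed python team",
    "executed python project",
]


def document_frequency(documents):
    # Number of documents in which each term appears at least once.
    df = Counter()
    for doc in documents:
        df.update(set(doc.split()))
    return df


def idf(term, df, n_docs):
    # Smoothed IDF variant: ln((1 + N) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def vocabulary(df, min_df=1):
    # Keep only terms that occur in at least min_df documents.
    return sorted(term for term, count in df.items() if count >= min_df)


df = document_frequency(docs)
idf_rare = idf("executed", df, len(docs))
idf_common = idf("managed", df, len(docs))
vocab = vocabulary(df, min_df=2)

print(round(idf_rare, 3), round(idf_common, 3))
print(vocab)
```

With a minimum document frequency of 2, the rare word is dropped before the classifier ever sees it, but so are "java" and "team", which shows the trade-off mentioned above.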

As far as I know, these problems are also found in recommendation algorithms, which Amazon is good at. It’s all a game of how large and varied the training data is and the math applied to it. Ideally, the training dataset should be huge and varied, and the algorithm should be tested on a large dataset. This type of problem arises when training data is scarce, and overfitting and bias come into play. The only way to remove it is to have a huge dataset, which might not be possible because Amazon is constrained by its own hiring (selected/unselected candidate) data. So we can estimate how much data we might need and how many years it might take to collect. If the number of years is large or uncertain, it makes sense to shut down the project. People might think AI has failed, but it might be a data problem, and that may be why Amazon shut it down for now. Remember why deep learning suddenly started working a few years back? Access to large amounts of tagged data, better computation and improvements in algorithms.

Amazon has discovered flaws not only in its own model but also in the models of other companies working in HR tech.

Last but not least, the interpretability of machine learning models has become critical with the increasing adoption of AI for real-world problems. I don’t want to stretch the point, but this also raises a question: should we start using AI as an introspective tool?

Let me know your thoughts by commenting or through LinkedIn.


An AI evangelist and a multi-disciplinary engineer. Loves to read business and psychology during leisure time. Connect with him any time on LinkedIn for a quick chat on AI!