Blown by Imbalance – Part 1

Imagine a bank wants to predict which of its future customers will default on a loan granted by the bank. The bank already has historical data from past years that says how many of its customers defaulted, what type of customers they were, and much other information about the past loans.

The bank has a dataset of 1000 rows containing columns like age, gender, income, marital-status and other related attributes, plus a variable named default that says whether the customer defaulted or not. This is a typical binary classification problem.
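
As a minimal sketch (the file name and column names here are hypothetical, assuming the data sits in a CSV), such a dataset might be loaded and inspected like this:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only
df = pd.read_csv("loans.csv")           # 1000 rows
print(df.columns.tolist())              # e.g. ['age', 'gender', 'income', 'marital_status', ..., 'default']
print(df["default"].value_counts())     # how many defaulted vs. not
```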

There are two levels in the default variable that we are trying to predict: default and not-default. One first has to decide which of the two will be treated as positive. It is just a convention; normally, the class of interest is treated as positive. Hence we will treat default as positive and not-default as negative.

We train a binary classification model on this dataset having 1000 rows. The predictions that the model makes for the training data will either be correct or incorrect.

There will be two cases for incorrect predictions:

  • False positive – Predicting positive when the actual was negative, i.e. classifying a customer as default when in reality he is not-default
  • False negative – Predicting negative when the actual was positive, i.e. classifying a customer as not-default when in reality he is default

There will be two cases for correct predictions:

  • True positive – Predicting positive when the actual was positive, i.e. correctly classifying default as default
  • True negative – Predicting negative when the actual was negative, i.e. correctly classifying not-default as not-default

Now, FP and FN are incorrect predictions (notice the False in the name) and TP and TN are the correct predictions. Don’t be in a hurry here. Take some time to digest these two-lettered acronyms. Read them out loud. Take a notebook and write them down on your own.
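
A small sketch (with made-up toy labels, where 1 = default/positive and 0 = not-default/negative) of how these four counts fall out of a set of predictions:

```python
import numpy as np

# Toy example: 1 = default (positive), 0 = not-default (negative)
actual    = np.array([1, 0, 1, 1, 0, 0, 0, 1])
predicted = np.array([1, 0, 0, 1, 1, 0, 0, 0])

tp = np.sum((predicted == 1) & (actual == 1))  # correctly predicted default
tn = np.sum((predicted == 0) & (actual == 0))  # correctly predicted not-default
fp = np.sum((predicted == 1) & (actual == 0))  # predicted default, actually not-default
fn = np.sum((predicted == 0) & (actual == 1))  # predicted not-default, actually default

print(tp, tn, fp, fn)  # 2 3 1 2
```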

Once we are comfortable with these terms, we can discuss something called the confusion matrix. Don’t get confused yet: if you understood TP, TN, FP and FN, then the confusion matrix is just a matrix holding these counts. The diagonal elements contain the counts of correct predictions (TP, TN) whereas the off-diagonal elements contain the counts of incorrect predictions (FP, FN). The rows are the predicted classes and the columns are the actual classes. It looks something like this:
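
Laid out with the convention described above (rows = predicted, columns = actual), the matrix has this shape:

                              Actual: default (+)    Actual: not-default (−)
  Predicted: default (+)              TP                       FP
  Predicted: not-default (−)          FN                       TN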

Why do we need TPR and FPR if we already have the mis-classification error?

The data I describe above is a typical case of imbalanced data, wherein one of the classes holds the majority of observations (90% non-defaulters (negatives) in our data) and the remaining class is a minority (only 10% defaulters (positives)). In such cases, the predictions on a new dataset will be skewed towards negatives, i.e. the model will classify a lot of defaulters (positives) as negatives. The bank can’t afford to have such predictions. The bank wants to know the defaulters (positives) for sure. Imagine the loss to the bank if the model classifies a probable defaulter (positive) as a non-defaulter.

In such cases, accuracy corresponding to mis-classification alone is not acceptable. The bank would be more interested in correctly classifying the positives as positives, i.e. the bank wants to classify the defaulters as defaulters without fail.

Into the picture come TPR, TNR, FPR and FNR. These three-lettered acronyms are nothing but the rates of TP, TN, FP and FN respectively:

TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
FPR = FP / (FP + TN)
FNR = FN / (FN + TP)

To digest the formulas, let’s go back to our data having 1000 rows – 100 defaulters (positives) and 900 non-defaulters (negatives). Suppose we employed a logistic regression that classified 80 defaulters correctly and incorrectly classified 90 non-defaulters as defaulters. Then TPR = 80 / 100 = 0.8 and FPR = 90 / 900 = 0.1.
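
As a quick check, here is a minimal sketch computing these rates from the numbers above:

```python
# Numbers from the worked example above
positives, negatives = 100, 900   # actual defaulters / non-defaulters
tp, fp = 80, 90                   # correctly flagged defaulters / wrongly flagged non-defaulters
fn = positives - tp               # 20 missed defaulters
tn = negatives - fp               # 810 correctly cleared non-defaulters

tpr = tp / (tp + fn)              # 80 / 100  = 0.8
fpr = fp / (fp + tn)              # 90 / 900  = 0.1
tnr = tn / (tn + fp)              # 810 / 900 = 0.9
fnr = fn / (fn + tp)              # 20 / 100  = 0.2
print(tpr, fpr, tnr, fnr)
```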

How do TPR and FPR get calculated?

Whenever you do any classification, the model gives you a probability of each observation belonging to each of the classes. Based on what cut-off you choose, you will get different predictions for the data and hence a different overall TPR and FPR. You can choose any probability cut-off in [0, 1], and each choice gives you a different (TPR, FPR) pair.
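
A sketch of how a cut-off turns probabilities into class predictions. Here `model` and `X` are assumptions: a hypothetical fitted scikit-learn classifier and its feature matrix.

```python
import numpy as np

# model and X are assumed to exist: a fitted sklearn classifier and the features
proba = model.predict_proba(X)[:, 1]    # probability of the positive class (default)

cutoff = 0.5                            # any value in [0, 1]
predicted = (proba >= cutoff).astype(int)
# A lower cut-off flags more customers as default -> higher TPR, but also higher FPR
```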

A (TPR, FPR) pair is generated for each probability cut-off one chooses, and these pairs are then plotted as an ROC curve.

One can plot these tuples (probability cut-off, FPR, TPR) on a graph. You know what this graph is called? ROC – Receiver Operating Characteristic. There is a trade-off between TPR and FPR. Depending on the requirement, one can choose the probability cut-off that best fulfils their purpose. For instance, in the bank’s case, the bank does not want to miss a single defaulter (positive), i.e. the bank wants a higher TPR. The ROC curve can be generated and plotted as in the sketch below.
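
scikit-learn can produce these (FPR, TPR, threshold) tuples in one call; a minimal sketch, assuming `actual` holds the true 0/1 labels and `proba` the corresponding predicted probabilities of default:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# actual: true 0/1 labels, proba: predicted probability of default
fpr, tpr, thresholds = roc_curve(actual, proba)

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```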

What about AUC?

AUC is nothing but the area under the ROC curve. Let’s say we built a logistic regression model that gave us a probability for each row. Now we try probability cut-offs from 0.1 to 1.0 with a step size of 0.1, i.e. we would have 10 probabilities to try, and corresponding to each of the 10 values we would get an (FPR, TPR) pair. If we plot these values on a graph we would get a graph having 10 points. This 10-point graph is what we call an ROC curve, and the area under it is the AUC.
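
A sketch of that 10-point construction (again assuming `actual` and `proba` as above); sklearn’s `auc` helper then computes the trapezoidal area under the points:

```python
import numpy as np
from sklearn.metrics import auc

cutoffs = np.arange(0.1, 1.01, 0.1)           # 0.1, 0.2, ..., 1.0
points = []
for c in cutoffs:
    pred = (proba >= c).astype(int)
    tp = np.sum((pred == 1) & (actual == 1))
    fp = np.sum((pred == 1) & (actual == 0))
    fn = np.sum((pred == 0) & (actual == 1))
    tn = np.sum((pred == 0) & (actual == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))   # (FPR, TPR)

points.sort()                                  # auc() needs the x values in order
fpr_vals, tpr_vals = zip(*points)
print(auc(fpr_vals, tpr_vals))                 # rough 10-point approximation of the AUC
```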

The AUC is a common evaluation metric for binary classification problems. Consider a plot of the true positive rate vs. the false positive rate as the threshold value for classifying an item as 0 or 1 is varied from 0 to 1: if the classifier is very good, the true positive rate will increase quickly and the area under the curve will be close to 1. If the classifier is no better than random guessing, the true positive rate will increase linearly with the false positive rate and the area under the curve will be around 0.5.

One characteristic of the AUC is that it is independent of the fraction of the test population which is class 0 or class 1: this makes the AUC useful for evaluating the performance of classifiers on unbalanced data sets.
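
To see why that matters on imbalanced data, consider a toy sketch: a useless “model” that scores every customer as non-default on a 90/10 split. Its accuracy looks great, while its AUC correctly reports that it has no discriminating power.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 900 non-defaulters (0) and 100 defaulters (1), like the bank's data
y_true = np.array([0] * 900 + [1] * 100)

# A useless model: same score and same prediction for every customer
scores = np.zeros(1000)
preds = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, preds))       # 0.9 -- looks impressive
print(roc_auc_score(y_true, scores))       # 0.5 -- no better than random guessing
```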

The larger the area, the better. If we have to choose between two classifiers having different AUCs, we choose the one with the larger AUC.

Choosing the probability cut-off

You choose some probability cut-offs, say from 0.5 to 0.9 with some increment, say 0.05, and calculate the TPR and FPR corresponding to each probability value.

You have to decide how much TPR and FPR you want. There is a trade-off between the TPR and FPR: if you increase the TPR, your FPR will also increase. So depending on whether you want to detect all the positives (higher TPR) and are willing to incur some error in terms of FPR, you decide the probability cut-off.
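
A sketch of that scan (cut-offs 0.5 to 0.9 in steps of 0.05, again assuming `actual` and `proba` as in the earlier sketches):

```python
import numpy as np

for cutoff in np.arange(0.5, 0.91, 0.05):
    pred = (proba >= cutoff).astype(int)
    tpr = np.mean(pred[actual == 1])   # fraction of actual positives flagged as default
    fpr = np.mean(pred[actual == 0])   # fraction of actual negatives wrongly flagged
    print(f"cutoff={cutoff:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```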

See this sklearn example for more info.
