Imbalanced data and credit card fraud detection
In 2018, just under five million people fell victim to debit or credit card fraud in the UK – with over £2 billion stolen in total, averaging £833 per person. By 2025, the global losses to credit card fraud are expected to reach almost $50 billion.
While Mastercard and VISA’s chip-enabled cards have been effective at combating physical credit card crime, the online world continues to be afflicted. Now, companies are looking for a solution that can better detect and prevent fraudulent transactions – and many have turned their attention to solutions that utilise machine learning techniques.
Minority classes in the imbalanced gym
Detecting fraudulent transactions in a large dataset poses a problem because they are a minority class. For example, there may only be 1,000 cases of fraud in every million transactions, representing a minute fraction (0.1%) of the full dataset.
In data science, these imbalanced datasets can be very difficult to analyse, because machine learning algorithms tend to show a bias for the majority class, leading to misleading conclusions.
For example, imagine you are the manager of a gym and the gym owner asks you to predict the likelihood of each customer renewing their gym membership at the end of the year. To do this, you will need to examine existing data on each member – e.g. frequency of visits, joining date, equipment preferences, etc. – to determine whether they fall into one of two categories: will renew or will not renew.
Analysing this data is made difficult by the fact that the gym has an abnormally high retention rate: 99% of customers have renewed their membership to date. Simply put, the non-renewers are the minority class.
A quick and easy way to proceed in this circumstance would be to predict that 100% of gym members will renew in the next year, giving you a 99% accuracy rate. Sounds great, right?
But this model doesn’t hold up because the gym manager – or algorithm – has failed to learn anything about which gym members are least likely to renew their membership. So although the prediction has a “good accuracy rate”, it ultimately delivers no value.
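The gym example can be checked with a few lines of Python. This is a minimal sketch rather than a real model: the 1,000-member dataset is made up to match the 99% retention rate, and the helper functions `accuracy` and `recall` are written here purely for illustration.

```python
# Hypothetical gym data: 1 = renewed, 0 = did not renew (99% retention).
actual = [1] * 990 + [0] * 10

# The "quick and easy" model: predict that every member renews.
predicted = [1] * len(actual)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of members with the given true label that the model identified."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


print(f"accuracy: {accuracy(actual, predicted):.2%}")
print(f"recall on non-renewers: {recall(actual, predicted, label=0):.2%}")
```

The accuracy comes out at 99%, but the recall on non-renewers is zero: the model never identifies a single member who is about to leave, which is the only group the gym owner needed to hear about.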
Over, under and GAN
Traditionally, there have been two popular ways of handling imbalanced datasets: oversampling and undersampling.
Oversampling is achieved by artificially generating new observations in the data set belonging to the underrepresented class (e.g. fraudulent transactions). There are a number of techniques that data scientists use for oversampling, including SMOTE (Synthetic Minority Over-sampling Technique), which can create synthetic observations of the minority class.
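The core idea behind SMOTE can be sketched as follows. This is a simplified illustration, not the reference algorithm or any library's implementation: the function name `smote_like_oversample`, the two features and the sample values are all assumptions, and a real project would normally scale the features first and use a tested implementation such as the `SMOTE` class in the imbalanced-learn package.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the line segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical fraud records as (amount, hour_of_day) pairs.
fraud = [(120.0, 2.0), (950.0, 3.0), (430.0, 1.0), (610.0, 4.0), (300.0, 23.0)]
new_fraud = smote_like_oversample(fraud, n_new=10)
print(f"{len(fraud)} real + {len(new_fraud)} synthetic fraud records")
```

Each synthetic record sits somewhere between two real fraud records, so it resembles the minority class without being an exact copy of any one observation.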
Undersampling works in the opposite way: it reduces the number of samples in the overrepresented class (e.g. non-fraudulent transactions) to “balance” the dataset. The easiest way to undersample is to randomly remove observations from the majority class – but with this technique, datasets need to be large enough to mitigate the effects of the removal of data points.
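Random undersampling is simple enough to sketch with the standard library. The function name `random_undersample` and the transaction IDs below are hypothetical, and the sketch assumes a fully balanced 1:1 result, which is only one of several possible target ratios.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Keep all minority rows and a random, equally sized subset of majority rows."""
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, k=len(minority))
    return kept_majority, minority


# Hypothetical transaction IDs: 9,990 legitimate and 10 fraudulent.
legit = [f"legit-{i}" for i in range(9990)]
fraud = [f"fraud-{i}" for i in range(10)]

kept_legit, kept_fraud = random_undersample(legit, fraud)
print(f"kept {len(kept_legit)} legitimate and {len(kept_fraud)} fraudulent rows")
print(f"discarded {len(legit) - len(kept_legit)} legitimate rows")
```

The second print line shows the cost of the approach: almost all of the majority-class data is thrown away, which is why the technique only suits datasets that are large to begin with.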
Augmentation based on generative adversarial networks (GANs) is another technique that is growing in popularity. While oversampling creates synthetic observations that are almost identical to the original observations in the minority class, GANs look to take this one step further and generate new, unique observations that look and behave even more like real-life data.
This technology was recently utilised to produce photographs of artificial faces via the website This Person Does Not Exist. The underlying code – StyleGAN – was written by Nvidia and uses a dataset of celebrity faces to produce unique images with randomly tweaked visual features (e.g. shape, size, pose and hair colour). The result is an astonishing but also slightly unsettling series of hyper-realistic – but totally fake – human headshots.
At Hazy we have an array of proprietary synthetic data generation algorithms that extend the capabilities of GAN and other related algorithms. These models integrate with our synthetic data and model optimisation tooling, enabling us to select the best possible generation algorithm for each specific use case. The resulting Hazy data is therefore optimised for each client’s data structure as well as the problem they are looking to solve.
What this means for credit card fraud
Banks and financial institutions are in need of a solution that can rebalance their datasets and correctly identify both fraudulent and non-fraudulent transactions. But at the same time, it’s essential that the algorithms keep both false negatives and false positives to a minimum.
False negatives are cases that the algorithm incorrectly predicts as negative. In credit card fraud, this could mean that a fraudulent transaction goes undetected and the fraudster successfully steals money from a customer’s account.
False positives occur when the algorithm flags a case as positive when it is actually negative. This would likely result in the bank blocking a customer’s account for fraudulent behaviour when actually there was none.
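The two error types can be counted directly from a list of predictions. The sketch below treats fraud as the positive class; the ten labels are invented for illustration and the helper `confusion_counts` is not taken from any particular library.

```python
def confusion_counts(y_true, y_pred):
    """Count the four prediction outcomes, treating 1 (fraud) as positive."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            counts["tp"] += 1  # fraud correctly flagged
        elif t == 0 and p == 1:
            counts["fp"] += 1  # legitimate transaction wrongly flagged
        elif t == 1 and p == 0:
            counts["fn"] += 1  # fraud that went undetected
        else:
            counts["tn"] += 1  # legitimate transaction correctly passed
    return counts


# Hypothetical labels for ten transactions: 1 = fraud, 0 = legitimate.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

c = confusion_counts(actual, predicted)
precision = c["tp"] / (c["tp"] + c["fp"])  # share of fraud alerts that were real
recall = c["tp"] / (c["tp"] + c["fn"])     # share of real fraud that was caught
print(c)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

Precision and recall report on the two error types separately, which makes them more informative than overall accuracy when fraud is rare.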
Ultimately, if the data is imbalanced, even a model with a 99% accuracy rate will let a significant number of false negatives and false positives slip through the net – and only a balanced dataset can deliver the fast and effective solution warranted by the abundance of fraud in the world of finance.
Applications outside of finance
The applications of rebalancing imbalanced datasets range far and wide. In fact, any industry where valuable insights can be gleaned from rare events will experience imbalanced data problems in statistical models.
For example, the insurance industry is built on modelling risk. Rare events like extreme weather or train derailments are difficult to predict in current models, but ultimately they could have a significant effect on pricing.
Healthcare professionals also have a notoriously difficult time detecting rare genetic diseases, because patients with these diseases form the minority class in any dataset. In this industry, where even a single false negative can mean a patient goes undiagnosed, applying effective rebalancing algorithms to patient data could quite literally be the difference between life and death.
The future is balanced
Investment in technology for fraud detection has both increased and evolved over the years. We now have complex techniques within data science that attempt to address the issue of data imbalance, such as oversampling and undersampling, and even more sophisticated technologies appear to be on the horizon.
Whatever method data scientists gravitate towards, the desired outcome is data that acts and behaves naturally – i.e. a dataset that is the statistical equivalent of one collected in the real world. Without this, a significant number of fraudulent credit card transactions will continue to go undetected.