The Simility Blog

Quantifying Insight: Fraud Data Scientists Translate a Hunch into a Probability
Jayan Tharayil
April 11, 2016

Using credit card fraud as an example, how does a data scientist translate an initial observation into a tangible prediction that can be used to drive business decisions and identify fraud?

Don’t get me wrong, I love The Scientific Method as much as the next person. Carefully ingrained in high school sophomores everywhere, it’s a staple of our education system, along with derivatives and photosynthesis. First, ask a question, next generate a hypothesis, after that carefully run an experiment (in a vacuum if at all possible, please), painstakingly collect some data, analyze it using standard statistical tests, and finally, write up a discussion of how the initial predictions played out.

Each step of The Scientific Method follows the next in a precise, concrete sequence; if something goes wrong at any point, start all over at the very beginning. The researcher always knows exactly what she’s looking for, and she either finds it or she doesn’t. Case closed.

But the opposite, and far more intriguing, scenario piques my interest much more: what if I have a massive data set, replete with trends, interactions, and real-life outcomes, but don’t know exactly what I’m looking for?

Modern databases are bursting at the seams thanks to Jane Doe’s 1,001 most recent purchases, the timing of her 5 REM sleep cycles each night, what time she’s most likely to buy airplane tickets, and the average screen brightness of her smartphone at any hour of the day. In the past, we’ve relied on the rigor of science to carefully search for answers to one question at a time. Nowadays it’s merely a matter of knowing the right question to ask.

To avoid becoming overwhelmed, it’s essential to set out with a clear goal. Let’s say I’m trying to identify someone committing fraud. Surely, a majority of this aforementioned data is irrelevant to my use case. But undoubtedly, statistics reassures us, the larger the data set, the higher the chance of an unsuspected correlation.

What if the number of alarms set on Jane’s iPhone clock app can be used to predict her likelihood of using a stolen credit card?

With a couple of quick database queries and the aid of open-source data mining software, I can check if the suspected association exists, whether it’s statistically significant, and if it does, apply my insight to millions of other people who may or may not set multiple alarms to help themselves wake up in the morning. All in the time it takes to pronounce the word “hypothesis”.

Even so, searching for correlations between thousands of columns in a database can sometimes feel like chasing a mirage in the desert. For the sake of simplicity, let’s narrow down this example even further; say I’m looking at user account data and trying to find simple patterns that will predict likelihood of credit card fraud.

I’m definitely not the first one to notice that fraudsters use gibberish email addresses more often than good users do. Sure, you’re probably thinking, “ghghgh@freemails.com” obviously seems more fraudulent than “janecatherinedoe@gmail.com”. But how do you go about quantifying that difference? Any data scientist will tell you that there’s no single magical way to translate from insight to number. But aggregating various rough approximations can oftentimes be good enough.

Here are a few key observations I’ve made about the differences between these particular email addresses:

  • Collectively, my fingers had to move a lot farther over the keyboard to type out the real email address as opposed to the fake one
  • The real email has a decent number of vowels in the “handle” portion (before the “@” symbol), but the fake email has none
  • The real email comes from a common, well-reputed email provider, while the fake one comes from a more questionable domain name
  • The real email is much longer than the fake email

Given the proper software tools, it’s shockingly easy to translate each of these ideas into an algorithm and “score” each incoming email on all four of these dimensions (in this type of analysis, each dimension is typically called a “feature”). To protect proprietary information about our product, I won’t go into the specifics of how I’ve actually implemented these analyses in my role as a Data Scientist at Simility.
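To make the idea concrete, here is a minimal sketch of what scoring those four observations might look like. Everything here is an illustrative assumption on my part (the keyboard layout approximation, the domain list, the function names); it is not Simility’s actual implementation.

```python
# Toy versions of the four email "features" described above.
# All names and lists here are illustrative assumptions, not Simility's
# production code.

# Approximate QWERTY key positions as (row, column) pairs, used to
# estimate how far the fingers travel while typing.
QWERTY = {c: (r, col)
          for r, row in enumerate(["qwertyuiop", "asdfghjkl", "zxcvbnm"])
          for col, c in enumerate(row)}

def travel_distance(handle):
    """Total Manhattan distance covered typing the handle's letters."""
    keys = [QWERTY[c] for c in handle.lower() if c in QWERTY]
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1])
               for a, b in zip(keys, keys[1:]))

def vowel_ratio(handle):
    """Fraction of letters that are vowels; gibberish tends to be vowel-poor."""
    letters = [c for c in handle.lower() if c.isalpha()]
    return sum(c in "aeiou" for c in letters) / len(letters) if letters else 0.0

# A stand-in reputation list; a real system would use a richer source.
REPUTABLE_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com"}

def email_features(email):
    """Score one email address on the four dimensions from the list above."""
    handle, _, domain = email.partition("@")
    return {
        "travel": travel_distance(handle),
        "vowel_ratio": vowel_ratio(handle),
        "reputable_domain": domain.lower() in REPUTABLE_DOMAINS,
        "length": len(email),
    }

print(email_features("janecatherinedoe@gmail.com"))
print(email_features("ghghgh@freemails.com"))
```

Run on the two example addresses, the real one scores higher on every dimension: more finger travel, more vowels, a reputable domain, and a longer string.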

Now, given a new email address (say, “qwrty@junkmail.com”), it becomes simple to generate a score on each of these four features and immediately triangulate them to indicate a high fraud probability. Furthermore, the computational simplicity of this algorithm means that we can score any new account’s email practically instantaneously. Now, imagine using hundreds of features instead of just four. And imagine applying them not just to email addresses, but also to other data points, such as IP addresses, unique device identifiers, zip codes, specific items purchased, and so on. And if your brain doesn’t hurt yet, imagine then feeding all these variables into a complex Machine Learning algorithm that uses multiple decision trees, each hundreds of layers deep, to predict fraud probabilities. Given merely a couple data points about any new account, applying these steps allows us to instantaneously generate a highly accurate “likelihood of fraud” score for that user.

Suddenly the daunting task of targeting fraud patterns has become a compelling puzzle! That’s the beauty of Simility. To learn more about intelligent fraud analytics or to see how Simility fights fraud firsthand, please contact us.

Written by Lauren Edelson, Data Scientist at Simility.