The Simility Blog

Data-Driven Fraud Prevention Starts With a Data Lake
Jayan Tharayil
May 21, 2020

Data lakes can play a prime role in driving competitive advantages as part of machine learning-powered fraud decisioning.

Data is the fuel that powers modern business. And when it comes to online fraud, it’s highly sought after by both scammers and in-house risk teams. The quality and variety of data fed into a decisioning platform have a significant bearing on how well that platform can spot fraud patterns. Organizations need to consider how their tools collect and manage data – especially if it’s unstructured, which is increasingly common in today’s landscape.


The volume of data in the world is forecast to increase dramatically, as organizations embrace digital transformation to get closer to their customers and create unique online experiences. Some estimates claim it will reach 50.5ZB by the end of this year, and scale up to 175ZB in five years.(1)

Increasingly, this information is no longer found in structured formats in easily searchable, ordered databases. Rather, it is unstructured: a catch-all term covering everything from text files and emails to social media, websites, mobile and communications data, business applications, and more. Crucially, data from many of these sources needs to be fed into fraud prevention tools.

Scammers have become masters at collecting data from breached companies (usually consumers’ identity information and log-ins) and using it to impersonate people online. Last year, scammers made $16.9 billion from opening fictitious accounts, hijacking users’ accounts, and using stolen credit cards. Experts warn that the damage is likely to increase during the current global health and economic crisis.(2)


Organizations need to get better at spotting such fraud attempts. But too often they’re working with tools built for a long-gone era when structured data was king. These legacy systems rely on data warehouses that can only ingest structured data, so fraud tools built on top of them generate limited insight. What’s more, modifying these systems requires analysts to write complex statements or SQL queries, making the whole process more cumbersome than it needs to be.

If organizations can’t spot fraud patterns in a timely and accurate manner, their solutions may allow too many scams through, pushing up chargeback costs and damaging the brand. On the other hand, they might flag legitimate transactions as fraudulent (false positives), which can negatively impact sales, lead to customer churn, and increase admin costs if more manual reviews are required.
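
To make the two failure modes concrete, here is a minimal Python sketch of how a risk team might measure them from a set of reviewed decisions. The function name, the input shape, and the sample data are all assumptions made for illustration; they do not come from any particular fraud platform.

```python
def decision_rates(decisions):
    """Summarize fraud decisions.

    `decisions` is an iterable of (flagged, is_fraud) boolean pairs, where
    `flagged` is what the tool decided and `is_fraud` is the reviewed truth.
    This input shape is an assumption made for this sketch.
    """
    false_pos = true_neg = missed = caught = 0
    for flagged, is_fraud in decisions:
        if is_fraud:
            if flagged:
                caught += 1      # fraud correctly stopped
            else:
                missed += 1      # fraud allowed through (chargeback risk)
        else:
            if flagged:
                false_pos += 1   # legitimate transaction wrongly flagged
            else:
                true_neg += 1    # legitimate transaction correctly approved

    legit_total = false_pos + true_neg
    fraud_total = missed + caught
    return {
        # Share of legitimate transactions that were flagged as fraud.
        "false_positive_rate": false_pos / legit_total if legit_total else 0.0,
        # Share of fraudulent transactions that were not flagged.
        "missed_fraud_rate": missed / fraud_total if fraud_total else 0.0,
    }


# Made-up sample: 5 legitimate transactions (1 flagged), 4 fraudulent (1 missed).
SAMPLE = [
    (False, False), (False, False), (False, False), (False, False), (True, False),
    (True, True), (True, True), (True, True), (False, True),
]

if __name__ == "__main__":
    print(decision_rates(SAMPLE))
```

Tracking both rates side by side is the point of the sketch: tightening rules to lower one typically raises the other, so neither number is meaningful alone.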


The key is to invest in fraud solutions that feature data lake capabilities. These systems can draw in data from a wide variety of structured and unstructured sources, at scale and speed. These sources could include business applications like CRM and ERP repositories, social media, device data, server logs, international blacklists, internal whitelists, messaging and comms platforms, and much more.

By combining different data streams across in-house and third-party sources, many of them from identity verification-type services, businesses can generate actionable intelligence. Simility’s flexible data lake is the foundational layer on which we build machine learning models to spot sophisticated fraud patterns. Thanks to its architecture, businesses can pick and choose which data feeds they want to add and include new ones as their needs change, all without specialized coding knowledge or expensive investments.
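
As a rough illustration of what combining data streams can look like, the Python sketch below joins a structured source, an unstructured source, and a third-party feed into one per-user profile. It is not Simility’s implementation: the record layouts, the log format, the field names, and the `build_profiles` helper are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical structured source: rows as they might come from a CRM export.
CRM_ROWS = [
    {"user_id": "u1", "account_age_days": 412},
    {"user_id": "u2", "account_age_days": 2},
]

# Hypothetical unstructured source: raw server log lines.
LOG_LINES = [
    "2020-05-20T10:01:02Z login user=u1 ip=203.0.113.7 status=ok",
    "2020-05-20T10:05:40Z login user=u2 ip=198.51.100.9 status=fail",
    "2020-05-20T10:05:55Z login user=u2 ip=198.51.100.23 status=fail",
    "malformed line that should be skipped",
]

# Hypothetical third-party feed: IP addresses on a blocklist.
BLOCKED_IPS = {"198.51.100.23"}

# Assumed log layout; real logs would need their own parser.
LOG_PATTERN = re.compile(r"user=(?P<user>\S+) ip=(?P<ip>\S+) status=(?P<status>\S+)")


def build_profiles(crm_rows, log_lines, blocked_ips):
    """Merge the three sources into one feature dict per user."""
    profiles = defaultdict(
        lambda: {"failed_logins": 0, "ips": set(), "blocked_ip_seen": False}
    )

    # Structured data maps directly onto profile fields.
    for row in crm_rows:
        profiles[row["user_id"]]["account_age_days"] = row["account_age_days"]

    # Unstructured data is parsed first; lines that do not match are skipped.
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue
        profile = profiles[match["user"]]
        profile["ips"].add(match["ip"])
        if match["status"] == "fail":
            profile["failed_logins"] += 1
        if match["ip"] in blocked_ips:
            profile["blocked_ip_seen"] = True

    # Replace the raw IP set with a count, which is easier to feed to a model.
    result = {}
    for user_id, profile in profiles.items():
        features = {k: v for k, v in profile.items() if k != "ips"}
        features["distinct_ips"] = len(profile["ips"])
        result[user_id] = features
    return result


if __name__ == "__main__":
    for user_id, features in build_profiles(CRM_ROWS, LOG_LINES, BLOCKED_IPS).items():
        print(user_id, features)
```

In this made-up data, no single source says much about user `u2`, but the merged profile shows a two-day-old account with repeated failed log-ins from two IP addresses, one of them blocklisted. That combined view is the kind of signal a downstream model could use.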

With this intelligence to hand, organizations can help reduce false positives and fraud losses, improve conversion rates and minimize friction, allowing them to maximize profits at a time when finances for many are stretched to the limit.  

  1. Statista, February 28, 2020.
  2. Javelin Strategy & Research, April 7, 2020.