Anomaly detection is an important task in many industries, especially finance, where detecting fraudulent transactions can prevent substantial losses. This tutorial shows how to implement an anomaly detection application that identifies fraudulent transactions by using a boosted tree model. The application uses Google Cloud services such as AI Platform, Dataflow, Pub/Sub, and BigQuery.
The objectives of this tutorial are:
○ Write transaction data from the sample dataset to a transactions table in BigQuery.
○ Send micro-batched requests to the hosted model to retrieve fraud probability predictions, and write the results to a fraud_detection table in BigQuery.
The boosted tree model used in this tutorial is trained on the Synthetic Financial Dataset For Fraud Detection from Kaggle. This dataset was generated using the PaySim simulator. We use a synthetic dataset because there are few financial datasets appropriate for fraud detection, and those that exist often contain personally identifiable information (PII) that needs to be anonymized.
The sample application consists of the following components:
○ Publishes transaction data from a Cloud Storage bucket to a Pub/Sub topic, then reads that data as a stream from a Pub/Sub subscription to that topic.
○ Gets fraud likelihood estimates for each transaction by using the Apache Beam Timer API to micro-batch calls to the AI Platform prediction API.
○ Writes transaction data and fraud likelihood data to BigQuery tables for analysis.
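The micro-batching idea in the second component can be illustrated without Beam. The following is a minimal plain-Python sketch, not the tutorial's Dataflow pipeline: it buffers incoming transactions and sends them to the model as one batch when either a size limit or a time limit is reached, which is roughly the role that a state buffer and a timer play in a Beam pipeline. The class name MicroBatcher, its parameters, and the fake_predict callback are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, List, Optional


class MicroBatcher:
    """Buffers elements and flushes them in batches (hypothetical helper).

    Unlike a real Beam timer, the deadline here is only checked when a new
    element arrives, so callers should call flush() once at the end.
    """

    def __init__(
        self,
        predict_fn: Callable[[List[Dict]], List[Dict]],
        max_size: int = 3,
        max_wait_secs: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._predict_fn = predict_fn
        self._max_size = max_size
        self._max_wait_secs = max_wait_secs
        self._clock = clock
        self._buffer: List[Dict] = []
        self._deadline: Optional[float] = None

    def add(self, element: Dict) -> List[Dict]:
        """Adds one element and returns any predictions produced by a flush."""
        results: List[Dict] = []
        now = self._clock()
        # Time-based trigger: the pending batch has waited long enough.
        if self._deadline is not None and now >= self._deadline:
            results.extend(self.flush())
        # The first element of a new batch starts the wait window.
        if not self._buffer:
            self._deadline = now + self._max_wait_secs
        self._buffer.append(element)
        # Size-based trigger: the batch is full.
        if len(self._buffer) >= self._max_size:
            results.extend(self.flush())
        return results

    def flush(self) -> List[Dict]:
        """Sends the buffered elements as one request and clears the buffer."""
        if not self._buffer:
            return []
        batch, self._buffer = self._buffer, []
        self._deadline = None
        return self._predict_fn(batch)


def fake_predict(batch: List[Dict]) -> List[Dict]:
    """Stand-in for the hosted model: flags amounts above 1000 as fraud."""
    return [
        {
            "transaction_id": t["id"],
            "fraud_probability": 1.0 if t["amount"] > 1000 else 0.0,
        }
        for t in batch
    ]
```

Batching trades a small delay per transaction for fewer calls to the prediction service. In a real pipeline, fake_predict would be replaced by a call to the deployed model, and the returned records would be written to the fraud_detection table.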
The following diagram illustrates the architecture of the anomaly detection solution:
This tutorial uses the following billable components of Google Cloud: AI Platform, BigQuery, Cloud Storage, Dataflow, and Pub/Sub.
Before you begin, you need to select or create a Google Cloud project to work in.
○ Note: If you don't plan to keep the resources that you create in this tutorial, create a new project instead of selecting an existing one. After you finish, you can delete the project, which removes all resources associated with it.
We start by creating a boosted tree model with TensorFlow. The model estimates the probability that a financial transaction is fraudulent, and it is trained on the Kaggle dataset described above.
Steps:
· Create a notebook.
· Download the sample data.
· Prepare the data for use in training.
· Create and train the model.
· Test the model.
· Export the model.
· Deploy the model to AI Platform.
· Get predictions from the deployed model.
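As an illustration of the data preparation step, here is a minimal sketch using only the Python standard library. It assumes the column names of the PaySim CSV as published on Kaggle (verify them against your download), and the sample rows are made up for the example. Which columns to keep as features is an illustrative choice here, not necessarily what the tutorial's notebook does; the account identifier columns and isFlaggedFraud are simply left out.

```python
import csv
import io
import random
from typing import Dict, List, Tuple

# Assumed PaySim column names; check them against the downloaded file.
NUMERIC_COLUMNS = [
    "step", "amount", "oldbalanceOrg", "newbalanceOrig",
    "oldbalanceDest", "newbalanceDest",
]
LABEL_COLUMN = "isFraud"

# Made-up rows for illustration only, not real dataset records.
SAMPLE_CSV = "\n".join([
    "step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,"
    "nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud",
    "1,PAYMENT,100.0,C1,500.0,400.0,M1,0.0,0.0,0,0",
    "1,TRANSFER,900.0,C2,900.0,0.0,C3,0.0,0.0,1,0",
    "2,CASH_OUT,900.0,C3,900.0,0.0,C4,50.0,950.0,1,0",
    "2,DEBIT,20.0,C5,80.0,60.0,C6,10.0,30.0,0,0",
    "3,CASH_IN,300.0,C7,0.0,300.0,C8,400.0,100.0,0,0",
])


def prepare_rows(csv_text: str) -> Tuple[List[Dict], List[int]]:
    """Parses CSV text into feature dictionaries and integer labels."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Numeric columns become floats; the transaction type stays a string
        # so that the model code can treat it as a categorical feature.
        example = {name: float(row[name]) for name in NUMERIC_COLUMNS}
        example["type"] = row["type"]
        features.append(example)
        labels.append(int(row[LABEL_COLUMN]))
    return features, labels


def train_test_split(features, labels, test_fraction=0.2, seed=42):
    """Shuffles deterministically and holds out a fraction for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [features[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)
```

Fraud labels in this kind of dataset are typically rare, so a real notebook may also need to account for class imbalance when splitting and evaluating; this sketch does not.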
Once we have developed the model, we will deploy it to AI Platform for online prediction. This will allow us to send transaction data to the hosted model and receive fraud probability predictions in return.
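To make the shape of an online prediction call concrete, the following sketch builds the URL and JSON body for an AI Platform predict request without sending it. The body wraps the input records in an "instances" list, which is the format AI Platform online prediction expects. The function name build_predict_request is hypothetical, as are the project and model names in the usage note below, and the feature keys inside each instance are assumptions: they must match whatever serving signature the exported model actually defines.

```python
import json
from typing import Dict, Iterable, Optional, Tuple


def build_predict_request(
    project: str,
    model: str,
    transactions: Iterable[Dict],
    version: Optional[str] = None,
) -> Tuple[str, str]:
    """Returns the (url, json_body) pair for an online prediction request.

    Nothing is sent here; an authenticated HTTP client would POST json_body
    to url.
    """
    name = f"projects/{project}/models/{model}"
    if version:
        # Without a version, the model's default version serves the request.
        name += f"/versions/{version}"
    url = f"https://ml.googleapis.com/v1/{name}:predict"
    body = json.dumps({"instances": list(transactions)})
    return url, body
```

For example, build_predict_request("my-project", "fraud_model", batch) would target the default version of a model named fraud_model in a project named my-project, with batch being a list of transaction dictionaries. Sending one request per micro-batch, rather than one per transaction, is what keeps the number of calls to the service manageable.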