Anomaly Detection with Boosted Trees: A Tutorial

Anomaly detection is a crucial task in many industries, especially in finance where detecting fraudulent transactions can save billions of dollars. In this tutorial, we will show you how to implement an anomaly detection application that identifies fraudulent transactions by using a boosted tree model. The application will use Google Cloud Platform services such as AI Platform, Dataflow, Pub/Sub, and BigQuery.

Objectives

The objectives of this tutorial are:

  1. Create a boosted tree model that estimates the probability of fraud in financial transactions.
  2. Deploy the model to AI Platform for online prediction.
  3. Use a Dataflow pipeline to:

      Write transaction data from the sample dataset to a transactions table in BigQuery.

      Send micro-batched requests to the hosted model to retrieve fraud probability predictions and write the results to a fraud_detection table in BigQuery.

  4. Run a BigQuery query that joins these tables to see the probability of fraud for each transaction.
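The final objective, joining the two tables, can be illustrated locally. The sketch below uses Python's built-in sqlite3 as a stand-in for BigQuery; the table and column names (`transactions`, `fraud_detection`, `transaction_id`, `probability`) are assumptions for illustration, and the tutorial's actual schema may differ.

```python
import sqlite3

# Build an in-memory database as a local stand-in for BigQuery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        transaction_id TEXT PRIMARY KEY,
        amount REAL
    );
    CREATE TABLE fraud_detection (
        transaction_id TEXT,
        probability REAL
    );
    INSERT INTO transactions VALUES ('tx-001', 9839.64), ('tx-002', 181.00);
    INSERT INTO fraud_detection VALUES ('tx-001', 0.02), ('tx-002', 0.97);
""")

# Join the two tables to see the fraud probability for each transaction,
# analogous in spirit to the BigQuery query in the final step.
query = """
    SELECT t.transaction_id, t.amount, f.probability
    FROM transactions AS t
    JOIN fraud_detection AS f
      ON t.transaction_id = f.transaction_id
    ORDER BY f.probability DESC
"""
rows = conn.execute(query).fetchall()
for tx_id, amount, prob in rows:
    print(f"{tx_id}: amount={amount:.2f}, fraud probability={prob:.2f}")
```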

Dataset

The boosted tree model used in this tutorial is trained on the Synthetic Financial Dataset For Fraud Detection from Kaggle. This dataset was generated using the PaySim simulator. We use a synthetic dataset because there are few financial datasets appropriate for fraud detection, and those that exist often contain personally identifiable information (PII) that needs to be anonymized.

Architecture

The sample application consists of the following components:

  1. A boosted tree model developed using TensorFlow and deployed to AI Platform.
  2. A Dataflow pipeline that completes the following tasks:

      Publishes transaction data from a Cloud Storage bucket to a Pub/Sub topic, then reads that data as a stream from a Pub/Sub subscription to that topic.

      Gets fraud likelihood estimates for each transaction by using the Apache Beam Timer API to micro-batch calls to the AI Platform prediction API.

      Writes transaction data and fraud likelihood data to BigQuery tables for analysis.
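In the real pipeline, micro-batching is implemented with the Apache Beam Timer API inside a streaming Dataflow job. The pure-Python sketch below only illustrates the underlying idea, buffering elements and flushing them once a size threshold is reached; all class and field names are illustrative, not part of the tutorial's code.

```python
from typing import Callable, List


class MicroBatcher:
    """Buffers items and flushes them in batches, mimicking the
    count-based trigger that the Beam Timer API provides in the
    real Dataflow pipeline (names here are illustrative)."""

    def __init__(self, flush_fn: Callable[[List[dict]], None],
                 max_batch_size: int = 4):
        self.flush_fn = flush_fn
        self.max_batch_size = max_batch_size
        self.buffer: List[dict] = []

    def add(self, item: dict) -> None:
        self.buffer.append(item)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self) -> None:
        # Emit the buffered batch (one prediction request) and reset.
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []


# Collect the batches that would each become one prediction request.
batches = []
batcher = MicroBatcher(batches.append, max_batch_size=4)
for i in range(10):
    batcher.add({"transaction_id": f"tx-{i:03d}"})
batcher.flush()  # flush the final partial batch

print([len(b) for b in batches])  # batch sizes: [4, 4, 2]
```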

The following diagram illustrates the architecture of the anomaly detection solution:




Costs

This tutorial uses the following billable components of Google Cloud:

  1. AI Platform
  2. BigQuery
  3. Cloud Storage
  4. Compute Engine
  5. Dataflow
  6. Pub/Sub

Before You Begin

Before you begin, you need to:

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

      Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

  3. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
  4. Enable the AI Platform Training and Prediction, Cloud Storage, Compute Engine, Dataflow, and Notebooks APIs.
  5. Check the Compute Engine API quotas available in the us-central1 region; you need these quotas to run the Dataflow job used in this tutorial. If you don't have them, request a quota increase.

Boosted Tree Model Development

We will start by creating a boosted tree model using TensorFlow. This model will be used to estimate the probability of fraud in financial transactions. To do this, we will use the Synthetic Financial Dataset For Fraud Detection from Kaggle.

Steps:

  1. Create a notebook.
  2. Download the sample data.
  3. Prepare the data for use in training.
  4. Create and train the model.
  5. Test the model.
  6. Export the model.
  7. Deploy the model to AI Platform.
  8. Get predictions from the deployed model.
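The tutorial itself builds the model with TensorFlow's boosted trees estimator in a notebook. As a lightweight, self-contained illustration of the "create, train, and test" steps above, the sketch below uses scikit-learn's GradientBoostingClassifier on synthetic data; the features and labels are invented for illustration and do not match the PaySim schema.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for transaction features (invented data; the
# tutorial trains on the Kaggle PaySim dataset with TensorFlow).
rng = np.random.default_rng(seed=42)
n = 2000
amount = rng.exponential(scale=1000.0, size=n)
old_balance = rng.exponential(scale=5000.0, size=n)
# Label a transaction "fraud" more often when the amount exceeds the balance.
is_fraud = (amount > old_balance) & (rng.random(n) < 0.8)

X = np.column_stack([amount, old_balance])
y = is_fraud.astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train a boosted tree classifier and estimate fraud probabilities.
model = GradientBoostingClassifier(n_estimators=50, max_depth=3)
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the fraud class
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```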

Model Deployment to AI Platform

Once we have developed the model, we will deploy it to AI Platform for online prediction. This allows us to send transaction data to the hosted model and receive fraud probability predictions in near real time.
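A typical online prediction call sends a JSON body with an `instances` list to the AI Platform predict endpoint. The sketch below builds such a request body and shows, commented out, how it would be sent with the Google API client library; the project name, model name, and feature names are placeholders, not values from the tutorial.

```python
def build_predict_request(project: str, model: str, instances: list) -> dict:
    """Assemble the endpoint name and JSON body for an AI Platform
    online prediction call (placeholder project/model names)."""
    name = f"projects/{project}/models/{model}"
    return {"name": name, "body": {"instances": instances}}


request = build_predict_request(
    project="my-project",           # placeholder project ID
    model="fraud_detection_model",  # placeholder model name
    instances=[{"amount": 9839.64, "oldbalanceOrg": 170136.0}],
)

# Sending the request requires GCP credentials, so the call is shown
# here but not executed:
# from googleapiclient import discovery
# service = discovery.build("ml", "v1")
# response = service.projects().predict(
#     name=request["name"], body=request["body"]).execute()
# predictions = response["predictions"]

print(request["name"])
```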
