Creating a Boosted Tree Model to Predict Fraud in Financial Transactions

In today's world, fraud has become a significant issue in the financial sector. Every year, companies and individuals lose billions of dollars due to fraudulent activities. To combat this, machine learning can be used to detect fraud in financial transactions. In this blog post, we will walk you through the process of creating a boosted tree model to predict fraud in financial transactions using TensorFlow Enterprise 1.15 without GPUs.


Before we dive into the technical details, let's look at why a fraud detection model matters in the financial sector. Financial institutions have traditionally relied on rule-based systems that check transactions against specific, predefined patterns. These systems are not effective at detecting new types of fraud or suspicious activity that falls outside the predefined rules. This is where machine learning comes in. Machine learning models can analyze large volumes of data and identify patterns that are difficult for humans to detect, which makes them well suited to detecting fraudulent activity in financial transactions.


To get started with creating a fraud detection model, we first need to download a sample dataset of financial transactions. We will be using the fraud_data_kaggle.csv file for this tutorial. To download the file, run the following command in a notebook cell:


!gsutil cp gs://financial_fraud_detection/fraud_data_kaggle.csv .

 

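Once we have downloaded the data, we need to prepare it for training. The sample data is imbalanced: fraudulent transactions make up only a small fraction of the rows, so a model trained on the raw data could score well simply by predicting that nothing is fraud. We will correct this by downsampling, keeping every fraudulent row but only a 0.5% random sample of the non-fraudulent rows, and then splitting the result into a training set (80%) and a testing set (20%). To prepare the data, we will use the following code: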

 

import uuid
import itertools
import numpy as np
import pandas as pd
import os
import tensorflow as tf
import json
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
os.environ['TF_CPP_MIN_LOG_LEVEL']='3'
data = pd.read_csv('fraud_data_kaggle.csv')
# Split the data into 2 DataFrames
fraud = data[data['isFraud'] == 1]
not_fraud = data[data['isFraud'] == 0]
# Take a random sample of non fraud rows
not_fraud_sample = not_fraud.sample(random_state=2, frac=.005)
# Put it back together and shuffle
df = pd.concat([not_fraud_sample,fraud])
df = shuffle(df, random_state=2)
# Remove a few columns (isFraud is the label column we'll use, not isFlaggedFraud)
df = df.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'])
# Add transaction id to make it possible to map predictions to transactions
df['transactionId'] = [str(uuid.uuid4()) for _ in range(len(df.index))]
train_test_split = int(len(df) * .8)
# Split the dataset for training and testing
# .copy() gives independent DataFrames, so pop() below does not act on a slice of df
train_set = df[:train_test_split].copy()
test_set = df[train_test_split:].copy()
train_labels = train_set.pop('isFraud')
test_labels = test_set.pop('isFraud')
train_set.head()


After the code completes, it displays the first five rows of the processed training data.
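
To see what the downsampling step does to the class balance, here is a minimal sketch in plain Python. It is not the tutorial's pandas code: the `downsample` and `fraud_rate` helpers and the row counts are made up for illustration, and only the `isFraud` label name and the 0.005 fraction come from the code above.

```python
import random

def downsample(rows, label_key, frac, seed=2):
    """Keep every positive row and a random fraction of the negative rows."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n_keep = round(len(negatives) * frac)
    combined = positives + rng.sample(negatives, n_keep)
    rng.shuffle(combined)  # mix the classes, as shuffle(df) does above
    return combined

def fraud_rate(rows, label_key="isFraud"):
    """Fraction of rows labelled as fraud."""
    return sum(r[label_key] for r in rows) / len(rows)

# Synthetic stand-in data (hypothetical counts, not the real dataset):
# 100,000 legitimate rows and 100 fraudulent ones.
rows = [{"isFraud": 0} for _ in range(100_000)] + [{"isFraud": 1} for _ in range(100)]
balanced = downsample(rows, "isFraud", frac=0.005)

print(f"fraud rate before: {fraud_rate(rows):.4f}")      # 0.0010
print(f"fraud rate after:  {fraud_rate(balanced):.4f}")  # 0.1667
```

With these made-up counts, fraud goes from about 0.1% of the rows to about 17%, which gives the model far more signal per training example.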

Now that we have prepared the data, we can create and train our model. We will be using a boosted tree model for this tutorial. Boosted trees are an ensemble learning method that combines multiple decision trees to improve predictive accuracy. To create and train the model, we will use the following code:

 

# Define features
fc = tf.feature_column
CATEGORICAL_COLUMNS = ['type']
NUMERIC_COLUMNS = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
KEY_COLUMN = 'transactionId'
def one_hot_cat_column(feature_name, vocab):
    return fc.indicator_column(
        fc.categorical_column_with_vocabulary_list(feature_name, vocab))
feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
    vocabulary = train_set[feature_name].unique()
    feature_columns.append(one_hot_cat_column(feature_name, vocabulary))
for feature_name in NUMERIC_COLUMNS:
    feature_columns.append(fc.numeric_column(feature_name, dtype=tf.float32))
# Define training and evaluation input functions
NUM_EXAMPLES = len(train_labels)
def make_input_fn(X, y, n_epochs=None, shuffle=True):
  def input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
    if shuffle:
      dataset = dataset.shuffle(NUM_EXAMPLES)
    dataset = dataset.repeat(n_epochs)
    dataset = dataset.batch(NUM_EXAMPLES)
    return dataset
  return input_fn
train_input_fn = make_input_fn(train_set, train_labels)
eval_input_fn = make_input_fn(test_set, test_labels, shuffle=False, n_epochs=1)
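# Define the model
# Each input batch holds the whole training set, so one batch covers a layer
n_batches = 1
model = tf.estimator.BoostedTreesClassifier(feature_columns,
                                          n_batches_per_layer=n_batches)
# Pass transactionId through to the prediction output so that each
# prediction can be matched to its transaction
model = tf.contrib.estimator.forward_features(model, KEY_COLUMN)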
# Train the model
model.train(train_input_fn, max_steps=100)
# Get metrics to evaluate the model's performance
result = model.evaluate(eval_input_fn)
print(pd.Series(result))

 

After the code completes, it prints a set of evaluation metrics for the model. Accuracy and AUC should both be around 99%, although the exact values can vary between runs.
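
To make the idea of boosting more concrete, the following sketch fits a sequence of one-split trees (stumps), each trained on the errors left by the trees before it. This is a deliberately simplified illustration and not the algorithm inside `BoostedTreesClassifier`: it uses squared error on a single feature, and the helper names, the toy data, and the learning rate are all assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must put at least one point on each side
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_trees, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    stumps = []
    predictions = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return stumps, predictions

def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

# Toy data (hypothetical): one numeric feature and a 0/1 label.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 1, 0, 1, 1]

_, preds_1 = boost(xs, ys, n_trees=1)
_, preds_20 = boost(xs, ys, n_trees=20)
print("training MSE with 1 tree:  ", mse(ys, preds_1))
print("training MSE with 20 trees:", mse(ys, preds_20))
```

Each added stump corrects part of the remaining error, so the training error with 20 trees is lower than with one. The real estimator applies the same principle with deeper trees, many features, and a classification loss.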

Once we have trained our model, we need to test it to verify that it labels the fraudulent transactions correctly. We will use the testing set for this purpose. To test the model, we will use the following code:


 

pred_dicts = list(model.predict(eval_input_fn))
probabilities = pd.Series([pred['logistic'][0] for pred in pred_dicts])
for i,val in enumerate(probabilities[:30]):
  print('Predicted: ', round(val), 'Actual: ', test_labels.iloc[i])
  print() 
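
Comparing predictions row by row only goes so far. A confusion matrix summarises the same comparison across the whole test set. The sketch below computes the four counts plus precision and recall in plain Python; the helper names, the toy labels and scores, and the 0.5 threshold are assumptions for illustration rather than part of the tutorial.

```python
def confusion_counts(actual, probabilities, threshold=0.5):
    """Count true/false positives and negatives at a given score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, prob in zip(actual, probabilities):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            counts["tp"] += 1
        elif predicted == 1 and label == 0:
            counts["fp"] += 1
        elif predicted == 0 and label == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

def precision_recall(counts):
    """Precision: share of flagged rows that are fraud. Recall: share of fraud that is flagged."""
    flagged = counts["tp"] + counts["fp"]
    fraud = counts["tp"] + counts["fn"]
    precision = counts["tp"] / flagged if flagged else 0.0
    recall = counts["tp"] / fraud if fraud else 0.0
    return precision, recall

# Toy values (hypothetical): six labels and six predicted fraud scores.
actual = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.7, 0.1, 0.8]

counts = confusion_counts(actual, scores)
precision, recall = precision_recall(counts)
print(counts)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With the real model, `actual` would be `test_labels` and `scores` would be the `probabilities` series computed above. For fraud detection, recall is often the number to watch, because a missed fraudulent transaction usually costs more than a false alarm.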

 

After testing the model, we can export it to Cloud Storage in the form of a SavedModel. We will use the following code to export the model:

 


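# Replace these placeholders with your own project ID and bucket.
# Cloud Storage bucket names must be globally unique and may not contain uppercase letters.
GCP_PROJECT = 'myProject'
MODEL_BUCKET = 'gs://myProject-bucket'
!gsutil mb $MODEL_BUCKET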
def json_serving_input_fn():
    feature_placeholders = {
        'type': tf.placeholder(tf.string, [None]),
        'step': tf.placeholder(tf.float32, [None]),
        'amount': tf.placeholder(tf.float32, [None]),
        'oldbalanceOrg': tf.placeholder(tf.float32, [None]),
        'newbalanceOrig': tf.placeholder(tf.float32, [None]),
        'oldbalanceDest': tf.placeholder(tf.float32, [None]),
        'newbalanceDest': tf.placeholder(tf.float32, [None]),
        KEY_COLUMN: tf.placeholder_with_default(tf.constant(['nokey']), [None])
    }
    features = {key: tf.expand_dims(tensor, -1)
                for key, tensor in feature_placeholders.items()}
    return tf.estimator.export.ServingInputReceiver(features,feature_placeholders)
export_path = model.export_saved_model(
    MODEL_BUCKET + '/explanations-with-key',
    serving_input_receiver_fn=json_serving_input_fn
).decode('utf-8')
!saved_model_cli show --dir $export_path --all 
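
The serving input function above defines what a prediction request must contain: one string for `type`, one number for each numeric feature, and an optional `transactionId` key. As a rough sketch, a single request instance could be built and serialised like this. The `make_instance` helper and the transaction values are hypothetical; only the field names come from the code above.

```python
import json

# Feature names and Python types matching the placeholders in json_serving_input_fn
FEATURE_TYPES = {
    "type": str,
    "step": float,
    "amount": float,
    "oldbalanceOrg": float,
    "newbalanceOrig": float,
    "oldbalanceDest": float,
    "newbalanceDest": float,
}

def make_instance(transaction, key="nokey"):
    """Build one JSON-serialisable instance from a transaction dict."""
    instance = {name: cast(transaction[name]) for name, cast in FEATURE_TYPES.items()}
    instance["transactionId"] = key
    return instance

# Illustrative values only, not a row from the dataset
transaction = {
    "type": "TRANSFER",
    "step": 1,
    "amount": 181.0,
    "oldbalanceOrg": 181.0,
    "newbalanceOrig": 0.0,
    "oldbalanceDest": 0.0,
    "newbalanceDest": 0.0,
}

# One JSON object per line is the layout expected by, for example,
# `gcloud ai-platform predict --json-instances`
line = json.dumps(make_instance(transaction, key="example-key-1"))
print(line)
```

Casting every numeric field to `float` mirrors the `tf.float32` placeholders, so an integer such as `step` is sent in the form the model expects.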

 

Now that we have our model exported to Cloud Storage, we can deploy it for predictions. We will use AI Platform for this purpose. To deploy the model, we will use the following code:

 


MODEL = 'fraud_detection_with_key'
!gcloud ai-platform models create $MODEL
VERSION = 'v1'
!gcloud beta ai-platform versions create $VERSION \
--model $MODEL \
--origin $export_path \
--runtime-version 1.15 \
--framework TENSORFLOW \
--python-version 3.7 \
--machine-type n1-standard-4 \
--num-paths 10
!gcloud ai-platform versions describe $VERSION --model $MODEL 

 

It takes a few minutes for the model version to be created.
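
Because the model forwards `transactionId`, each prediction can be matched back to the transaction that produced it, even if results arrive out of order. The sketch below shows that join in plain Python. The response shape is an assumption (each prediction carrying a `transactionId` and a `logistic` score, possibly wrapped in single-element lists), and the `fraud_score` and `predicted_fraud` field names are hypothetical.

```python
def first(value):
    """Unwrap a single-element list; return other values unchanged."""
    if isinstance(value, list) and len(value) == 1:
        return value[0]
    return value

def join_predictions(transactions, predictions, threshold=0.5):
    """Attach each prediction to its transaction using the forwarded key."""
    by_key = {t["transactionId"]: t for t in transactions}
    rows = []
    for pred in predictions:
        key = first(pred["transactionId"])
        score = float(first(pred["logistic"]))
        row = dict(by_key[key])  # copy so the input is not modified
        row["fraud_score"] = score
        row["predicted_fraud"] = int(score >= threshold)
        rows.append(row)
    return rows

# Illustrative inputs only (assumed shapes, made-up values)
transactions = [
    {"transactionId": "key-a", "type": "TRANSFER", "amount": 181.0},
    {"transactionId": "key-b", "type": "PAYMENT", "amount": 9839.64},
]
predictions = [
    {"transactionId": ["key-b"], "logistic": [0.02]},
    {"transactionId": ["key-a"], "logistic": [0.97]},
]

for row in join_predictions(transactions, predictions):
    print(row)
```

Joining on the key rather than on position keeps the mapping correct when predictions are batched or returned in a different order from the requests.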

Finally, we will create a Dataflow pipeline that reads financial transaction data, requests fraud prediction information for each transaction from the AI Platform model, and then writes both transaction and fraud prediction data to BigQuery for analysis. To create the pipeline, we need to create the BigQuery dataset and tables and the Pub/Sub topic and subscription. We will use the following SQL statements to create the transactions and fraud_prediction tables:

 

CREATE OR REPLACE TABLE fraud_detection.transactions (
   step INT64,
   nameOrig STRING,
   nameDest STRING,
   isFlaggedFraud INT64,
   isFraud INT64,
   type STRING
 
