Anomaly Detection with Salesforce Merlion Package - Unsupervised anomaly detection

Reference:

  • github: https://github.com/salesforce/Merlion

Notes on installing Merlion:

  • according to the GitHub page, running pip install salesforce-merlion should be sufficient to install the Merlion package.
  • however, on a Windows machine, an error can occur because of Merlion's dependency on the fbprophet package.
  • to install Merlion successfully, we first need to install the fbprophet package. This Stack Overflow page provides useful tricks for fixing issues with installing fbprophet on a Windows machine.
  • what did not work for me: first run pip install pystan==2.18.0.0, then run pip install fbprophet.
  • what worked for me:
    • first run pip install pystan==2.17.1.0. This step uninstalls whatever version of pystan is on the machine and installs the version specified in the pip command.
    • then run pip install fbprophet. This step retrieves the latest pystan version, uninstalls the version installed in the previous step, and installs the latest version. The success message reads: Successfully installed cmdstanpy-0.9.68 prophet-1.0.1 pystan-2.19.1.1.
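
After working through the install steps above, it can help to confirm which versions actually ended up in the environment. The snippet below is a minimal sketch using only the standard library; the helper name installed_versions is made up for illustration, and the package names are simply the ones that appear in the pip commands and success message above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names taken from the pip commands and success message above.
report = installed_versions(["pystan", "fbprophet", "prophet", "salesforce-merlion"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

If pystan shows a version other than the one expected from the last install step, that is a hint that a later pip command replaced it.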

Steps

  1. download market data using yfinance: download the S&P 500 index ('^GSPC')
  2. calculate the 20-day max return (i.e. the target in a supervised learning problem):
    • for each date (T):
      • calculate the max price change over the next 20 trading days: price_change = (max{close price on T+1 to T+20} - {close price on T})/({close price on T})
  3. use Merlion to do unsupervised anomaly detection
    1. Initializing an anomaly detection model (including ensembles)
    2. Training the model
    3. Producing a series of anomaly scores with the model
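
The price_change formula in step 2 can be sketched on a toy series. This is an illustration only: the function name forward_max_return and the toy prices are made up, and a 3-day window is used instead of 20 so the result can be checked by hand. The result is scaled by 100 to give a percentage.

```python
import pandas as pd


def forward_max_return(close: pd.Series, window: int = 20) -> pd.Series:
    """Percent change from close[T] to the max close over T+1 .. T+window."""
    # rolling(window).max() at position i covers i-window+1 .. i;
    # shifting by -window re-labels it so position T covers T+1 .. T+window.
    next_max = close.rolling(window).max().shift(-window)
    return 100 * (next_max - close) / close


# Toy prices and a 3-day window so the numbers are easy to verify.
close = pd.Series([100.0, 102.0, 101.0, 105.0, 103.0, 99.0])
target = forward_max_return(close, window=3)
print(target)
```

The last `window` rows come out as NaN because they have no complete future window, so they would need to be dropped before modelling.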
import numpy as np
import pandas as pd
import statsmodels.api as sm

from datetime import datetime, timedelta
import yfinance as yf #to download stock price data
import matplotlib.pyplot as plt

from merlion.plot import plot_anoms
from merlion.utils import TimeSeries
np.random.seed(5678)

download S&P 500 price data

ticker = '^GSPC'
cur_data = yf.Ticker(ticker)
hist = cur_data.history(period="max")
print(ticker, hist.shape, hist.index.min())
^GSPC (19720, 7) 1927-12-30 00:00:00 
df=hist[hist.index>='2000-01-01'].copy(deep=True)
df.head()
Open High Low Close Volume Dividends Stock Splits
Date
2000-01-03 1469.250000 1478.000000 1438.359985 1455.219971 931800000 0 0
2000-01-04 1455.219971 1455.219971 1397.430054 1399.420044 1009000000 0 0
2000-01-05 1399.420044 1413.270020 1377.680054 1402.109985 1085500000 0 0
2000-01-06 1402.109985 1411.900024 1392.099976 1403.449951 1092300000 0 0
2000-01-07 1403.449951 1441.469971 1400.729980 1441.469971 1225200000 0 0

calculate max return in next 20 trading days

# for each date, get the max close over a 20-trading-day window
price_col = 'Close'
roll_len=20
new_col = 'next_20day_max'
target_list = []

df.sort_index(ascending=True, inplace=True)
df.head(3)
Open High Low Close Volume Dividends Stock Splits
Date
2000-01-03 1469.250000 1478.000000 1438.359985 1455.219971 931800000 0 0
2000-01-04 1455.219971 1455.219971 1397.430054 1399.420044 1009000000 0 0
2000-01-05 1399.420044 1413.270020 1377.680054 1402.109985 1085500000 0 0
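# NOTE: shift(1).rolling(roll_len) looks BACK over T-20..T-1, not forward; the outputs below reflect that.
# For the forward window in the steps, use df[[price_col]].rolling(roll_len).max().shift(-roll_len).
df_next20dmax=df[[price_col]].shift(1).rolling(roll_len).max()
df_next20dmax.columns=[new_col]
df = df.merge(df_next20dmax, right_index=True, left_index=True, how='inner')  # re-running this cell adds the _x/_y columns seen below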

df.dropna(how='any', inplace=True)
df['target']= 100*(df[new_col]-df[price_col])/df[price_col]  
df.head(3)
Open High Low Close Volume Dividends Stock Splits next_20day_max_x target next_20day_max_y next_20day_max
Date
2000-03-29 1507.729980 1521.449951 1497.449951 1508.520020 1061900000 0 0 1527.459961 1.255531 1527.459961 1527.459961
2000-03-30 1508.520020 1517.380005 1474.630005 1487.920044 1193400000 0 0 1527.459961 2.657395 1527.459961 1527.459961
2000-03-31 1487.920044 1519.810059 1484.380005 1498.579956 1227400000 0 0 1527.459961 1.927158 1527.459961 1527.459961
df['target'].plot.line(figsize=(12,5))
<AxesSubplot:xlabel='Date'>

[Figure: line plot of target by date]

df['target'].hist(bins=100)
<AxesSubplot:>

[Figure: histogram of target, 100 bins]

Merlion: Anomaly detection - unsupervised with Isolation Forest

train_data = TimeSeries.from_pd(df[['target']].iloc[:-200])
test_data = TimeSeries.from_pd(df[['target']].iloc[-200:])
# Import models & configs
from merlion.models.anomaly.isolation_forest import IsolationForest, IsolationForestConfig


# isolation forest
iso_forest_config = IsolationForestConfig()
iso_forest_model  = IsolationForest(iso_forest_config)
iso_forest_train_score = iso_forest_model.train(train_data=train_data, anomaly_labels=None)
iso_forest_train_score.to_pd().plot.line(figsize=(12,5))
<AxesSubplot:>

[Figure: line plot of Isolation Forest anomaly scores on the training data]

  • Model Inference
    • model.get_anomaly_score() returns the model's raw anomaly scores.
    • model.get_anomaly_label() returns the model's post-processed anomaly scores. The post-processing calibrates the anomaly scores so they can be interpreted as z-scores, and it sparsifies them so that any nonzero value should be treated as an alert that the timestamp is anomalous.
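
To make the difference between raw scores and post-processed labels concrete, here is a simplified stand-in for the idea of "calibrate to z-scores, then sparsify". It is not Merlion's actual post-processing: the function to_labels, the plain mean/standard-deviation standardization, the threshold of 3.0, and the scores themselves are all assumptions made for illustration.

```python
import numpy as np


def to_labels(train_scores, test_scores, threshold=3.0):
    """Standardize test scores against the training scores, then zero out
    every value whose absolute z-score is below `threshold`."""
    train_scores = np.asarray(train_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)
    z = (test_scores - train_scores.mean()) / train_scores.std()
    return np.where(np.abs(z) >= threshold, z, 0.0)


rng = np.random.default_rng(0)
train = rng.normal(loc=0.5, scale=0.1, size=1000)  # made-up raw training scores
test = np.array([0.48, 0.52, 1.20, 0.50])          # third value is far from typical
labels = to_labels(train, test)
print(labels)
```

Only the third test point survives the threshold; the others are set to zero, which mirrors how a label series of mostly zeros should be read.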
test_scores = iso_forest_model.get_anomaly_score(test_data)
test_scores_df = test_scores.to_pd()

test_labels = iso_forest_model.get_anomaly_label(test_data)
test_labels_df = test_labels.to_pd()
test_scores_df.plot.line(figsize=(12,5))
<AxesSubplot:>

[Figure: line plot of Isolation Forest anomaly scores on the test data]

test_labels_df.value_counts()
anom_score
0.0           199
dtype: int64
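
The counts above show only zeros, i.e. no alerts were raised in the test window. When a label series does contain nonzero values, the flagged timestamps can be pulled out with a simple filter. The sketch below uses a made-up frame with the same single anom_score column; the dates and values are invented for illustration.

```python
import pandas as pd

# Made-up label frame with the same single "anom_score" column as above.
idx = pd.date_range("2021-01-04", periods=5, freq="D")
labels_df = pd.DataFrame({"anom_score": [0.0, 0.0, 3.7, 0.0, -4.1]}, index=idx)

# Nonzero entries are the alerts; zeros mean nothing was flagged.
alerts = labels_df[labels_df["anom_score"] != 0]
print(f"{len(alerts)} alert(s) out of {len(labels_df)} points")
print(alerts)
```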