anomaly-detection-with-stats

Table of Contents

Detection With Stats

Anomaly Detection with Statistics

Using Standard Deviation

Taking this series of numbers:

2, 3, 5, 2, 3, 12, 5, 3, 4

We can see that the outlier is 12. Every other number is between 2 and 5.

We can find a good center point by finding the average of the array:

SELECT avg(n)
FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n;

       avg
────────────────────
4.3333333333333333

We can also use the median here, since it's less weighted by extremes:

SELECT percentile_disc(0.5) within group(order by n)
FROM unnest(ARRAY[2, 3, 5, 2, 3, 120, 5, 3, 4]) as n;

 median
────────
      3

Next, we can find the standard deviation:

SELECT stddev(n)
FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n;

      stddev
────────────────────
3.0822070014844882

We can find out which numbers are outside the stddev and avg like so:

WITH series AS (
   SELECT *
   FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n
),
bounds AS (
   SELECT
       avg(n) - stddev(n) AS lower_bound,
       avg(n) + stddev(n) AS upper_bound
   FROM
       series
)
SELECT
   n,
   n NOT BETWEEN lower_bound AND upper_bound AS is_anomaly
FROM
   series,
   bounds;

n  │ is_anomaly
───┼────────────
 2 │ f
 3 │ f
 5 │ f
 2 │ f
 3 │ f
12 │ t
 5 │ f
 3 │ f
 4 │ f

We can also use the MAD, or Median absolute deviation to find anomalies instead of an average + stddev. MAD calculates the amount of medians

Consider the dataset here:

1, 1, 2, 2, 4, 6, 9

This has a median of 2:

The absolute deviations from 2 are as follows:

0, 0, 1, 1, 2, 4, 7

We see that the median of these values is again 1. The MAD for this data is 1, so we see that 4 is 2 deviations away, and 9 is 7 deviations away.

Using Z-Score

We can use Z-Score to do this as well:

Z-Score represents the amount of Standard Deviations away from the mean a certain item is:

WITH series AS (
   SELECT *
   FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n
),
stats AS (
   SELECT
       avg(n) series_mean,
       stddev(n) as series_stddev
   FROM
       series
)
SELECT
   n,
   (n - series_mean) / series_stddev as zscore
FROM
   series,
   stats;

n  │         zscore
───┼─────────────────────────
 2-0.75703329861022517346
 3-0.43259045634870009448
 50.21629522817435006346
 2-0.75703329861022517346
 3-0.43259045634870009448
122.4873951240050256
 50.21629522817435006346
 3-0.43259045634870009448
 4-0.10814761408717501551

As we can see, 12 is more than 1 stddev away from the mean, so we know it's anomalous.

We can change this a bit to see if something is anomalous in .5 stddev, 1 stddev, and 3 stddev.

WITH series AS (
   SELECT *
   FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n
),
stats AS (
   SELECT
       avg(n) series_avg,
       stddev(n) as series_stddev
   FROM
       series
),
zscores AS (
   SELECT
       n,
       (n - series_avg) / series_stddev AS zscore
   FROM
       series,
       stats
)
SELECT
   *,
   zscore NOT BETWEEN -0.5 AND 0.5 AS is_anomaly_0_5,
   zscore NOT BETWEEN -1 AND 1 AS is_anomaly_1,
   zscore NOT BETWEEN -3 AND 3 AS is_anomaly_3
FROM
   zscores;
n  │         zscore          │ is_anomaly_0_5 │ is_anomaly_1 │ is_anomaly_3
───┼─────────────────────────┼────────────────┼──────────────┼──────────────
 2-0.75703329861022517346 │ t              │ f            │ f
 3-0.43259045634870009448 │ f              │ f            │ f
 50.21629522817435006346 │ f              │ f            │ f
 2-0.75703329861022517346 │ t              │ f            │ f
 3-0.43259045634870009448 │ f              │ f            │ f
122.4873951240050256 │ t              │ t            │ f
 50.21629522817435006346 │ f              │ f            │ f
 3-0.43259045634870009448 │ f              │ f            │ f
 4-0.10814761408717501551 │ f              │ f            │ f

Let's say we want to check if we have a server problem with pure SQL statistics:

500 => problem on the server? 400 => problem with clients? 404 => problem with SEO or redirects? 200 => significant traffic or DOS attack?

Create a table to store statistics:

CREATE TABLE server_log_summary AS (
   period timestamptz,
   status_code int,
   entries int
);

We can create an axis to join to with a CTE (Common Table Expression)

-- Correct!
WITH axis AS (
   SELECT
       status_code,
       generate_series(
           date_trunc('minute', now()),
           date_trunc('minute', now() - interval '1 hour'),
           interval '1 minute' * -1
       ) AS period
   FROM (
       VALUES (200), (400), (404), (500)
   ) AS t(status_code)
)
SELECT
   a.period,
   a.status_code,
   count(*) AS entries
FROM
   axis a
   LEFT JOIN server_log l ON (
       date_trunc('minute', l.timestamp) = a.period
       AND l.status_code = a.status_code
   )
GROUP BY
   period,
   status_code;

We might try to use the Z-Score to find out if we have an anomaly:

db= WITH stats AS (
   SELECT
       status_code,
       (MAX(ARRAY[EXTRACT('epoch' FROM period), entries]))[2] AS last_value,
       AVG(entries) AS mean_entries,
       STDDEV(entries) AS stddev_entries
   FROM
       server_log_summary
   WHERE
       -- In the demo data use:
       -- period > '2020-08-01 17:00 UTC'::timestamptz
       period > now() - interval '1 hour'
   GROUP BY
       status_code
)
SELECT * FROM stats;

status_code │ last_value │      mean_entries      │     stddev_entries
────────────┼────────────┼────────────────────────┼────────────────────────
        40400.133333333333333333330.34280333180088158345
        50000.150000000000000000000.36008473579027553993
        20040842779.1000000000000000689.219644702665
        400240.733333333333333333333.4388935285299212

We then calculate if this is anomalous by checking the difference between the previous value:

db= WITH stats AS (
   SELECT
       status_code,
       (MAX(ARRAY[EXTRACT('epoch' FROM period), entries]))[2] AS last_value,
       AVG(entries) AS mean_entries,
       STDDEV(entries) AS stddev_entries
   FROM
       server_log_summary
   WHERE
       -- In the demo data use:
       -- period > '2020-08-01 17:00 UTC'::timestamptz
       period > now() - interval '1 hour'
   GROUP BY
       status_code
)
SELECT
   *,
   (last_value - mean_entries) / NULLIF(stddev_entries::float, 0) as zscore
FROM
   stats;

status_code │ last_value │ mean_entries │ stddev_entries │  zscore
────────────┼────────────┼──────────────┼────────────────┼────────
        40400.1330.3428-0.388
        50000.1500.3600-0.416
        20040842779.100689.21961.893
        400240.7333.43886.765

Clearly the 400s are anomalous, and also, we seem to have a bunch more 200s than normal.