Detection With Stats
Anomaly Detection with Statistics
Using Standard Deviation
Taking this series of numbers:
2, 3, 5, 2, 3, 12, 5, 3, 4
We can see that the outlier is 12. Every other number is between 2 and 5.
We can find a good center point by finding the average of the array:
SELECT avg(n)
FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n;
 
       avg
────────────────────
4.3333333333333333We can also use the median here, since it’s less weighted by extremes:
SELECT percentile_disc(0.5) within group(order by n)
FROM unnest(ARRAY[2, 3, 5, 2, 3, 120, 5, 3, 4]) as n;
 
 median
────────
      3Next, we can find the standard deviation:
SELECT stddev(n)
FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n;
 
      stddev
────────────────────
3.0822070014844882We can find out which numbers are outside the stddev and avg like so:
WITH series AS (
   SELECT *
   FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n
),
bounds AS (
   SELECT
       avg(n) - stddev(n) AS lower_bound,
       avg(n) + stddev(n) AS upper_bound
   FROM
       series
)
SELECT
   n,
   n NOT BETWEEN lower_bound AND upper_bound AS is_anomaly
FROM
   series,
   bounds;
 
n  │ is_anomaly
───┼────────────
 2 │ f
 3 │ f
 5 │ f
 2 │ f
 3 │ f
12 │ t
 5 │ f
 3 │ f
 4 │ fWe can also use the MAD, or Median absolute deviation to find
anomalies instead of an average + stddev. MAD calculates the amount of
medians
Consider the dataset here:
1, 1, 2, 2, 4, 6, 9
This has a median of 2:
The absolute deviations from 2 are as follows:
0, 0, 1, 1, 2, 4, 7
We see that the median of these values is again 1. The MAD for this
data is 1, so we see that 4 is 2 deviations away, and 9 is 7 deviations
away.
Using Z-Score
We can use Z-Score to do this as well:
Z-Score represents the amount of Standard Deviations away from the mean a certain item is:
WITH series AS (
   SELECT *
   FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n
),
stats AS (
   SELECT
       avg(n) series_mean,
       stddev(n) as series_stddev
   FROM
       series
)
SELECT
   n,
   (n - series_mean) / series_stddev as zscore
FROM
   series,
   stats;
 
n  │         zscore
───┼─────────────────────────
 2 │ -0.75703329861022517346
 3 │ -0.43259045634870009448
 5 │  0.21629522817435006346
 2 │ -0.75703329861022517346
 3 │ -0.43259045634870009448
12 │      2.4873951240050256
 5 │  0.21629522817435006346
 3 │ -0.43259045634870009448
 4 │ -0.10814761408717501551As we can see, 12 is more than 1 stddev away from the mean, so we know it’s anomalous.
We can change this a bit to see if something is anomalous in .5 stddev, 1 stddev, and 3 stddev.
WITH series AS (
   SELECT *
   FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n
),
stats AS (
   SELECT
       avg(n) series_avg,
       stddev(n) as series_stddev
   FROM
       series
),
zscores AS (
   SELECT
       n,
       (n - series_avg) / series_stddev AS zscore
   FROM
       series,
       stats
)
SELECT
   *,
   zscore NOT BETWEEN -0.5 AND 0.5 AS is_anomaly_0_5,
   zscore NOT BETWEEN -1 AND 1 AS is_anomaly_1,
   zscore NOT BETWEEN -3 AND 3 AS is_anomaly_3
FROM
   zscores;
n  │         zscore          │ is_anomaly_0_5 │ is_anomaly_1 │ is_anomaly_3
───┼─────────────────────────┼────────────────┼──────────────┼──────────────
 2 │ -0.75703329861022517346 │ t              │ f            │ f
 3 │ -0.43259045634870009448 │ f              │ f            │ f
 5 │  0.21629522817435006346 │ f              │ f            │ f
 2 │ -0.75703329861022517346 │ t              │ f            │ f
 3 │ -0.43259045634870009448 │ f              │ f            │ f
12 │      2.4873951240050256 │ t              │ t            │ f
 5 │  0.21629522817435006346 │ f              │ f            │ f
 3 │ -0.43259045634870009448 │ f              │ f            │ f
 4 │ -0.10814761408717501551 │ f              │ f            │ fLet’s say we want to check if we have a server problem with pure SQL statistics:
500 ⇒ problem on the server? 400 ⇒ problem with clients? 404 ⇒ problem with SEO or redirects? 200 ⇒ significant traffic or DOS attack?
Create a table to store statistics:
CREATE TABLE server_log_summary AS (
   period timestamptz,
   status_code int,
   entries int
);We can create an axis to join to with a CTE (Common Table Expression)
-- Correct!
WITH axis AS (
   SELECT
       status_code,
       generate_series(
           date_trunc('minute', now()),
           date_trunc('minute', now() - interval '1 hour'),
           interval '1 minute' * -1
       ) AS period
   FROM (
       VALUES (200), (400), (404), (500)
   ) AS t(status_code)
)
SELECT
   a.period,
   a.status_code,
   count(*) AS entries
FROM
   axis a
   LEFT JOIN server_log l ON (
       date_trunc('minute', l.timestamp) = a.period
       AND l.status_code = a.status_code
   )
GROUP BY
   period,
   status_code;We might try to use the Z-Score to find out if we have an anomaly:
db= WITH stats AS (
   SELECT
       status_code,
       (MAX(ARRAY[EXTRACT('epoch' FROM period), entries]))[2] AS last_value,
       AVG(entries) AS mean_entries,
       STDDEV(entries) AS stddev_entries
   FROM
       server_log_summary
   WHERE
       -- In the demo data use:
       -- period > '2020-08-01 17:00 UTC'::timestamptz
       period > now() - interval '1 hour'
   GROUP BY
       status_code
)
SELECT * FROM stats;
 
status_code │ last_value │      mean_entries      │     stddev_entries
────────────┼────────────┼────────────────────────┼────────────────────────
        404 │          0 │ 0.13333333333333333333 │ 0.34280333180088158345
        500 │          0 │ 0.15000000000000000000 │ 0.36008473579027553993
        200 │       4084 │  2779.1000000000000000 │       689.219644702665
        400 │         24 │ 0.73333333333333333333 │     3.4388935285299212We then calculate if this is anomalous by checking the difference between the previous value:
db= WITH stats AS (
   SELECT
       status_code,
       (MAX(ARRAY[EXTRACT('epoch' FROM period), entries]))[2] AS last_value,
       AVG(entries) AS mean_entries,
       STDDEV(entries) AS stddev_entries
   FROM
       server_log_summary
   WHERE
       -- In the demo data use:
       -- period > '2020-08-01 17:00 UTC'::timestamptz
       period > now() - interval '1 hour'
   GROUP BY
       status_code
)
SELECT
   *,
   (last_value - mean_entries) / NULLIF(stddev_entries::float, 0) as zscore
FROM
   stats;
 
status_code │ last_value │ mean_entries │ stddev_entries │  zscore
────────────┼────────────┼──────────────┼────────────────┼────────
        404 │          0 │ 0.133        │ 0.3428         │ -0.388
        500 │          0 │ 0.150        │ 0.3600         │ -0.416
        200 │       4084 │ 2779.100     │ 689.2196       │  1.893
        400 │         24 │ 0.733        │ 3.4388         │  6.765Clearly the 400s are anomalous, and also, we seem to have a bunch more 200s than normal.