Detection With Stats
Taking this series of numbers:
2, 3, 5, 2, 3, 12, 5, 3, 4
We can see that the outlier is 12. Every other number is between 2 and 5.
We can find a good center point by finding the average of the array:
SELECT avg(n)
FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n;
avg
────────────────────4.3333333333333333
We can also use the median here, since it's less weighted by extremes:
SELECT percentile_disc(0.5) within group(order by n)
FROM unnest(ARRAY[2, 3, 5, 2, 3, 120, 5, 3, 4]) as n;
median
────────3
Next, we can find the standard deviation:
SELECT stddev(n)
FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n;
stddev
────────────────────3.0822070014844882
We can find out which numbers are outside the stddev and avg like so:
WITH series AS (
SELECT *
FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n
),AS (
bounds SELECT
avg(n) - stddev(n) AS lower_bound,
avg(n) + stddev(n) AS upper_bound
FROM
series
)SELECT
n,NOT BETWEEN lower_bound AND upper_bound AS is_anomaly
n FROM
series,
bounds;
n │ is_anomaly
───┼────────────2 │ f
3 │ f
5 │ f
2 │ f
3 │ f
12 │ t
5 │ f
3 │ f
4 │ f
We can also use the MAD
, or Median absolute deviation to
find anomalies instead of an average + stddev. MAD
calculates the amount of medians
Consider the dataset here:
1, 1, 2, 2, 4, 6, 9
This has a median of 2:
The absolute deviations from 2 are as follows:
0, 0, 1, 1, 2, 4, 7
We see that the median of these values is again 1. The
MAD
for this data is 1, so we see that 4 is 2 deviations
away, and 9 is 7 deviations away.
We can use Z-Score to do this as well:
Z-Score represents the amount of Standard Deviations away from the mean a certain item is:
WITH series AS (
SELECT *
FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n
),AS (
stats SELECT
avg(n) series_mean,
stddev(n) as series_stddev
FROM
series
)SELECT
n,- series_mean) / series_stddev as zscore
(n FROM
series,
stats;
n │ zscore
───┼─────────────────────────2 │ -0.75703329861022517346
3 │ -0.43259045634870009448
5 │ 0.21629522817435006346
2 │ -0.75703329861022517346
3 │ -0.43259045634870009448
12 │ 2.4873951240050256
5 │ 0.21629522817435006346
3 │ -0.43259045634870009448
4 │ -0.10814761408717501551
As we can see, 12 is more than 1 stddev away from the mean, so we know it's anomalous.
We can change this a bit to see if something is anomalous in .5 stddev, 1 stddev, and 3 stddev.
WITH series AS (
SELECT *
FROM unnest(array[2, 3, 5, 2, 3, 12, 5, 3, 4]) AS n
),AS (
stats SELECT
avg(n) series_avg,
stddev(n) as series_stddev
FROM
series
),AS (
zscores SELECT
n,- series_avg) / series_stddev AS zscore
(n FROM
series,
stats
)SELECT
*,
NOT BETWEEN -0.5 AND 0.5 AS is_anomaly_0_5,
zscore NOT BETWEEN -1 AND 1 AS is_anomaly_1,
zscore NOT BETWEEN -3 AND 3 AS is_anomaly_3
zscore FROM
zscores;
n │ zscore │ is_anomaly_0_5 │ is_anomaly_1 │ is_anomaly_3
───┼─────────────────────────┼────────────────┼──────────────┼──────────────2 │ -0.75703329861022517346 │ t │ f │ f
3 │ -0.43259045634870009448 │ f │ f │ f
5 │ 0.21629522817435006346 │ f │ f │ f
2 │ -0.75703329861022517346 │ t │ f │ f
3 │ -0.43259045634870009448 │ f │ f │ f
12 │ 2.4873951240050256 │ t │ t │ f
5 │ 0.21629522817435006346 │ f │ f │ f
3 │ -0.43259045634870009448 │ f │ f │ f
4 │ -0.10814761408717501551 │ f │ f │ f
Let's say we want to check if we have a server problem with pure SQL statistics:
500 => problem on the server? 400 => problem with clients? 404 => problem with SEO or redirects? 200 => significant traffic or DOS attack?
Create a table to store statistics:
CREATE TABLE server_log_summary AS (
period timestamptz,int,
status_code int
entries );
We can create an axis to join to with a CTE (Common Table Expression)
-- Correct!
WITH axis AS (
SELECT
status_code,
generate_series('minute', now()),
date_trunc('minute', now() - interval '1 hour'),
date_trunc(interval '1 minute' * -1
AS period
) FROM (
VALUES (200), (400), (404), (500)
AS t(status_code)
)
)SELECT
a.period,
a.status_code,count(*) AS entries
FROM
axis aLEFT JOIN server_log l ON (
'minute', l.timestamp) = a.period
date_trunc(AND l.status_code = a.status_code
)GROUP BY
period, status_code;
We might try to use the Z-Score to find out if we have an anomaly:
= WITH stats AS (
dbSELECT
status_code,MAX(ARRAY[EXTRACT('epoch' FROM period), entries]))[2] AS last_value,
(AVG(entries) AS mean_entries,
STDDEV(entries) AS stddev_entries
FROM
server_log_summaryWHERE
-- In the demo data use:
-- period > '2020-08-01 17:00 UTC'::timestamptz
> now() - interval '1 hour'
period GROUP BY
status_code
)SELECT * FROM stats;
last_value │ mean_entries │ stddev_entries
status_code │
────────────┼────────────┼────────────────────────┼────────────────────────404 │ 0 │ 0.13333333333333333333 │ 0.34280333180088158345
500 │ 0 │ 0.15000000000000000000 │ 0.36008473579027553993
200 │ 4084 │ 2779.1000000000000000 │ 689.219644702665
400 │ 24 │ 0.73333333333333333333 │ 3.4388935285299212
We then calculate if this is anomalous by checking the difference between the previous value:
= WITH stats AS (
dbSELECT
status_code,MAX(ARRAY[EXTRACT('epoch' FROM period), entries]))[2] AS last_value,
(AVG(entries) AS mean_entries,
STDDEV(entries) AS stddev_entries
FROM
server_log_summaryWHERE
-- In the demo data use:
-- period > '2020-08-01 17:00 UTC'::timestamptz
> now() - interval '1 hour'
period GROUP BY
status_code
)SELECT
*,
last_value - mean_entries) / NULLIF(stddev_entries::float, 0) as zscore
(FROM
stats;
last_value │ mean_entries │ stddev_entries │ zscore
status_code │
────────────┼────────────┼──────────────┼────────────────┼────────404 │ 0 │ 0.133 │ 0.3428 │ -0.388
500 │ 0 │ 0.150 │ 0.3600 │ -0.416
200 │ 4084 │ 2779.100 │ 689.2196 │ 1.893
400 │ 24 │ 0.733 │ 3.4388 │ 6.765
Clearly the 400s are anomalous, and also, we seem to have a bunch more 200s than normal.