Detecting Silent Errors In The Wild Combining Two Novel Approaches To Quickly Detect Silent Data Corruptions At Scale

This paper by Meta discusses detecting Silent Data corruptions in production to catch faulty CPUs and other parts.

Silent Errors

Silent errors don’t trigger machine checks, ECC errors, or log at all, but instead quietly produce wrong answers.
Meta observes these at scale, with an empirical rate of around 1 in 1000 devices, compared to the traditional 1 in 1000000.
These faults can propagate across services and cause data loss or other weird behavior.

The paper notes the testing funnel:

Detection is hard because of:

Some operations fail for only specific bit patterns.
Electrical variations: changes in voltage, frequency, current, can trigger or hide faults
Environment: temperature, humidity, hotspots can expose defects
Silicon aging: devices degrade over time.

So, continuous testing in prod is the only way to explore this state space.

Out of production testing (Fleetscanner)
- Runs a suite of tests when machines are drained, i.e. not in the fleet, for upgrades or repair.
- This is an extensive set of tests while the server is idle.
- The test suite is incrementally added to based on failures meta sees.
In-production testing (Ripple)
- Tests are colocated with normal workloads and run in bursts (~ms scale).
- Test sequences are meant to be performance-aware so as not to degrade workloads
- The footprint of these tests is observed to be less than a certain threshold, with A/B shadow testing to make sure there’s little impact.
- This testing is always on, with more testing that occurs during times of less work.

For Fleetscanner, which is more extensive:

93% of all SDC-affected machines are found.
23% of all detectable SDCs are only seen by Fleetscanner. With the con that it takes 6 months to reach full fleet coverage.

For Ripple, which is always on:

Detects 70% of SDC faults, within 15 days.
7% unique coverage, faults that only show up when tests are interleaved with workloads.