This paper from Meta describes how to detect silent data corruptions (SDCs) in production, in order to catch faulty CPUs and other components.
Silent Errors
- Silent errors don't trigger machine checks or ECC errors and leave no log entries; they quietly produce wrong answers.
- Meta observes these at scale, at an empirical rate of around 1 in 1,000 devices, compared with the traditional estimate of 1 in 1,000,000.
- These faults can propagate across services and cause data loss or other hard-to-diagnose misbehavior.
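To make the gap between the two rates concrete, here is a small back-of-the-envelope calculation. The fleet size is a hypothetical number chosen for illustration, not a figure from the paper, and the calculation assumes faults are spread evenly across devices.

```python
def expected_faulty(fleet_size: int, one_in: int) -> float:
    """Expected number of affected devices at a rate of 1 in `one_in`."""
    return fleet_size / one_in


FLEET_SIZE = 100_000  # hypothetical fleet size, not from the paper

observed = expected_faulty(FLEET_SIZE, 1_000)         # rate Meta observes
traditional = expected_faulty(FLEET_SIZE, 1_000_000)  # traditional estimate

print(f"observed rate:    ~{observed:g} affected devices")
print(f"traditional rate: ~{traditional:g} affected devices")
```

Under these assumptions, a problem that the traditional rate treats as a rare curiosity becomes something a fleet of this size should expect to be dealing with all the time.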
Testing
The paper describes a testing funnel that runs from design through deployment:
- Design & Verification
- Post-silicon validation
- Manufacturer testing
- Integrator testing (at rack-level)
- Infrastructure intake/burn-in (running workloads before prod)
- Infrastructure fleet testing (testing devices after deployment, while in fleet)
SDC problems
Detection is hard for several reasons:
- Data dependence: some operations fail only for specific bit patterns.
- Electrical variations: changes in voltage, frequency, or current can trigger or hide faults.
- Environment: temperature, humidity, and hotspots can expose defects.
- Silicon aging: devices degrade over time.
So continuous testing in production is the only practical way to explore this state space.
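The data-dependence point can be illustrated with a minimal sketch of a pattern-based test: run an operation over a set of bit patterns and compare each result against an independently computed reference. This is not the paper's test suite. All names here (`reference_mul`, `faulty_mul`, `bit_patterns`, `find_mismatches`) are hypothetical, and the "defect" is simulated in software. A real test would be written much closer to the hardware than Python integers allow.

```python
from typing import Callable, List, Tuple

MASK64 = (1 << 64) - 1
ALTERNATING = 0xAAAAAAAAAAAAAAAA


def native_mul(a: int, b: int) -> int:
    """The operation under test: 64-bit wrapping multiply."""
    return (a * b) & MASK64


def reference_mul(a: int, b: int) -> int:
    """Independent check: shift-and-add multiply, truncated to 64 bits."""
    result = 0
    while b:
        if b & 1:
            result = (result + a) & MASK64
        a = (a << 1) & MASK64
        b >>= 1
    return result


def faulty_mul(a: int, b: int) -> int:
    """Simulated defect: wrong low bit for exactly one input pair."""
    result = native_mul(a, b)
    if a == ALTERNATING and b == 1 << 3:
        result ^= 1
    return result


def bit_patterns() -> List[Tuple[int, int]]:
    """All pairs drawn from zero, all-ones, alternating bits, and walking ones."""
    values = [0, MASK64, ALTERNATING, 0x5555555555555555]
    values += [1 << i for i in range(64)]
    return [(a, b) for a in values for b in values]


def find_mismatches(
    op: Callable[[int, int], int],
    reference: Callable[[int, int], int],
    patterns: List[Tuple[int, int]],
) -> List[Tuple[int, int]]:
    """Return every input pair where `op` disagrees with `reference`."""
    return [(a, b) for a, b in patterns if op(a, b) != reference(a, b)]


if __name__ == "__main__":
    print(find_mismatches(native_mul, reference_mul, bit_patterns()))
    print(find_mismatches(faulty_mul, reference_mul, bit_patterns()))
```

The simulated defect is only caught because the pattern set happens to include the one failing input pair. A test with a narrower pattern set would pass, which is why a single round of testing cannot rule a device out.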
How Meta tests
- Out-of-production testing (Fleetscanner)
  - Runs a suite of tests while machines are drained, i.e. out of the fleet for upgrades or repair.
  - The suite is extensive, since the server is otherwise idle.
  - New tests are added incrementally, based on the failures Meta sees.
- In-production testing (Ripple)
  - Tests are colocated with normal workloads and run in short bursts (on the order of milliseconds).
  - Test sequences are performance-aware, so that they do not degrade workloads.
  - The tests' footprint is observed to stay below a set threshold, with A/B shadow testing to confirm that the impact is small.
  - Testing is always on, with more of it running during periods of low load.
Results
For Fleetscanner, which is more extensive:
- 93% of all SDC-affected machines are found.
- 23% of all detectable SDCs are seen only by Fleetscanner.
- The drawback is that it takes 6 months to reach full fleet coverage.
For Ripple, which is always on:
- Detects 70% of SDC faults within 15 days.
- 7% unique coverage: faults that show up only when tests are interleaved with workloads.
Discussion
- Silent data corruption is a significant problem at fleet scale.
- You can't catch every error before devices are deployed to the fleet.
- Out-of-production and in-production testing together can detect most SDCs.