the.com/fault tolerance
the art of failing without anyone noticing you failed.
means a system's ability to keep working correctly even when one or more of its parts break.
from emerged from 1960s aerospace and mainframe computing, where NASA and IBM engineers realized a single flipped bit shouldn't end a mission or crash a bank.
redundancy trickoften just means running three copies and voting.
space originapollo guidance computer pioneered graceful failure recovery.
not preventionassumes failure will happen, plans around it anyway.
byzantine problemhardest case is a component that lies, not dies.