the.com/sre
software engineers who treat outages like crime scenes and uptime like religion.
means site reliability engineering: applying software engineering discipline to keep systems running, using code instead of manual ops to prevent things from breaking.
from coined at google around 2003 when ben treynor sloss was asked to run a production team and, being a software engineer, built one out of engineers instead of sysadmins.
error budgets: teams are allowed a set amount of downtime, spent deliberately
toil: repetitive manual work sre exists to automate away
five sres: treynor sloss started google's team with just five people
book: google's sre book is free online and industry gospel
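
the error budget idea above comes down to simple arithmetic, so here is a minimal sketch of it. it assumes an availability target expressed as a fraction (say 0.999) and a 30-day window; the function names, the window length, and the example numbers are illustrative assumptions, not anything from google's tooling.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window for an availability target.

    slo is a fraction, e.g. 0.999 for "three nines" (hypothetical example value).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    # The budget is whatever share of the window the target does not cover.
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after the downtime already spent; negative means overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After a 10-minute outage, roughly 33.2 minutes are left to spend.
    print(round(budget_remaining(0.999, 10), 1))
```

a team following this sketch would usually slow down risky releases once the remaining budget gets close to zero, and ship more freely while there is plenty left.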