Site reliability engineering

Site reliability engineering (SRE) applies software engineering best practices to information technology (IT) operations in order to make systems more reliable.

A site reliability engineer is trained in incident response and incident management. They identify incidents, log reports, triage and prioritize them, diagnose the problem, escalate the issue where needed, and ultimately work toward resolution and closure.

While serious and costly issues might eventually involve several people and departments, the site reliability engineer will typically be the first person on the scene and will recommend the next best action. Beyond incident response, the SRE focuses on system reliability, performance, and latency.

Per Google, “SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services — Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few — with an ever-watchful eye on their availability, latency, performance, and capacity.”

IBM notes that site reliability engineers can act as intermediaries between development teams, who typically prioritize innovation and new releases, and operations teams, who are inclined to proceed more cautiously. In this role, a site reliability engineer might encourage development teams to slow down, and might also convince operations teams that it is time to move forward. In sum, the engineer keeps the peace between development and operations. Day to day, site reliability engineers are involved in both manual IT operations and code development that improves automation.

Site reliability engineers remain committed to a company’s obligations to customers and clients, who receive a service-level agreement (SLA) describing the level of service they can expect from the vendor. For site reliability engineers, service-level objectives (SLOs) define an error budget: the amount of unreliability a service can tolerate before it misses its target. Once errors exceed that budget, customers will likely be unhappy. Therefore, the goal of a site reliability engineer, and of the entire SRE team, is to keep the number of errors below this threshold.
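As a rough illustration of how an error budget follows from an SLO, the sketch below assumes a request-based objective: a 99.9% success target over 1,000,000 requests leaves room for about 1,000 failures. The function names, the 99.9% target, and the request counts are hypothetical values chosen for the example, not figures from any particular vendor or agreement.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Return how many failed requests the SLO tolerates in the window.

    slo_target is the required success ratio, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    if total_requests < 0:
        raise ValueError("total_requests must be non-negative")
    # The budget is the share of requests that is allowed to fail.
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed: int) -> float:
    """Return the fraction of the error budget still unspent (0.0 to 1.0)."""
    allowed = error_budget(slo_target, total_requests)
    if allowed == 0:
        # A 100% target (or an empty window) leaves no room for failures.
        return 0.0
    # Clamp at zero so an overspent budget does not go negative.
    return max(0.0, 1.0 - failed / allowed)


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests against a 99.9% SLO.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

A team might compare the remaining fraction against a threshold of its own choosing to decide whether to keep shipping releases or to pause and focus on reliability work.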

A site reliability engineer who meets expectations will increase the reliability and efficiency of systems.

Business benefits of site reliability engineering include the following:

  • Proper incident analysis
  • Better use of metrics
  • Continually automated and modernized operations
  • Increased communication between development and operations teams
  • Adherence to the error budget
  • Early identification and resolution of bugs
  • Better SLA performance and a resulting increase in customer satisfaction