site-reliability-engineering.md

title: Site Reliability Engineering
status: Completed
category: concept
tags: methodology

Site Reliability Engineering (SRE) is a discipline that combines operations and software engineering, applying software engineering specifically to infrastructure and operations problems. Instead of building product features, Site Reliability Engineers build systems to run applications. SRE has similarities with DevOps, but while DevOps focuses on getting code to production, SRE ensures that code running in production works properly.

Problem it addresses

Ensuring applications run reliably requires multiple capabilities, from performance monitoring and alerting to debugging and troubleshooting. Without these, system operators can only react to problems rather than proactively work to avoid them, and downtime becomes only a matter of time.

How it helps

An SRE approach reduces the cost, time, and effort of the software development process by continuously improving the underlying system. The system continuously measures and monitors the infrastructure and application components. When something goes wrong, it shows Site Reliability Engineers when and where the problem occurred and points them toward a fix. By automating operational tasks, this approach helps create highly scalable and reliable software systems.
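
As a rough illustration of the measure-and-monitor loop described above, the following Python sketch checks per-component measurements against an error-rate threshold and reports when and where the threshold was exceeded. Everything in it is an assumption made for illustration: the `Sample` fields, the `find_violations` helper, the 1% threshold, and the sample data are hypothetical, and a real setup would normally rely on a dedicated monitoring and alerting system rather than a hand-written script.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement window for one component (all fields are hypothetical)."""
    component: str      # where: the service that reported the numbers
    window_start: str   # when: start of the measurement window
    requests: int       # total requests seen in the window
    errors: int         # failed requests in the window


def find_violations(samples, max_error_rate=0.01):
    """Return (component, window_start, error_rate) for windows above the threshold."""
    violations = []
    for s in samples:
        if s.requests == 0:
            continue  # no traffic in this window, nothing to judge
        rate = s.errors / s.requests
        if rate > max_error_rate:
            violations.append((s.component, s.window_start, rate))
    return violations


# Made-up measurements, used only to exercise the check.
samples = [
    Sample("checkout", "12:00", requests=1000, errors=3),
    Sample("checkout", "12:05", requests=1000, errors=50),
    Sample("search", "12:05", requests=500, errors=1),
]

for component, when, rate in find_violations(samples):
    print(f"ALERT {component} at {when}: error rate {rate:.1%}")
```

With this made-up data, only the 12:05 window of the `checkout` component exceeds the threshold, so the alert names both the component and the time window. The "how to fix it" part is deliberately left out, since it depends on runbooks and tooling that this sketch does not model.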