The Incident Review: 4 Times When Typos Brought Down Critical Systems
Sometimes, as these 4 incidents highlight, a major failure results from a mere typo or configuration oversight.
Incident Management vs. Incident Response - What's the Difference?
What are the differences between incident management and incident response? The answer varies widely depending on whom you ask.
The Incident Review: 4 Odd Incidents Caused by Animals
Incidents and outages caused by animals highlight the importance of flexibility and out-of-the-box thinking when it comes to SRE.
Practical Guide to SRE: Using SLOs to Increase Reliability
Service Level Objectives (SLOs) are a key component of any successful Site Reliability Engineering initiative. But what are SLOs, and how do you determine what yours should be? And once you've done that, how should you use them?
Practical Guide to SRE: Automating On-Call
Let's face it: on-call work isn't fun. But it can be better. Even if you have to work on call, it would be nice to have at least some of the work done for you before you drag yourself out of bed at 3 a.m. to respond to an incident.
How Kubernetes Can Both Help and Hinder Incident Management Teams
Kubernetes makes it easier in certain ways to manage reliability. But incident response teams and SREs must also be prepared to handle the unique reliability challenges that K8s creates.
Creating Chaos to Achieve Reliability
How can creating chaos achieve better reliability? Chaos and reliability might seem mutually exclusive, but through Chaos Engineering, SREs can bring about meaningful improvements in system resiliency.
Should You Be an SRE or a DevOps Engineer?
SREs may have better long-term job prospects, but DevOps might be an easier career to pursue.
How Would an SRE Conduct a Postmortem on the Suez Canal Incident?
The Suez Canal has been big news over the last couple of weeks. We wondered how a Site Reliability Engineer (SRE) might conduct a postmortem on what happened with the Ever Given, and what that might mean if a comparable incident occurred at a modern tech company.