What is Ops Run It

Many organisations have an IT department with segregated Delivery and Operations functions. Delivery teams in the former are responsible for building services, and operations teams in the latter are responsible for running services. We call this Ops Run It. We often find it's entwined with long-established IT management standards such as ITSM and ITIL v3.

Ops Run It has been a de facto operating model for decades, and was codified by the COBIT management framework in 1996. COBIT recommended functionally-oriented teams of specialists, working within separate Plan, Build, and Run phases. That was justified by the unavoidably high compute and transaction costs for on-premises software in the 1990s. That pre-Internet rationale doesn't hold true today.

Ops Run It creates a hard divide between Delivery and Operations, powered by radically different incentives. Delivery teams are short-lived, and project-based. They're told to move fast and achieve their deliverables. Operations teams are long-lived, and told to move carefully and provide reliability.

It's important to remember that every organisation is different, and every Ops Run It implementation is different.

Recognising the problem

A large telco customer I worked with was having trouble with its monolithic systems. They had 100% outsourced delivery and operations teams.

They were unable to achieve the deployment throughput they needed to meet demand. The operations team managed overnight releases, with a delivery team available to help solve release issues.
They often had reliability problems after large deployments and peak trading events. Incidents were managed by an operations team with 'pass the buck' handovers to delivery teams, and their Mean Time To Repair (MTTR) was six hours. Delivery teams were asked to do unpaid L3 support, and this led to resentment. In addition, the commercial model for incident response meant it wasn't financially viable to pay the outsourced operations team to fix incidents that weren't P1 or P2.
They had a strong predisposition towards rush to fix. Many fix actions remained on a problem backlog without ever being prioritised or fixed.

This customer understood that they needed to fundamentally change their operating model, and they were interested in You Build It You Run It. They started on their journey by automating their manual regression testing and monolith deployments, so they could incrementally improve service reliability.
The next step was to look at their overall architecture, and review whether the monolith architecture and team structures were the right fit for Continuous Delivery and You Build It You Run It. They soon began a journey to rip and replace their monolithic systems, and revisit the interactions between delivery and operations teams at each stage of the process.

Bethan Timmins
Managing Director
EE Australia & New Zealand

The special character and the million pound outage

I worked with a large ecommerce customer, where six delivery teams built a monolithic website on top of a vendor commerce platform. The website was run by an operations bridge team and an application support team. The customer wanted to scale up to 25 delivery teams, and replace the monolith with microservices.

One day, we lost the ecommerce website from 0800 to 1300, and it cost us $1.5 million in trade. I was in the incident room, and we battled through a succession of latent faults including poor error handling, inadequate telemetry, a missing cache health check, and invalid character rendering. We slowly realised a single special character typed into a content asset had caused a huge chain reaction, and we'd lost the website as a result.

In the root cause analysis session afterwards, the operations manager declared there were many contributing factors to learn from, not a single human error. The application support team manager flagged his concerns about centralised production support, including knowledge transfer challenges and the plans for 25 delivery teams.

The incident was a turning point. The operations manager was supportive of You Build It You Run It, and delivery teams building operability into microservices. Together, we gradually experimented with a new operating model in which delivery teams did their own deployments, and went on-call out of hours for critical services. This contributed to a sea change in deployment frequency and incident response times, and it held firm as the number of delivery teams reached 25 and beyond.

Steve Smith
Principal Consultant
EE UK
