How can I persuade developers on my team to embrace "You build it, you run it"?

HOW TO - October 18, 2021

How can I persuade developers on my team to embrace "You build it, you run it"? By that, I mean the model described in this quote from Werner Vogels:

Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.

I'm specifically thinking of a set of developers that:

  • Were hired into a developer role, with little/no mention of ops-related tasks.
  • Traditionally have "thrown code over the wall" to an ops team.
  • Traditionally have a 9-5 work schedule, and are actively hostile to the idea of "pager duty", participating in disaster recovery, writing post-mortems, etc., especially outside of normal business hours. (Note: I only have very infrequent outages in mind for this; I am not proposing that we add after-hours customer support to this team's workload.)
  • Are not currently responsible for writing/supporting monitoring or alerting on their applications.

Suppose a team is rapidly developing new cloud micro-services, to the point where handing these services off to an ops team is sub-optimal: the ops team can't keep up with gaining the deep knowledge of the services that is required to manage and monitor them effectively. "You build it, you run it" would work better for this team because tasks could be delegated to each responsible team member. The team would begin taking part in designing infrastructure, building monitoring/alerting tools for the services, and (very infrequently) responding to outage events.

I am specifically interested in methodologies, backed up by real-world examples. How has this been successfully implemented in other workplaces, and are there any canonical steps to follow while implementing it? Any links to write-ups that can support answers would be very helpful.


This might be worth asking at Workplace SE as well.