Believe it or not, Site Reliability Engineering did not begin with the publication of the Google book in 2016. The term was coined in 2003 by Benjamin Treynor (still at Google), who was put in charge of running a production team staffed by seven engineers. As of the writing of this post, that was seventeen years ago! Given the fairly recent success of the book, it is hard to believe that SRE has been in practice for so long.
This hugely successful marketing campaign has resulted in tons of tech companies moving to implement SRE. It should be noted that after the inception of SRE, the idea of DevOps was also introduced, thanks to Patrick Debois in 2009, causing some confusion, combination, and integration of both ideas in organizations small and large. While any framework, ideal, methodology, or practice works best when customized to an organization’s needs and structure, the core objective and idea behind SRE seem to be getting fuzzier with every implementation of the practice. It may be time to reset expectations and clear things up by going all the way back while learning from the present.
The minimum requirements of an empowered SRE organization are predicated on engineers within this group having the ability to ensure applications, services, and systems are:
…and taking action when they are not. Benjamin had the ownership and the team needed to operate a production system; it was a shift in mindset that facilitated the birth of SRE. It is an exercise in futility to be prescriptive about what tooling, panes of glass, and instrumentation are used to get the job done; this decision will vary with every business, infrastructure, size, and complexity of services. SRE organizations must have the tools they need to paint the whole picture, or as much of it as can be painted. There is no one-size-fits-all solution.
SREs are relied upon to discover a baseline of health and work in conjunction with development teams to establish SLOs, error budgets, SLIs, and SLAs, all in the service of ensuring services are available, reliable, and serviceable. Measure it, modify it, measure it again… it’s a cycle that continues for the life of a service. SREs are passionate about having an impact in these key areas.
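To make the relationship between an SLI, an SLO, and an error budget concrete, here is a minimal sketch in Python. It assumes a simple request-based availability SLI (good requests divided by total requests); the class name `SLOWindow`, its fields, and the numbers are hypothetical and chosen only for illustration. Real services will often use different indicators, such as latency or freshness, and different measurement windows.

```python
from dataclasses import dataclass


@dataclass
class SLOWindow:
    """One measurement window for a request-based availability SLO (hypothetical model)."""

    slo_target: float      # agreed objective, e.g. 0.999 for "99.9% of requests succeed"
    total_requests: int    # requests observed in the window
    failed_requests: int   # requests that did not meet the success criteria

    @property
    def sli(self) -> float:
        """Measured indicator: fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0  # no traffic, so nothing failed
        return (self.total_requests - self.failed_requests) / self.total_requests

    @property
    def error_budget(self) -> float:
        """Number of failures the SLO tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative when overspent)."""
        budget = self.error_budget
        if budget == 0:
            return 0.0  # a 100% target, or no traffic, leaves no budget to spend
        return 1.0 - self.failed_requests / budget

    @property
    def slo_met(self) -> bool:
        """True when the measured SLI is at or above the objective."""
        return self.sli >= self.slo_target


# Illustrative numbers only: 1,000,000 requests, 250 failures, 99.9% objective.
window = SLOWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"SLI: {window.sli:.5f}")
print(f"Error budget (failed requests allowed): {window.error_budget:.0f}")
print(f"Budget remaining: {window.budget_remaining:.0%}")
print(f"SLO met: {window.slo_met}")
```

The "measure it, modify it, measure it again" cycle then amounts to recomputing these values for each window and deciding, together with the development team, how to spend or protect whatever budget remains.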
Site reliability engineers can serve as a neutral third party within an organization to debug, troubleshoot, configure, and monitor distributed systems’ health. That role requires shared ownership of the resulting products and services across a company. Tooling cannot solve this. Everyone needs some skin in the game.
The Role of SRE Leadership
Having worked in what is classically known as operations for a decade, first as an individual contributor and now as a leader in this space, I understand that an SRE organization cannot be effective without accessibility, influence, and controls. The collaborative nature of the SRE model, and of the DevOps model for that matter, requires strong influence and a shift in cultural norms for many technology companies. This is more than a technical shift; this is a human shift. This is where leaders can step up and champion shared ownership of critical services.
It is an investment in the technical culture and health of an organization that can be challenging to measure — it can, however, be measured.
How does a company measure the revenue increase from loyal customers? Loyal customers are a direct result of a stable and reliable service. Being able to proactively adjust capacity and infrastructure based on established, measured objectives is part of reliability.
How does a company measure the value of employee retention? It is a widely held understanding that the cost to replace skilled software developers/engineers is not negligible. Keeping talented people engaged and productive can be a challenge. Following an SRE culture ensures that there are opportunities for new and seasoned engineers to have career paths that they can follow for many years. This is all part of SRE leadership.
How does a company measure the cost of operating and maintaining an infrastructure that is resilient, scalable, and flexible enough to meet the ever-increasing demand of customers in the 21st century? Having an established SRE organization ensures that this risk is being controlled and optimized.
This list of ways to measure the investment in SRE from a leadership perspective could go on and on.
Fostering a Culture of SRE
Fostering an SRE culture supports the industry’s movement to establish healthier work cultures, removing the need for the “hero,” increasing talent retention, breaking down silos, and mitigating knowledge gaps that have plagued this industry for even longer than seventeen years.
The recruiting process for site reliability engineers is as unique as the role itself. It requires a level of neutrality, yet awareness of the services, systems, and products in production, as well as a well-nurtured operational mindset with an eye towards excellence (planning to blog about that later) and getting things done. It is often the case that the most effective SRE teams combine the skill sets of system engineers, web developers, technical project managers, administrators, and others working together.
Asking for this skill set and focus long-term from the same group of engineers who create a service’s functionality is not only unreasonable, it is also unhealthy, and it does not scale. It is my core belief that forward-thinking tech leaders build for the future, always keeping the end in mind.
We now have the privilege of looking back at seventeen years’ worth of iterations of SRE. As an SRE leader, I am thankful and determined to make the next seventeen years even more empowering for those seeking opportunity and excellence in tech, no matter what term is coined. It is the direction we are moving in that matters.
Don’t just take my word for it. Check out these amazing resources (videos get longer as you scroll down so pace yourself):