Reliability is everyone’s problem, not just the SREs’
- What barriers are there to SRE adoption?
<Shivagami Gugan>
There are no real barriers to adopting Site Reliability Engineering practices. Any organization, whatever its current level of maturity, can adopt SRE and improve its business outcomes. However, the key underlying principles of SRE and their applicability need to be understood properly. There is a lot of hype around transformation, and the first barrier is understanding what SRE means within your organizational context. Most of the time, I find that people try to imitate what highly mature companies like Google or Spotify, or some other cloud-born company, do, and this usually ends in failure. SRE is the purest form of the implementation of DevOps. It is about removing the silos in a product lifecycle in order to achieve business outcomes in a safer, faster, cheaper, and better manner. Once this is clearly understood, the first barrier is removed.
The second barrier is a complicated vision of boiling the ocean: doing 100% Agile, 100% DevOps, or 100% SRE. This doesn’t work, especially when you have heritage systems of record, on-prem services, heritage infrastructure, and a whole lot of baggage built up over several years (sometimes decades), which is the usual case for most companies. So be wise in choosing target areas.
The third barrier is thinking that SREs specialize in building operational reliability but have nothing to do with the development or deployment phases of the lifecycle. Reliability is everyone’s problem: the technical product manager, the developer, the tester, the support engineer, not just the SREs. In reality, the SRE role is to ensure a service's availability and reliability by supporting the teams that own that service. SREs are enablers and collaborators, and their goal is to ensure that services are resilient and reliable overall and that incremental value is delivered continually. So they have to wear multiple hats depending on the situation. Sometimes they chip in during design to ensure resiliency is built in. At other times they coach and enable teams to bake proper observability into the product. The core value that “anything manual is evil” is applied across the spectrum. Hence they enable the team to automate using CI/CD flows, build quality measures for self-regulation (such as test coverage and cyclomatic complexity of code), automate deployments (infrastructure as code), and automate closed-loop remediation.
You may now appreciate that the role spans the whole continuum of the product lifecycle and is not restricted to operational aspects or incident response, as we usually perceive the SRE role to be. Once an organization understands that implementing SRE is an underpinning cultural change that affects every part of the organization, it becomes easier to remove the main barriers.
- What human attributes or characteristics make someone a great SRE?
<Shivagami Gugan>
SREs are huge collaborators. They are goal-driven, think in terms of the big picture, and are able to work on multiple aspects of product resilience. This makes them multi-skilled people with a strong growth mindset. They should be able to get into the details quickly, think on their feet, and be brave in resolving problems. And when mistakes happen (they always do!), SREs are able to look at the situation blamelessly, which again makes them very cool-headed and collaborative. They hate anything that has to be done more than twice and will always look to automate anything that’s boring and repetitive. They are great coders.
- What are the ways to spot a great SRE?
<Shivagami Gugan>
· SREs are coders. They know the product's toolset thoroughly.
· If they come from the Dev side, they are programmers who understand infrastructure and can write shell scripts and code in interpreted languages with ease. If they come from the Ops side, they are the people who understand application design and development.
· They ensure SLOs are set at the correct service boundaries, and they define alerts that detect when SLI thresholds are breached.
· They measure and report performance against the SLIs: availability (uptime; error ratio, i.e. 5xx responses / total requests) and performance (requests per second, latency).
· Their operational load is capped at ~50 percent.
· They enable developers on CI/CD automation, quality thresholds, and deployment automation using infrastructure as code.
· They enable developers to understand how their applications are performing in production by building observability with distributed tracing and APM tools.
· They thoroughly understand deployment and fail-safe strategies: rollbacks, canary releases, and feature flags.
· They influence design and architecture so that it is fault-tolerant, autoscaling, cost-efficient, and high-performing.
· SREs ensure platform standards are adopted and raise pull requests to enhance the features of the SRE product and toolchain.
· SREs ensure consistency of tooling: all lower environments use the same methodologies and tooling as the higher environments.
· SREs handle on-call events and run postmortems (for example, they are adept at memory dump analysis, thread dump analysis, OS-level diagnostics, and functional diagnostics).
· SREs ensure error budgets are followed. They ensure velocity and stability are self-regulated, and that excess Ops work overflows to the Dev team.
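To make the SLI and error-budget points in the list above a little more concrete, here is a minimal Python sketch of the arithmetic. It is an illustration under stated assumptions, not something taken from the interview: the `RequestWindow` structure, the function names, the 99.9% SLO target, and the rule "pause releases once the budget is spent" are all hypothetical choices, and real teams usually pull these counts from their monitoring stack and agree their own policy.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one measurement window (hypothetical structure)."""
    total_requests: int
    server_errors: int  # responses that returned a 5xx status


def error_ratio(window: RequestWindow) -> float:
    """Error-ratio SLI: 5xx responses divided by total requests."""
    if window.total_requests == 0:
        return 0.0  # no traffic in the window, so nothing failed
    return window.server_errors / window.total_requests


def availability(window: RequestWindow) -> float:
    """Availability SLI expressed as the share of requests without a 5xx."""
    return 1.0 - error_ratio(window)


def error_budget_remaining(window: RequestWindow, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative when overspent.

    slo_target is the availability objective, e.g. 0.999 (an assumed value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests == 0:
        return 1.0  # no traffic, so the budget is untouched
    allowed_errors = (1.0 - slo_target) * window.total_requests
    return 1.0 - window.server_errors / allowed_errors


def releases_allowed(window: RequestWindow, slo_target: float) -> bool:
    """Assumed policy: keep shipping only while some budget is left."""
    return error_budget_remaining(window, slo_target) > 0.0


if __name__ == "__main__":
    window = RequestWindow(total_requests=1_000_000, server_errors=250)
    print(f"error ratio:      {error_ratio(window):.5f}")
    print(f"availability:     {availability(window):.5f}")
    print(f"budget remaining: {error_budget_remaining(window, 0.999):.2%}")
    print(f"releases allowed: {releases_allowed(window, 0.999)}")
```

With a 99.9% target, one million requests allow roughly 1,000 failed requests, so 250 server errors leave about three quarters of the budget unspent. This is the self-regulation mechanism in miniature: while budget remains, the team can favor velocity, and once it is exhausted, stability work takes priority.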
This is an excerpt from my interview on SRE. If you wish to learn more, tune in to DevOps Institute SKILup Days and listen to the entire talk.