SHIVAGAMI GUGAN's BLOG: Reliability is everybody’s problem

Reliability is everyone’s problem, not just the SREs’.

What barriers are there to SRE adoption?

<Shivagami Gugan>

There are no real barriers to adopting Site Reliability Engineering practices. Any organization, whatever its current level of maturity, can adopt SRE and improve its business outcomes. However, the key underlying principles of SRE and their applicability should be understood properly. There is a lot of hype around transformation, and the first barrier is understanding what SRE means within your organisational context. I find that, most of the time, people try to imitate what highly mature companies like Google or Spotify or some other cloud-born company do, and this usually results in failure. SRE is the purest form of DevOps implementation. SRE is about removing the silos in a product lifecycle to achieve business outcomes in a safer, faster, cheaper, and better manner. If this is understood clearly, the first barrier is removed.

The second barrier is having a complicated vision of boiling the ocean: doing 100% Agile, 100% DevOps, or 100% SRE. This doesn’t work, especially when you have heritage systems of record, on-prem services, heritage infrastructure, and a whole lot of baggage built up over several years (sometimes decades), which is the usual case for most companies. So be wise in choosing target areas.

The third barrier is thinking that SREs specialize in building operational reliability but have nothing to do with the development or deployment phases of the lifecycle. Reliability is everyone’s problem: the technical product manager, the developer, the tester, the support engineer, not just the SREs. The reality of the SRE role is to ensure a service's availability and reliability by supporting the other teams that own these services. SREs are enablers and collaborators, and their goal is to ensure that the services are resilient and reliable overall and that incremental value is delivered continually. So they have to wear multiple hats. Depending on the situation, they may chip in during design to ensure resiliency is built in; at other times they may be coaching and enabling teams to bake proper observability into the product. The core value that “anything manual is evil” is applied across the spectrum. Hence they enable the team to automate using CI/CD flows, build quality measures for self-regulation (such as test coverage and cyclomatic complexity of code), automate deployments (infrastructure as code), and automate closed-loop remediations.

You may now appreciate that the role spans the whole continuum of the product lifecycle and is not restricted to operational aspects or incident response, as we usually perceive the SRE role to be. Once an organisation understands that implementing SRE is an underlying cultural change that affects all parts of the organization, it becomes easier to remove the main barriers.
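As a rough illustration of the self-regulating quality measures mentioned above (test coverage and cyclomatic complexity), here is a minimal sketch of a gate a CI/CD flow could run before promoting a build. The function name `quality_gate` and the default thresholds (80% coverage, complexity of 10) are assumptions made for this example, not values from the interview; a real team would agree its own limits and feed in numbers from its own coverage and complexity tools.

```python
from typing import List


def quality_gate(
    coverage_pct: float,
    max_complexity: int,
    min_coverage_pct: float = 80.0,  # assumed threshold, tune per team
    complexity_limit: int = 10,      # assumed threshold, tune per team
) -> List[str]:
    """Return the violated thresholds; an empty list means the build may proceed."""
    violations = []
    if coverage_pct < min_coverage_pct:
        violations.append(
            f"test coverage {coverage_pct:.1f}% is below {min_coverage_pct:.1f}%"
        )
    if max_complexity > complexity_limit:
        violations.append(
            f"cyclomatic complexity {max_complexity} exceeds {complexity_limit}"
        )
    return violations


if __name__ == "__main__":
    # Hypothetical numbers, as if reported by coverage and complexity tools.
    problems = quality_gate(coverage_pct=72.5, max_complexity=14)
    for problem in problems:
        print("GATE FAILED:", problem)
```

A pipeline step could fail the build whenever the returned list is non-empty, so the team regulates itself without a manual review step.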

What human attributes or characteristics make someone a great SRE?

<Shivagami Gugan>

SREs are huge collaborators. They are goal-driven, think about the big picture, and have the ability to work on multiple aspects of product resilience. This makes them multi-skilled people with a strong growth mindset. They should be able to get into details quickly, think on their feet, and be brave in resolving problems. And when mistakes happen (they always do!), SREs are able to look at the situation blamelessly, which again makes them very cool-headed and collaborative. They hate anything that has to be done more than twice and will always look to automate anything that’s boring and repetitive. They are great coders.

What are the ways to spot a great SRE?

<Shivagami Gugan>

· SREs are coders. They know the toolset of the product thoroughly.

· If coming from the Dev side, they are programmers who understand infrastructure and can write shell scripts and code in interpreted languages with ease. If coming from the Ops side, they are people who understand application design and development.

· They ensure SLOs are set at the correct service boundaries, and they define alerts that fire when SLI thresholds are breached.

· They measure and report performance against the SLIs: Availability (uptime, error ratio = 5xx / total requests) and Performance (RPS, latency).

· Their operational load is capped at ~50 percent.

· They enable developers on CI/CD automation, quality thresholds and deployment automation using infrastructure as code

· They enable developers to understand how their applications are performing in production by building observability, using distributed tracing and APM tools.

· They thoroughly understand deployment and fail-safe strategies: rollback, canary, and feature flags.

· They influence the building of fault-tolerant, autoscaling, cost-efficient, high-performing designs and architectures.

· SREs should ensure platform standards are adopted and should raise pull requests to enhance SRE product/toolchain features.

· SREs ensure consistency of tooling: all lower environments use the same methodologies and tooling as higher environments.

· SREs handle on-call events and do post-mortems (e.g. they are adept at memory dump analysis, thread dump analysis, OS-level diagnostics, and functional diagnostics).

· SREs ensure error budgets are followed; they ensure self-regulation of velocity and stability, and ensure excess Ops work overflows to the Dev team.
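To make the SLI and error-budget points in the list above a little more concrete, here is a minimal sketch of how the error ratio (5xx / total requests), the availability derived from it, and the remaining error budget might be computed for one reporting window. The `WindowStats` class, the function names, the 99.9% SLO target and the sample counts are all assumptions made for illustration; they do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one reporting window (hypothetical shape)."""
    total_requests: int
    server_errors: int  # responses with a 5xx status code


def error_ratio(stats: WindowStats) -> float:
    """Error Ratio SLI: 5xx responses divided by total requests."""
    if stats.total_requests == 0:
        return 0.0  # assumption: a window with no traffic has no errors
    return stats.server_errors / stats.total_requests


def availability(stats: WindowStats) -> float:
    """Availability SLI expressed as the share of non-5xx responses."""
    return 1.0 - error_ratio(stats)


def error_budget_remaining(stats: WindowStats, slo_target: float) -> float:
    """Failed requests still allowed in this window; negative means overspent."""
    allowed_errors = (1.0 - slo_target) * stats.total_requests
    return allowed_errors - stats.server_errors


if __name__ == "__main__":
    window = WindowStats(total_requests=1_000_000, server_errors=400)
    slo = 0.999  # assumed availability target
    print(f"error ratio:  {error_ratio(window):.4%}")
    print(f"availability: {availability(window):.4%}")
    print(f"budget left:  {error_budget_remaining(window, slo):.0f} failed requests")
```

When the remaining budget goes negative, a team could agree to slow feature releases and put the time into stability, which is one way to read the self-regulation of velocity and stability described above.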

This is an excerpt from my interview on SRE. If you wish to learn more, tune in to DevOps Institute SKILup Days and listen to the entire talk.

Thursday, 22 April 2021
