Thursday, 22 December 2022

Digitisation or Digitalisation - how does it matter?


Recently I watched a TV show where there were heated discussions amongst the panel members on the topic of digitisation and digitalisation. The first few minutes were spent by the experts explaining what they thought should be the right choice of word. For me, this debate made no sense at all.

“What's in a name? That which we call a rose by any other name would smell as sweet” as Shakespeare says.

I’ve been asked multiple times recently what “digital transformation” means. Given that, by 2023, conservative estimates of annual expenditure on digital transformations are north of $2 trillion, with well over 50% of spend to date considered wasted, it seems like a fundamentally important concept to understand.

"If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute resolving it," said Albert Einstein.

One of my favourite questions to ask customers is, “What problem are you trying to solve?” or “What is keeping you awake at night?” Often, we race into solutioning a poorly defined problem. It’s not uncommon to launch into a transformation effort without fully understanding, or agreeing on, what success looks like. What might start off as an attempt to increase an organisation’s ability to respond to customer needs with more agility can easily become a misguided attempt to prescriptively implement transformation.

So, down to brass tacks: digital transformation is primarily about continually evolving an organisation’s culture and its engagement with customers, enabled by technology as appropriate. Digital transformations are not about technology alone. They are about creating organisational agility in the face of changing customer needs and competitive landscapes. They are also about creating options that allow you to pivot with less friction based on learnings, mitigating the need to be an organisational clairvoyant. Digital transformations typically have four connotations:

1) Engage even tighter with your customer, and hyper-personalise your services
2) Optimise your operations 
3) Innovate your products/services
4) Empower your people

Understanding the current and future state of the customer journey and hyper-personalising customers’ experiences is key. The customer journey is usually about building a deep understanding of how your customers and business interact step by step, enumerating all the touchpoints. “Customers” broadly means employees, suppliers, or actual paying customers, depending on your business. It is about identifying opportunities to remove friction. It is about analysing the journey not purely to optimise but also to determine new avenues that customers desire, and identifying moments where the customer can be delighted. It is also about identifying how to recover from subpar customer experiences. In the case of a successful hypermarket, these new avenues may include home delivery, mobile ordering, curb-side pick-up, table service and building feedback loops from each experience. The journey should be at a manageable level of detail, without calling out every paralysing exception, but also without trying to over-standardise or over-generalise the experiences. Different types of customers typically act in different ways, creating multiple journeys. This can be addressed by creating personas for different customer groups. These form good approximations of behaviour. Over time we can iterate on these personas, getting into more detail in order to drive a greater degree of personalisation - this is a key tenet of many companies’ transformations.

Data is the biggest enabler of hyper-personalised services. Mixing data at rest with data in motion, constantly enriching data from multiple sources, addressing data ownership and lineage issues, and building scalable services that can be accessed through standardised protocols will together enable an omni-channel experience.

Optimising operations is about thinking granular on all aspects (such as Agile delivery, microservices patterns, infrastructure containerisation, cloud adoption). It is also about getting into a risk-mitigation mindset and building a fail-fast culture, adopting automation as much as possible, and building resiliency through DevSecOps and SRE cultures.

Adopting an agile way of delivering services and building an omni-channel customer experience will not happen with an archaic "do it all on my own" mentality. Building innovation with agility is about sharing and using the community's help, leveraging partner ecosystems and rapidly tapping capabilities and resources to influence the entire value and supply chains. Agility matters more here than getting it all right the first time; perfection is achieved over tests and runs.

And last, but not least, people are an organisation's biggest asset. One thing that the pandemic and the great resignation wave have taught us is that the minds of employees are liberated even more, and employers need to exercise additional care in managing this crucial asset. One way is to go back to Maslow's theory of esteem/self-actualisation and Pink's autonomy, mastery, and purpose to trigger intrinsic motivation. Empower people, entrust them with the new ways of doing things, and build a blameless, fail-safe culture. Frameworks such as SAFe teach us how this can be achieved within acceptable guardrails.

And how do we measure success? Of course there are internal measures (lead and cycle-time reductions, number of new features promoted, mean time to recover, change fail rates), but the single most effective metric is CSAT. Does your customer feel the difference in a positive way? Perception is reality, even if the perception is a far cry from what the glorious metrics read on your shiny new dashboards. So it is worth noting that transformations are always about working backwards from the customer, and about doing it alongside and with the business. And remember: a technology transformation alone is never a digital transformation.
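The internal measures mentioned above can be derived from a simple log of deployments. The sketch below is only an illustration under stated assumptions: the `Deployment` record, its fields and the sample history are hypothetical, and recovery time is measured from the moment of deployment, which is a simplification. CSAT itself comes from customer surveys and is deliberately not modelled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    # Hypothetical record of one production change.
    deployed_at: datetime
    failed: bool = False                    # did this change cause a failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def change_fail_rate(deployments: List[Deployment]) -> float:
    """Share of deployments that caused a failure (0.0 to 1.0)."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


def mean_time_to_recover(deployments: List[Deployment]) -> Optional[timedelta]:
    """Average time from a failed deployment to restoration, or None if nothing failed."""
    recoveries = [
        d.restored_at - d.deployed_at
        for d in deployments
        if d.failed and d.restored_at is not None
    ]
    if not recoveries:
        return None
    return sum(recoveries, timedelta()) / len(recoveries)


# Made-up history: five deployments, two of which failed.
start = datetime(2022, 12, 1, 9, 0)
history = [
    Deployment(start),
    Deployment(start + timedelta(days=1), failed=True,
               restored_at=start + timedelta(days=1, minutes=30)),
    Deployment(start + timedelta(days=2)),
    Deployment(start + timedelta(days=3), failed=True,
               restored_at=start + timedelta(days=3, minutes=90)),
    Deployment(start + timedelta(days=4)),
]

print(change_fail_rate(history))      # 0.4
print(mean_time_to_recover(history))  # 1:00:00
```

Numbers like these are useful for the delivery team, but they only matter if the customer feels the difference.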





Sunday, 20 November 2022

Thirukural - the beautiful expression of real-life in 7 words

Last week someone I know told me he had met a young doctor in Bengaluru, practicing as a GP in a clinic he had visited for some ailment. He was intrigued that the doctor shared my last name. He told me the doctor was very mature in the way he talked, understood the problems, and diagnosed, and that for a Gen Z his behavior was way beyond his young age! My curiosity piqued, I asked him which clinic he was talking about, and he said the Kaggadaspura clinikk health hub.

I was pleasantly surprised. I asked him what the doctor's name was, and he told me it was Dr. Visagan Gugan!

I told him that he shared the same last name because he was my SON!

What a coincidence...

This brought to mind one of my favorite Thirukural verses.

Kural 69

ஈன்ற பொழுதின் பெரிதுவக்கும் தன்மகனைச்
சான்றோன் எனக்கேட்ட தாய்

Translation

When a mother hears him named 'fulfill'd of wisdom's lore,' Far greater joy she feels, than when her son she bore

 Plain Explanation  

The mother who hears her son called "a wise man" will rejoice more than she did at his birth

For those who do not know Thirukural, a small introduction to this great scripture -

The Tirukkuṟaḷ, or the Kural for short (Tamil: குறள்), is a classic Tamil-language text consisting of 1,330 short couplets, or kurals, of seven words each. In short, each kural is a beautiful piece of poetry that doubles as life-skills coaching. The scripture is estimated to be more than 2,000 years old and was written by a saint named Thiruvalluvar.



=============

There are far too many good couplets to quote, but keeping the "Wealth of Children" chapter in focus, the following is a further reference to this great scripture:

https://www.thirukkural.net/en/kural/adhigaram-007.html








Saturday, 12 November 2022

Why Bulkhead architecture

The bulkhead architecture is used to build fault tolerance; it is a common application design that tolerates failure. In this architecture, elements of an application are isolated into pools so that if one fails, the others will continue to function. Use this pattern to:

  • Isolate resources used to consume a set of backend services, if your application can provide some level of functionality even when one of the services is not responding.
  • Isolate critical consumers from standard consumers.
  • Protect the application from cascading failures.

Typically, in cloud-based applications, each service may have one or more consumers. Excessive load on, or failure of, a service will impact all of its consumers. If we limit the maximum number of threads that can be used for an endpoint, a saturated endpoint cannot exhaust the resources the other endpoints need, so we will always have some capacity left to process requests. To implement the bulkhead pattern, we need to make sure that all our services work independently of each other and that failure in one will not create a failure in another. Techniques such as the single-responsibility pattern, asynchronous communication, and fail-fast and failure-handling patterns help us achieve this.
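The idea of capping the resources available to each backend can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `Bulkhead` class, `BulkheadFullError`, the service names and the limits are all hypothetical, and a semaphore stands in for a dedicated thread pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BulkheadFullError(RuntimeError):
    """Raised when a bulkhead has no free slot (hypothetical name)."""


class Bulkhead:
    """Caps the number of concurrent calls allowed into one backend service."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Fail fast instead of queueing: if every slot is taken, reject the
        # call so the caller's own threads are never tied up waiting.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


def demo():
    # One bulkhead per backend, each with its own small pool of slots.
    payments = Bulkhead("payments", max_concurrent=2)
    catalogue = Bulkhead("catalogue", max_concurrent=2)

    release = threading.Event()
    started = threading.Barrier(3)  # two slow workers plus the main thread

    def slow_payment():
        started.wait(timeout=5)   # signal that this call holds a slot
        release.wait(timeout=5)   # simulate a backend that is not responding
        return "paid"

    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(payments.call, slow_payment) for _ in range(2)]
        started.wait(timeout=5)   # both payment slots are now occupied

        try:
            payments.call(lambda: "paid")
            third = "accepted"
        except BulkheadFullError:
            third = "rejected"    # the saturated service sheds load

        # The catalogue bulkhead is unaffected by the payments saturation.
        other = catalogue.call(lambda: "catalogue ok")

        release.set()
        results = [f.result() for f in futures]
    return third, other, results


if __name__ == "__main__":
    print(demo())  # ('rejected', 'catalogue ok', ['paid', 'paid'])
```

Rejecting the extra call immediately is one design choice; a bounded queue with a timeout is a common alternative when short waits are acceptable.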




Friday, 30 September 2022

Top Technology Trends that CTOs can blindly follow




1) Composable Architecture

Composable Applications allow polyglot microservices-based packaged-business capabilities (PBCs) or software-defined business objects. PBCs — for example representing a patient or digital twin — create reusable modules that the IT-Business fusion teams can self-assemble to rapidly create applications, reducing time to market. Champion composable architectural principles in all new technology initiatives, including application modernization, new engineering, and the selection of new vendor services. Buy standard PBCs on application marketplaces and integrate using APIs. According to Gartner, by 2024, the design mantra for new SaaS and custom applications will be “composable API-first or API-only,” rendering traditional SaaS and custom applications as “legacy.”

2) Data Fabric/ Data Platform

Data has never been more valuable. But often, data remains siloed within applications, so it’s not being used as effectively as possible. Data fabric integrates data across platforms and users, making data available everywhere it’s needed.

With inbuilt analytics reading metadata, the data fabric is able to learn what data is being used. Its real value lies in its ability to make recommendations for more, different, and better data, reducing data management effort by up to 70%.

Identify priority areas to introduce data fabric solutions by using metadata analytics to determine current data utilization patterns for ongoing business operations. Prioritize areas with significant drift between actual and modeled data.

3) Cybersecurity Mesh

Using a cybersecurity mesh approach, you can integrate multiple data feeds from distinct security products to better identify and respond more quickly to incidents. Digital business assets are distributed across cloud and data centers. Traditional, fragmented security approaches focused on enterprise perimeters leave organizations open to breaches.

A cybersecurity mesh architecture provides a composable approach to security based on identity to create a scalable and interoperable service. The standard integrated structure secures all assets, regardless of location, to enable a security approach that extends across the foundation of IT services.

4) Privacy-Enhancing Computation

The real value of data lies not in simply having it, but in how it’s used for AI models, analytics, and insight. Privacy-enhancing computation (PEC) approaches allow data to be shared across ecosystems, creating value while preserving privacy. Approaches vary, but include encrypting, splitting, or preprocessing sensitive data so that it can be handled without compromising confidentiality. A PEC platform can, for example, use homomorphic encryption so that users can run searches against extremely sensitive data, with both the search and the results remaining encrypted.
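To make the "splitting" approach concrete, here is a minimal sketch of additive secret sharing, one simple PEC technique. It is not homomorphic encryption and not any particular vendor's platform; the function names, the modulus and the salary figures are illustrative assumptions. Each value is split into random-looking shares held by different parties, and the parties can add their shares locally so that only the total is ever reconstructed.

```python
import secrets
from typing import List

# All arithmetic is done modulo this prime (an illustrative choice).
MODULUS = 2**61 - 1


def split(value: int, parties: int) -> List[int]:
    """Split a value into `parties` shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: List[int]) -> int:
    """Recombine all shares; any strict subset reveals nothing about the value."""
    return sum(shares) % MODULUS


def add_shares(a: List[int], b: List[int]) -> List[int]:
    """Each party adds its own two shares locally, without seeing the inputs."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


# Two sensitive values (made-up salaries), each split across three parties.
salary_a = split(52_000, parties=3)
salary_b = split(61_000, parties=3)

# The parties compute shares of the sum; only the total is reconstructed.
total = reconstruct(add_shares(salary_a, salary_b))
print(total)  # 113000
```

Real deployments need far more than this (authenticated channels, protection against malicious parties, support for operations beyond addition), which is where dedicated PEC platforms come in.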

Investigate key use cases within the organization and the wider ecosystem where a need exists to use personal data in untrusted environments or for analytics and business intelligence purposes, both internally and externally. Prioritize investments in applicable PEC techniques to gain an early competitive advantage.

5) Cloud-Native Platforms

According to Gartner, by 2025 cloud-native platforms will serve as the foundation for more than 95% of new digital initiatives, up from less than 40% in 2021.

Lift-and-shift cloud migrations focus on taking legacy workloads and placing them in the cloud. Because these workloads weren't designed for the cloud, they require a lot of maintenance and don't take advantage of any of the benefits. 

Cloud-native platforms use the core elasticity and scalability of cloud computing to deliver faster time to value. They reduce dependencies on infrastructure, freeing up time to focus on application functionality instead.

A typical use case is building a cloud-native platform to create a portfolio of new digital services. For example, a bank can reduce the time to open an account to 5 minutes and add instant digital payments when using a well-architected technology platform. Deploying a microservices architecture enables the integration of services such as savings, virtual debit card, and credit card services, allowing the system to easily scale to over 3.5 million transactions in two months.

6) AI/ML/Metaverse/ AR/ VR/Computer Vision

Distributed enterprise is a virtual-first, remote-first architectural approach to digitize consumer touchpoints and build out experiences to support products. AI engineering is the discipline of operationalizing AI models, using integrated data, model, and development pipelines to deliver consistent business value from AI. Meanwhile, the NFT/blockchain-based metaverse builds on Web 3.0 principles, and its use, from 'play to earn' gaming, AR/VR-enabled retail e-commerce, real estate, hospitality, corporate training and induction on manufacturing shopfloors to aircraft engine simulations, has seen a major boost in government and private investment.


Sunday, 11 September 2022

Spectrum of SRE Implementation Models

 


A significant aspect of SRE implementation in an enterprise is the model that will enable governance and growth at pace. In this article we will look at the Hub and Spoke model as an approach to solving SRE scalability challenges by applying a product-management lifecycle approach, one that enables rapid, repeatable SRE practices cross-pollinated from the spokes, which usually sit across business domains.



 

 

The Hub & Spoke model not only decentralizes the implementation of solutions, but also allows for rapid innovation and sharing of ideas across the organization, while centralizing research into the latest best practices. It helps attract, develop, and retain scarce SRE talent, allowing for flexible allocation of resources to keep employees challenged with new perspectives.

 




A few considerations: as SRE capabilities mature, the governance model will evolve, with more talent sitting in the spokes. This means more work is completed by the business sectors, with the hub acting as a champion that is available when needed.

 

 




The SRE hub exists to enable a self-service model for the spokes. It is typically built using an in-source model with embedded, fit-for-purpose governance. Cross-pollination from the spokes is key to the hub following a product-based approach.

 






Monday, 5 September 2022

The Evolution of the Mainframe

 


Mainframe and its evolution with Cloud Computing




The key attributes associated with mainframe computing are high resilience, high manageability, and scalability. Despite the momentum driving public cloud adoption, there remain workloads that cannot easily be migrated to the public cloud. Whether migration is deemed too risky or reworking legacy code is cost-prohibitive, mainframe computing remains an integral part of the IT ecosystem. There is a growing demand for reworking some mainframe workloads to run natively on cloud infrastructure. But the risks associated with this often mean the core back-end mainframe system remains untouched in many organizations. APIs are used to provide external connectivity so that enterprise developers can build modern functionality, combining the best the public cloud can offer with the reliable transaction processing embodied in the mainframe.

Over the last few years, cloud computing has evolved to the point where it now promises the same level of scalability, flexibility, and operational efficiency that mainframe systems have long provided. In terms of scalability, it even exceeds the mainframe. On scalability, throughput, operational efficiency, and arguably even resilience and failover, the cloud has caught up with the mainframe of the 1990s or early 2000s. It is fair to say that cloud providers have made great strides in security and privacy, but the mainframe is still recognized as the gold standard, with security baked into every layer of the systems stack.

The mainframe ecosystem and the z/OS operating system have evolved too, and IBM has introduced specialty processors to run Linux workloads and support encryption, greatly increasing the flexibility of mainframe systems. Cloud providers offer support for specialist workloads on non-x86 hardware, such as graphics processing units (GPUs) for machine learning and AI. But the latest addition to the z-series mainframe family, the z16, offers what IBM claims is the gold standard for highly secured transaction processing.

The mainframe environment is getting bigger with announcements such as those made at the recent launch of the IBM z16. These include quantum-safe cryptography to protect against the development of quantum computers able to break current encryption standards, on-chip AI acceleration to boost ML and AI execution, and flexible capacity combined with on-demand workload transfer across multiple locations to further reduce the chance of service disruption.

On workload optimization, the two environments are developing in different ways. For example, the mainframe strives to deliver a consistent environment that can handle a wide range of workloads but is managed through the same set of frameworks and tools. The cloud, on the other hand, allows you to spin up dedicated specialized environments, e.g. for AI or analytics. Also, IBM’s ambition to make "mainframe as a service" available from IBM Cloud and across data centers brings mainframe capabilities closer to cloud-native offerings.

Looking at the modern mainframe, particularly the LinuxONE version and the new z16, it's pretty clear that any claims of the mainframe being out of date or legacy stem from a fundamental lack of awareness. Indeed, the mainframe has continued to lead the way in many critical areas, delivering IT cost-effectively, and is far from becoming obsolete.

 

Wednesday, 10 August 2022

DevOps, DevSecOps, and now NoOps, GitOps, BizDevOps, AIOps... what are these

As DevOps becomes more popular and continues to evolve, more variations are appearing. There has always been DevSecOps, and now there is DevSecTestOps and DevSecTestMonOps, and so on…

However, this does not make any sense, as DevOps already integrates and encompasses the Security, Monitoring, Observability and Test Automation tenets. DevOps without Security is meaningless.

What is  BizDevOps? NoOps? DataOps? GitOps? What other terms have emerged and what do they mean? 

GitOps is an operational framework that takes DevOps best practices used for application development, such as version control, collaboration, compliance, and CI/CD, and applies them to infrastructure automation.

BizDevOps is an agile software development methodology that encourages greater communication and collaboration among business, development, and operations teams throughout the software development lifecycle.

NoOps is the idea that the software environment can be so completely automated that there's no need for an operations team to manage it.

DataOps is a collaborative data management practice focused on improving the communication, integration, and automation of data flows between data managers and data consumers across an organization. This includes an integrated approach spanning data ingestion flows, data engineering, and data processing methods, leading in turn to meaningful data analytics. Finally, it's very gratifying that DevOps is considered the pioneer for, and a representation of, a framework that breaks silos and fosters a collaborative culture.

AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination.

Monday, 23 May 2022

Cloud Services - Get the Edge out of it post the Pandemic

As we head into 2022, we continue to feel the human toll of the global pandemic, but we already know it has been a watershed period in which attitudes and norms have permanently shifted, in our everyday lives and at work. Businesses have also changed. For many organizations, the pandemic has catalyzed digital business initiatives as we adapt to the new talent war and to the demands of customers who were forced into new digital options. B2B purchasers are happy to buy digitally, without a sales representative; B2C consumers are buying off social media platforms; employees are physically distributed and communicating asynchronously; and IT infrastructures must secure the “anytime, anyway, anywhere” way in which we’re operating.

Yet to this day, we see customers, especially government entities, ask for on-prem solutions because the cloud is still perceived to be external and insecure. There are still concerns about data privacy, and a comfort that applications hosted in one's own data centre can be hugged and secured with more controls. The “time to market” lag of an on-prem deployment (given the current supply chain issues due to the pandemic) over a cloud-born deployment is expected to be anywhere between 3 and 6 months, and the pay-as-you-go model may allow the CAPEX project expenses for a medium web/mobile application to be reduced by 30 to 35% of the overall project budget. A simple web and mobile deployment architecture for one of the CSPs is attached. More or less the same advantages can be had with other mature CSPs, as they have equivalent cloud services for web/mobile deployment.

Saturday, 16 April 2022

Metaverse - from Employee Meetings to Manufacturing Line Efficiencies; from Digital Twins to Avatars.

The Metaverse is a massive topic at present and is disruptive in bringing the physical and digital worlds together to create new sources of business value. But this is not VR alone, or XR or AR or MR or NFTs or EMG gesture control or BCI (brain-computer interface), or Digital Twins alone. It's a combination of things: it is the layer of digital content that connects the virtual 3D world with the real world, and that can be accessed from VR headsets, AR/MR headsets, mobile phones, laptops and desktops. Of course, for highly immersive experiences it is ideal to have a visual optical experience, and hence a huge amount of research is also being done by the Tech Titans (like Google and Facebook) on optical technologies: how do you build the thinnest and most fashionable eyewear and EMG handwear that will allow you the "Beyond experience"?

The transformational use cases are still evolving and, true to the term META, which means BEYOND, the topic is constantly evolving and growing massively. Because the Metaverse is about people being the centre of technology more than anything else, its growth is estimated to be of a kind that history may not have witnessed before: with 5G bandwidth enablement, it is expected to touch 5B users and have an impact of around 10 trillion USD by 2030, while penetration is currently less than 30%.

A few use cases that we can already see in use are:

1) Effective onboarding of employees when remote and during large-scale growth, reducing inefficiencies, time and cost, and building deeper connections with new hires. The efficiencies, the immersiveness and the impact the Metaverse platform has had on employees have already been proven by some companies.
2) Creating Digital Twins of physical spaces, meeting rooms and offices that allow you to view and take decisions from vantage points that were hitherto not possible.
3) Employee collaboration, with digital avatars moving into common virtual spaces that are digital twins of your office conference room, or of any other place where you want all the avatars to meet in an immersive experience.
4) Training and learning experiences in a digital twin environment (for example a digital twin of a shop floor or of the interior of an aircraft), which take learning and retention of knowledge to a whole new level.
5) Having a digital twin of manufacturing lines, which allows efficiencies and helps identify issues on the shopfloor or supply chain issues resulting in a pileup.
6) Interaction with customers at a completely new level of new-gen business. For example, a car manufacturer can, through digital assets (NFTs), have the complete history of a car tracked and, when the car is transferred to the next customer, take a share of the revenue.
7) An NFT-based economy, from digital real estate to art collections, which is opening a completely new, unforeseen world of opportunities.
8) Having digital twins of real estate properties, and virtual visits to places with an immersive, collaborative experience where people can interact, which will take the business-client relationship to a positively higher level.

The use cases are just endless. The advantages are visible and significant if organisations can start small and start experimenting; when the tipping point of the transformation arrives, such organisations will be ahead of the curve from a strategic, technology and business know-how point of view.

The underlying technologies are data- and AI-based: data modelling, data engineering, AI/ML/DL, computer vision, AR/VR/MR, holography, NLP, blockchain and data science, along with Metaverse platforms such as Microsoft Mesh & Space, Unity3D, Unreal Engine, Amazon Sumerian, SparkAR and Cybernetics, among others. The Metaverse is indeed a place of infinite possibilities.

METAVERSE

The metaverse, first depicted in the iconic The Matrix movie more than 20 years ago, is becoming a reality as brands look to merge the real and digital worlds into one, driving Web 3.0. The difference between Augmented Reality and reality is shrinking rapidly. Just what is the metaverse? It is a vision for a new environment in which to interact with other humans and bots to play games, conduct business, socialize and shop. Best described as a 3D World Wide Web, the metaverse aims to mimic the physical world with a digital facsimile and combines a myriad of technologies including, but not limited to, augmented reality, mixed reality, livestreaming, cryptocurrency and artificial intelligence. The metaverse moved from Hollywood fare to front-page news when Mark Zuckerberg declared Facebook a “metaverse company.” Microsoft laid out its “metaverse tech stack” to facilitate metaverse app development, while Epic Games announced $1 billion of funding to support its long-term metaverse strategy. The metaverse is now represented on the New York Stock Exchange, and Bloomberg Intelligence estimates that the market size for the metaverse could reach $800 billion by 2024. While the metaverse may still be only a buzzword (only 38% of global consumers are familiar with the concept, according to a Wunderman Thompson Intelligence report), the time is now for brands to establish their roadmap for entry into this new CX universe.
Wunderman Thompson breaks down the metaverse into four primary categories: MetaLives, which constitute ideas such as digital ownership and content creation through formats like digital art and nonfungible tokens; MetaSpaces, virtual venues or activations that blend aspects of the virtual world with the real one; MetaBusinesses, including the rise of “gamevertising,” where brands appear within the realm of VR/AR/MR; and MetaSocieties, characterized by people closely wedding their real-life identities with channels like social media and cultivating “hyper-real identities online.”

Metaverse Teleportation

Teleportation in the Metaverse is the act of users moving from one place to another without having to be physically there or use any physical means of transportation. You know how we join a Zoom meeting via a link, but we’re doing that from our respective homes? Well, in the metaverse, instead of joining such a meeting from our respective homes, we’ll actually be together in one space created by the host. Say the host creates a party space or gaming space, then invites his/her friends to join the party or play the games. The friends won’t physically be there, but they’ll turn up using their virtual reality devices and appear. Since users will have their own avatars, the experience will feel so real that it seems they’re actually physically with the host in the space the host created.

Saturday, 8 January 2022

The Four Quadrants of Digital Transformation, DevOps and SRE

Happy New Year Dear Friends
It intrigues me that people are still obsessed with finding the differences between SRE and DevOps and about the role they play in Digital Transformation. Hence this blog.



To be Digital, there is a need to look at 4 quadrants towards driving continuous improvements:

1.      People – Our Employees – How do we “Really” empower our People assets, and how do we get their DISCRETIONARY EFFORTS kicking in.

2.      Customer – How do we understand the Customer intent even before our competitor does, and how can we hyper-personalize offers so that customers see the best VFM (value for money) and come back for REPEAT BUSINESS.

3.      Optimizing Operations – This is where the adoption of Automation/DevOps flows/360-degree Security/Site Reliability Engineering/Elastic Infrastructure/Containerization/Microservice Architecture etc. comes in. The underlying key tenets are to think granular, and to have the ability to deliver quicker and revert quicker in a fail-safe, blameless cultural environment. This is not about some piecemeal three-month consultant projects, but rather a wholesale, in-depth change in organizational culture; great leadership, a sharp intent, and an ability to grasp several aspects of the full picture are required.

4.      Transforming the Service or Product – This is about using “Data” as the key asset to derive clear insights into the customer, and about the organizational culture becoming bolder and more innovative: building partnerships to leverage ecosystems, failing fast, failing safe, having a test-and-run approach, learning, adapting, and continuously improving to deliver Cheaper, Faster, and Better.

 

Site Reliability Engineering (SRE) and DevOps share the goal of building a bridge between development and operations to deliver higher business value.

·        SRE and DevOps share the same foundational principles.

·        SRE can be viewed as a specific implementation of DevOps.

·        They share the same goal of rapidly delivering reliable software.

What is SRE?

SRE, or site reliability engineering, is a methodology developed by Google engineer Ben Treynor Sloss in 2003. The goal of SRE is to align engineering goals with customer satisfaction. Teams achieve this by focusing on reliability. SRE is an implementation of DevOps, a similar school of thought. Google is also responsible for bringing these two methods together. In this article, we'll break down more of what this looks like in practice.

SLIs and SLOs

Reliability is a subjective quality based on your customers’ experiences. SRE allows you to measure how happy your customers are by using SLIs. SLIs, or service level indicators, are metrics that show how your service is performing at key points on a user journey. SLOs then set a limit for how much unreliability the customer will tolerate for that SLI.
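To make the relationship between an SLI and an SLO concrete, here is a minimal sketch in Python. It assumes an availability-style SLI (the share of successful requests) and a 99.9% SLO target; the `RequestCounts` type, the function names, and the sample numbers are all illustrative assumptions, not something prescribed by SRE itself.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical request counts for one step of a user journey."""
    good: int   # requests that succeeded from the user's point of view
    total: int  # all requests observed in the measurement window


def availability_sli(counts: RequestCounts) -> float:
    """Return the SLI as the fraction of good requests (1.0 if no traffic)."""
    if counts.total == 0:
        return 1.0
    return counts.good / counts.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Assumed sample data: 100,000 checkout requests, 50 of which failed.
checkout = RequestCounts(good=99_950, total=100_000)
sli = availability_sli(checkout)
print(f"SLI: {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

In practice the counts would come from your monitoring system, and you would pick the SLI per user journey rather than one number for the whole service.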

Incident response

SRE teaches us that 100% uptime is impossible. Some amount of failure is inevitable. Because of that, incident response is a core SRE best practice. Responding to incidents faster reduces customer impact, but you need processes in place to enable this. There are many components to incident response, including:

  • Incident classification: Sort incidents into categories based on severity and area affected. This allows you to triage incidents and alert the right people.
  • Alerting and on-call systems: Determine people available to respond to incidents as needed. Set guidelines for who gets called and when. Make sure to balance schedules and be compassionate.
  • Runbooks: These are documents that guide responders through a particular task. Runbooks are particularly useful for incident response. They include things to check for and steps to take for each possibility. They’re made as straightforward as possible to reduce toil. Automating runbooks can reduce toil further.
  • Incident retrospectives: SRE advocates learning as much as possible from each incident. Retrospectives document timelines, key communications, resources used, relevant monitoring data, and more. Review these documents as a group. Use them to determine follow-up tasks or revise runbooks and other resources.
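The classification and alerting steps above can be sketched in a few lines of Python. This is only an illustration: the severity levels, the 25% threshold, the list of critical areas, and the routing table are hypothetical values that every organization would define for itself.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # critical area and widespread impact
    SEV2 = 2  # critical area or widespread impact
    SEV3 = 3  # everything else


# Assumed examples of areas where any failure hurts customers directly.
CRITICAL_AREAS = {"checkout", "login"}
# Assumed threshold: a quarter of users affected counts as widespread.
WIDESPREAD_FRACTION = 0.25


def classify_incident(area: str, affected_fraction: float) -> Severity:
    """Classify an incident by the area affected and the share of users hit."""
    critical = area in CRITICAL_AREAS
    widespread = affected_fraction >= WIDESPREAD_FRACTION
    if critical and widespread:
        return Severity.SEV1
    if critical or widespread:
        return Severity.SEV2
    return Severity.SEV3


def who_to_alert(severity: Severity) -> list[str]:
    """Map a severity to the (hypothetical) roles that should be paged."""
    routing = {
        Severity.SEV1: ["on-call engineer", "incident commander", "leadership"],
        Severity.SEV2: ["on-call engineer", "incident commander"],
        Severity.SEV3: ["on-call engineer"],
    }
    return routing[severity]


severity = classify_incident("checkout", 0.5)
print(severity.name, who_to_alert(severity))
```

Keeping the rules this explicit makes them easy to review in a retrospective and to revise when they turn out to page too many or too few people.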

Error budgeting

Nobody expects perfection. Some amount of unreliability is acceptable to your customers. As long as your performance meets your SLO, customers will stay happy with your services. The wiggle room you have before your SLO is breached is the error budget.

Your error budget can help you make decisions about prioritization. For instance, services with lots of remaining error budget can accelerate development. When the error budget depletes, teams know it's time to focus on reliability. Through this decision-making tool, SRE allows operations to influence development in a way that reflects customer needs.
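As a rough sketch of how this decision-making tool works, the Python below derives an error budget from an SLO and reports how much of it is left. The 99.9% target, the request counts, and the simple "any budget left" rule are assumptions for illustration; real policies are usually more nuanced.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    The budget is the number of failures the SLO allows:
    (1 - slo_target) * total_requests. A negative result means
    the SLO has been breached.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures


def next_priority(remaining: float) -> str:
    """Hypothetical policy: ship while budget remains, otherwise stabilize."""
    if remaining > 0:
        return "accelerate development"
    return "focus on reliability"


# Assumed sample window: 1,000,000 requests against a 99.9% SLO,
# which allows roughly 1,000 failures.
healthy = error_budget_remaining(0.999, 1_000_000, 250)
breached = error_budget_remaining(0.999, 1_000_000, 1_200)

print(f"{healthy:.0%} of budget left -> {next_priority(healthy)}")
print(f"{breached:.0%} of budget left -> {next_priority(breached)}")
```

The useful part is less the arithmetic than the agreement behind it: development and operations both accept in advance what happens when the number reaches zero.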

SRE culture

The cultural changes of SRE are as important as (if not more important than) the process changes. The cultural lessons of SRE include:

  • Blamelessness: When something goes wrong, it is never the fault of an individual. Assume that everyone acts in good faith and does their best with the information available to them. Work together to find systemic causes for the incident.
  • Psychological safety: Teammates should feel secure and be comfortable raising issues and expressing concerns without fear of retribution. This encourages creativity, curiosity, and innovation.
  • Celebrating failure: Incidents aren’t setbacks, but unplanned investments in reliability. By experiencing an incident and learning from it, the system becomes more resilient.

What is DevOps?

DevOps is a set of practices that connects the development of software with its maintenance and operations. Its name reflects these two parts: Development and Operations. DevOps originated from a collection of previous practices. These include the Agile development system, the Toyota Way, and Lean manufacturing. The term DevOps became well-known in the early 2010s.

The primary goal of DevOps is to reduce the time between making a change in code and that change reaching the customers, without impacting reliability. It seeks to align the goals of development with organizational needs to create business value. In this way, the goals of SRE and DevOps are very similar. Both focus on customer impact and efficiency, but the methods they use to achieve this vary.

Continuous Deployment

DevOps seeks to increase the frequency of new deployments of code. Faster, more incremental changes allow a more attuned response to customer needs. It also reduces the chance of major incidents caused by large, infrequent deployments.

Collaboration between development and operations

A core tenet of DevOps is to remove silos between development and operations teams. Rather than development “throwing code over the wall” for operations to handle, the teams work together throughout the service’s lifecycle.

Here are some DevOps practices that encourage cooperation between development and operations:

  • Alignment on goals: Ensure both teams understand what they’re working towards. Shared roadmaps and agreed-upon metrics help with alignment. Use customer impact as a common priority.
  • Develop with operations in mind: Development and operations should collaborate on how development should proceed. The operations team makes suggestions that help it maintain the code in production.

Availability of data and resources

Monitoring data for DevOps is a big deal. DevOps advocates measuring valuable data and using it as your basis for decision-making. By default, data should be accessible across the organization.

Simply having a lot of data available isn’t enough to make good decisions. Metrics should be contextualized to provide deeper insights. Make sure that you're setting up monitoring that helps you learn about your system. Having too much data can actually make decision making more difficult.

Automate where possible

Like SRE, DevOps advocates for automating wherever possible. Where SRE focuses on automating to increase consistency and reduce toil, DevOps automates to tighten the development cycle. By removing manual steps in testing and deployment, teams can achieve a faster release frequency.

How SRE connects to DevOps

You can implement both DevOps and SRE in your organization. A helpful way to combine the methodologies is to consider SRE as a way to achieve the goals of DevOps. But SRE is much more than development and deployment automation. It's about working across the whole system ecosystem to deliver operational excellence and increased reliability. At times it also helps to step back from SRE's process-focused approach and concentrate on the goals of DevOps. Drawing from both methodologies as appropriate provides the best way forward.

SRE as an implementation of DevOps

SRE is a method of implementing the goals of DevOps. Here are some of the common goals of DevOps, and how SRE practices can help achieve them:

  • Remove silos: SRE achieves this by creating documentation that the entire organization can use and learn from. Lessons from incidents are fed back into development practices through incident retrospectives.
  • Change gradually: SRE advocates incremental rollouts and A/B testing. This effectively makes the change more gradual, achieving the same goal of reducing the impact of failure.
  • Use tools and automate: Many SRE tools reduce manual toil. Whenever you automate or simplify a process, you reduce toil and increase consistency. You also accelerate the process, achieving DevOps goals.
  • Metric-based decisions: SRE practices encourage monitoring everything and then constructing deep metrics. These will give you the insights you need to make smart decisions.
  • Accept failure: Not only does SRE accept failure, it celebrates it and utilizes it. By strategically using error budgets, you can accelerate development while maintaining reliability.
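To illustrate the "change gradually" point, here is a minimal sketch of a percentage-based rollout in Python. It is one common way to do incremental rollouts, not the only one: each user is placed in a stable bucket from 0 to 99 by hashing an identifier, and the new version is served only to buckets below the current rollout percentage. The function name and the user IDs are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage.

    Hashing gives every user a stable bucket in the range 0-99, so a user
    who sees the new version at 10% keeps seeing it at 50%.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


# Assumed sample users; widen the rollout step by step.
users = [f"user-{n}" for n in range(1000)]
for percent in (1, 10, 50, 100):
    exposed = sum(in_rollout(u, percent) for u in users)
    print(f"{percent:>3}% rollout -> {exposed} of {len(users)} users exposed")
```

A team could pair this with its error budget, widening the percentage only while the budget holds and rolling back when it starts to drain.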

DevOps determines what needs to be done, whereas SRE determines how it will be done. DevOps captures a vision of a system that is developed efficiently and reliably. SRE builds processes and values that result in this system. You can establish your goals using DevOps principles, and then implement SRE to achieve them.

SRE vs. DevOps philosophy

SRE and DevOps share many philosophies and principles. Some that they share include:

  • Placing value on collaboration across teams, particularly between development and operations
  • Automation and toil reduction are key to increasing consistency and helping humans
  • Improvement is always possible. There is always value in reviewing and revising policy
  • Customer satisfaction is the most important concern. It’s the motivator for developing quickly and reliably
  • Sharing knowledge, whether through monitoring data, incident retrospectives, or codified best practices, is key to making good decisions
  • Failure is inevitable, and something to embrace and learn from

However, SRE and DevOps also have some differences in philosophy. Often these come down to priority. Some differences include:

  • DevOps advocates for a fluid approach to problem-solving. SRE creates codified and consistent processes.
  • SRE implements practices such as chaos engineering to further increase reliability. DevOps is more focused on the development lifecycle, so these extra practices don’t typically emerge.
  • SRE generally advocates for lower risk tolerance than DevOps. Working under metrics like SLOs, SRE will implement policies such as code freezes to avoid a breach. DevOps is more comfortable adjusting standards of reliability as development requires.
  • DevOps usually operates with improving development speed as a primary goal. SRE considers increased development velocity a byproduct of error budgeting and better incident response.
  • Both SRE and DevOps have a major focus on automation, but SRE’s approach is more widespread. DevOps primarily automates to increase development speed and focuses on steps in the development cycle. SRE automates any processes it can, from chaos tests to incident management.

SRE vs DevOps teams

When implementing either SRE or DevOps in your organization, you’ll need to consider how these changes will actually take place. Will you:

  • Build policies and procedures collaboratively and rely on everyone to follow them?
  • Assign implementation duties to particular engineers in addition to their normal tasks?
  • Reallocate engineers to be on a team wholly devoted to rolling out new procedures?
  • Hire new engineers to build out your implementation team?

Structures for SRE and DevOps teams

Both DevOps and SRE teams vary based on how centralized they are. At one end is a centralized team, which creates tools, infrastructure, and processes that the entire organization shares.

The other extreme is a distributed team. DevOps/SRE engineers are assigned to individual teams and projects. They handle maintaining the reliability and velocity goals for each team.

In conclusion, depending on the maturity of your organization and your needs, different approaches will be more efficient. You should consider how you want to structure your DevOps and SRE teams, but the big picture remains the central theme of such transformations: the four quadrants of digital transformation towards value creation, namely People, Customers, Operations, and Services.


From a Software Engineer to a CTO