alphalist Blog

How AIOps is the future of DevOps Automation

‘PagerDuty raises 90 million to wake up more engineers in the middle of the night,’ read a recent headline, highlighting both the company’s funding round and its unpopular but necessary job. Alex Solomon of PagerDuty, however, is working on making the DevOps world more efficient through on-call management, AIOps, and the automation of routine debugging tasks. In this episode of the alphalist.CTO podcast, Alex shares the importance of on-call management, the future of DevOps, and why we need AIOps to take the pain out of DevOps.

Table of Contents

- About DevOps
- The Need for On-Call Management Tools
- When Notifications Provide a Better Work/Life Balance
- Ironman-Mode: AIOps and Runbook Automation
- Future of DevOps
- A Hybrid Monolith + Microservice Model?
- Does Kubernetes Have a Future?
- The Modern Demands of Engineers
- How to Build Up a Service Reliability Engineering Department as CTO

About DevOps

Amazon was the forerunner of DevOps, even before it was called that. It believed in the concept of full-service ownership: a developer works on a small team building services or microservices and owns them end to end. That means the developer is responsible not just for writing code, but for testing it and deploying it to production. When something breaks in production, it is the developer who gets paged to come fix it, as they are the one who knows it best. When Alex worked at Amazon, they would use pagers to alert developers that things were not working, and this was called ‘pager duty’.

The Need for On-Call Management Tools

PagerDuty started in 2009 when Alex got together with two other former Amazon employees, Andrew and Baskar, and decided to build an on-call management tool for companies using this DevOps concept that needed their developers notified when the system went down. Large companies like Amazon, Google, and Facebook already had internal on-call management tools, but Alex wanted to bring on-call management and alerting to the B2B market.

Before PagerDuty came along, companies were either building their own internal DevOps tooling or still relying on the manual process of people sitting in an operations center (NOC), watching screens and waiting for things to pop up. This framework does not work well in the DevOps world: when a company tried to scale a very software-centric product, it would run into limitations that caused many hours of business and customer impact. PagerDuty built a product to ensure that every single issue gets handled quickly by someone. In the case of a major incident that requires a multi-team response, PagerDuty facilitates a quick response by initiating a chat channel or conference call that gets all relevant parties involved in real time. This minimises the impact on the business.

Although it started as just notifications, it has since evolved into a full end-to-end incident management and incident response system, taking people from the alert all the way to a post mortem on what went wrong.
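To make the escalation mechanics concrete, here is a minimal, hypothetical sketch of an escalation policy in Python. Everything in it (the level names, the timeouts, the function) is invented for illustration; real PagerDuty policies are configured in the product, not hand-coded like this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class EscalationLevel:
    on_call: str            # who to page at this level
    ack_timeout: timedelta  # how long to wait for an acknowledgement

# Hypothetical three-level policy: names and timeouts are invented.
POLICY = [
    EscalationLevel("primary-on-call", timedelta(minutes=5)),
    EscalationLevel("secondary-on-call", timedelta(minutes=10)),
    EscalationLevel("engineering-manager", timedelta(minutes=15)),
]

def who_to_page(alert_time: datetime, now: datetime) -> str:
    """Walk the policy: each time an ack timeout passes, page the next level."""
    elapsed = now - alert_time
    for level in POLICY:
        if elapsed < level.ack_timeout:
            return level.on_call
        elapsed -= level.ack_timeout
    return POLICY[-1].on_call  # keep paging the last level

# e.g. 12 minutes in with no acknowledgement, the secondary is paged:
# who_to_page(t0, t0 + timedelta(minutes=12))  -> "secondary-on-call"
```

The point of the structure is that exactly one person is responsible at any moment, and responsibility moves on automatically if no one acknowledges.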

When Notifications Provide a Better Work/Life Balance

Even though engineers dread getting alerts from PagerDuty, at the end of the day it provides a better work/life balance. Without an on-call management system, the entire team would be paged, meaning everyone would effectively be on call all the time. That usually resulted in most people tuning out the alerts, leaving a handful of people who would actually pick up the gauntlet. An on-call management system means you are only paged when the alert is relevant to you: a system you built and a challenge only you know how to solve. It also means that even while on call, you do not need to watch a screen for the entire shift; you can go out and enjoy yourself and be paged only when needed.

Ironman-Mode: AIOps and Runbook Automation

PagerDuty is also working on AIOps, which is about pattern recognition: making sense of all of the alerts and events and grouping related ones together. As PagerDuty is an alert aggregator, it has access to a lot of data. It has therefore invested in AIOps, which entails making sense of all this data, filtering out the noise, and helping customers understand the problem so they can get to a resolution faster. It also helps to understand the context: is it an isolated problem impacting just one service, or a cascading failure impacting many different services? A systemic failure, or one caused by a third-party provider?
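As a rough illustration of the simplest form of this grouping, here is a naive sketch that clusters alerts by service and time window. It is an invented toy, not PagerDuty’s algorithm; the real AIOps grouping applies machine learning to far richer signals.

```python
from collections import defaultdict
from typing import NamedTuple

class Alert(NamedTuple):
    service: str
    timestamp: float  # seconds since epoch

def group_alerts(alerts: list[Alert], window: float = 300.0) -> list[list[Alert]]:
    """Toy grouping: alerts for the same service arriving within
    `window` seconds of each other are treated as one incident."""
    by_service: dict[str, list[Alert]] = defaultdict(list)
    for a in sorted(alerts, key=lambda a: a.timestamp):
        by_service[a.service].append(a)

    groups: list[list[Alert]] = []
    for service_alerts in by_service.values():
        current = [service_alerts[0]]
        for a in service_alerts[1:]:
            if a.timestamp - current[-1].timestamp <= window:
                current.append(a)   # same burst: fold into the incident
            else:
                groups.append(current)
                current = [a]       # quiet gap: start a new incident
        groups.append(current)
    return groups
```

Even this crude version shows the payoff: a storm of fifty alerts from one failing service collapses into a single incident for a single responder.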

Using this data, one can set up automation to perform common debugging steps.

PagerDuty recently acquired a company called Rundeck, which is all about automated remediation or runbook automation.

Imagine you are the person on the front lines responding to issues. Wouldn’t you love to have a toolbox of actions at your fingertips: perhaps a script or a set of actions you can run to resolve that incident in a repeatable way? PagerDuty calls this Ironman mode: a person gets paged for something and then has a set of tools they can quickly invoke to get additional diagnostics, such as running debugging scripts or pulling the logs, giving them diagnostic data and additional context to understand and triage the problem. Then there is a toolbox of actions they may want to take towards fixing the actual problem, e.g. restarting the server or taking the server out of a load balancer. Perhaps this can even be automated end to end so that no one needs to be paged in the first place. One can even use machine learning to recognise patterns and have the automation say, ‘Hey, you’ve run this action in the past. Can we just do it for you now?’ This will allow companies to do more with less: AIOps allows companies to become more operationally efficient and not have to hire as many people to handle operations, SRE, and DevOps.
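A minimal sketch of what such a toolbox might look like, assuming a simple registry of named actions. All names and behaviour below are hypothetical; Rundeck’s actual model is much richer, with jobs, access control, and audit trails.

```python
from typing import Callable

# Hypothetical runbook 'toolbox': a registry of named, repeatable actions.
ACTIONS: dict[str, Callable[[str], str]] = {}

def action(name: str):
    """Decorator that registers a function as an invokable runbook action."""
    def register(fn: Callable[[str], str]):
        ACTIONS[name] = fn
        return fn
    return register

@action("show-logs")
def show_logs(host: str) -> str:
    return f"(would fetch the last 100 log lines from {host})"

@action("restart-service")
def restart_service(host: str) -> str:
    return f"(would restart the service on {host})"

@action("remove-from-lb")
def remove_from_lb(host: str) -> str:
    return f"(would take {host} out of the load balancer)"

def invoke(name: str, host: str) -> str:
    """What a responder triggers when paged: one tool from the toolbox."""
    return ACTIONS[name](host)

# A responder triaging an incident might run:
#   invoke("show-logs", "web-42")
#   invoke("remove-from-lb", "web-42")
```

The design choice is that every action is named, parameterised, and repeatable, which is exactly what lets an automation layer later suggest or run them without a human.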

Future of DevOps

Companies are moving away from monolithic architecture towards microservices. There is no longer one large monolithic application built on infrastructure with a lot of shared services for database and compute. Technology is moving towards service-oriented architectures where the monolith is broken into services that talk to each other. With microservices, the service unit has become smaller, easier to understand, and more isolated. We are also now seeing a lot of serverless, where a service becomes a collection of functions. So while each unit has become smaller and easier for a person or a team to understand, the system as a whole has become more and more complex. Distributed teams now own those distinct units with end-to-end ownership. Full-service ownership is a core concept in DevOps: you build it, you maintain it, you get paged for it.

The system is now a large distributed system, with all of these services or microservices talking to each other, and it goes beyond any single human’s capacity to understand the entire system and hold it in their head.

The only way you can manage that level of complexity is by leveraging tools, data, and things like a service directory or service mesh tools (e.g. Datadog, PagerDuty, HashiCorp, AWS tools, etc.).

A Hybrid Monolith + Microservice Model?

With a monolith, you can run a single query against a single database and have the world at your fingertips. In the service world, you have to call all these APIs and stitch the data together. Of course, this isn’t so bad if you use GraphQL, API gateways, and a well-designed frontend and backend. Which one is better? Companies do their own thing. Shopify has multiple monoliths and has invested in their upkeep. Sometimes, as in the case of PagerDuty, the monolith serves as a legacy system, and parts of it have been separated into services that now serve the monolith. These services are designed for scale, failure tolerance, and separation of concerns. One can layer a backend, frontend, or GraphQL layer on top, which manages the complexity of all these services. In this way, they use both. With both microservices and monoliths, there is no free lunch: one has to invest in learning and maintaining them.
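To illustrate what ‘stitching the data together’ means in the service world, here is a minimal sketch of an aggregation layer that fans out to two hypothetical services in parallel, much as a GraphQL resolver or API gateway would. The service names and fields are invented.

```python
import asyncio

# Hypothetical downstream services; the sleeps stand in for HTTP calls.
async def fetch_user(user_id: str) -> dict:
    await asyncio.sleep(0.01)  # pretend: call to the user service
    return {"id": user_id, "name": "Ada"}

async def fetch_incidents(user_id: str) -> list[dict]:
    await asyncio.sleep(0.01)  # pretend: call to the incident service
    return [{"id": "inc-1", "status": "resolved"}]

async def user_profile(user_id: str) -> dict:
    """One frontend-facing call that fans out to several services
    concurrently and stitches the results, like a GraphQL resolver."""
    user, incidents = await asyncio.gather(
        fetch_user(user_id), fetch_incidents(user_id)
    )
    return {**user, "incidents": incidents}

print(asyncio.run(user_profile("u-1")))
```

In the monolith this would have been one SQL join; in the service world it is a fan-out, and the aggregation layer is where that complexity gets managed.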

Does Kubernetes Have a Future?

Tobias asked Alex, given what he knows about DevOps combined with his background in Ruby: will Kubernetes, with its steep learning curve, still be around in 10 years? Alex says yes. According to Alex, it usually takes about 10 years for trends in architecture to play out, such as the trend from monolith to service-oriented to microservices to serverless, or the trend from bare metal to virtualization to containerization. Even now, although a lot of large companies use Kubernetes, they still have virtualized workloads as well as legacy on-prem bare-metal hardware.

Another trend that will impact the future of Kubernetes is that the cloud providers (AWS, Azure, GCP, etc.) have all lately made it easier to run a Kubernetes cluster. These providers run the cluster for you, so you don’t have to deal with the complexity of versions, upgrades, and backward-incompatible changes; they just do it for you and offer it as a service. It’s not completely free, but it removes a lot of that operational load so that you can simply rely on their service. So, to answer Tobias’s question: will Kubernetes still be around in 10 years? Yes, but you probably won’t need to run it yourself.

The Modern Demands of Engineers

Software still leads the way, and thus DevOps is here to stay. An engineer has to become a mixed martial artist: they need to know not just how to write code, but also how that system works in production. An engineer needs to understand distributed systems and distributed architectures. What are their SLAs/SLOs? Is it a tier-one, business-critical system, where revenue stops flowing if it goes down (e.g. a shopping cart), or is it a tier-two system, where an outage merely means a lesser degree of performance? Cloud providers are making all of this easier, though. AWS now provides Aurora as a distributed database; it works pretty much all the time, and you don’t have to run your own SQL servers, backups, and multi-cluster systems, which can get very complex.
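One way to make the tier distinction concrete is to express it as an availability SLO and the monthly error budget that SLO implies. A quick sketch, with illustrative targets (the numbers are assumptions, not recommendations):

```python
# Translate an availability SLO into a monthly error budget.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def allowed_downtime_minutes(slo: float) -> float:
    """Monthly downtime a given availability SLO leaves you."""
    return (1.0 - slo) * MINUTES_PER_MONTH

# Tier one (e.g. the shopping cart): 99.95% -> ~21.6 minutes/month.
print(round(allowed_downtime_minutes(0.9995), 1))
# Tier two (degraded but survivable): 99.5% -> 216 minutes/month.
print(round(allowed_downtime_minutes(0.995), 1))
```

The budget is what makes the tier actionable: a tier-one service with 21 minutes a month to spend justifies paging someone at 3 a.m.; a tier-two service often does not.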

How to Build Up a Service Reliability Engineering Department as CTO

If you are starting from scratch:
- Hire for experience: your first hire should be experienced in leading a team using the cloud technology of your choice.
- Use PaaS tools: save yourself a lot of time and use the latest technologies a cloud provider offers. You are better off focusing on product-market fit and letting the cloud provider manage the DevOps for you.

If you are switching from legacy to cloud:
- Hire someone with hybrid experience: someone who has led teams on both legacy and cloud platforms, using the cloud technology of your choice.
- Start now: if you already have legacy platforms, build all new applications on the cloud and make a migration plan for the rest of your systems. Some systems will be easy to migrate; others will take longer, in a phased approach.

Tobias Schlottke

CTO @ saas.group

Tobias Schlottke is the founder of alphalist, a community dedicated to CTOs, and the host of the alphalist CTO podcast. Currently serving as the CTO of saas.group, he brings extensive experience in technology leadership. Previously, Tobias was the Founding CTO of OMR, notable for hosting Germany's largest marketing conference. He also founded the adtech lab (acquired by Zalando) and the performance marketing company adyard, which was sold to Ligatus/Gruner + Jahr in 2010.