Applications today are increasingly complex. The cloud has turbocharged the pace at which companies can develop and deploy new features. However, monitoring and performance mistakes made while writing and deploying new code can degrade customer experience, drive up costs, or interrupt service entirely.

Companies pour time and money into creating custom DevOps tools and alarms, only to fall victim to the limitations of these systems when an anomaly occurs. Most IT teams have to manually sift through terabytes of data to locate the issue, which leads to critical mistakes going unnoticed.

In today's vast and ever-growing infrastructure, maintaining the availability of revenue-generating applications and business-critical workloads can be very challenging, regardless of the size of your company or the market in which you operate.

Businesses are under constant pressure to operate and innovate faster and to do so at lower cost. Organizations are trying to combine different practices, whether from DevOps or site reliability engineering (SRE), to manage these systems. These practices seek to solve major problems and automate operational tasks to improve the speed and quality of application testing and delivery.

What if there were a better way to find and fix operational problems and address all these challenges faced by your business? Amazon DevOps Guru is built to address these problems. It gives you an easy way to improve application availability using pre-trained machine learning models, informed by years of Amazon and AWS operations, to detect critical operational issues.

Leveraging AWS’ powerful infrastructure, DevOps Guru collects, analyzes, and correlates operational data (such as metrics, logs, and events) to identify behaviors that deviate from normal patterns. DevOps Guru then alerts IT management of urgent operational risks, provides a summary of possible causes, and recommends solutions for remediation. In all, this saves hours of valuable debugging time.

DevOps Guru also offers a single-console experience where managers can visualize issues in operational data in the AWS Cloud. They can then sort these events by severity so that the most pressing alerts are handled first. Best of all, DevOps Guru requires no previous machine learning or DevOps experience. It automatically analyzes data from the existing AWS account and automatically adapts to the evolving system architecture. With just a few clicks in the AWS Management Console, DevOps Guru can begin analyzing application activity and help keep applications running smoothly.

Amazon DevOps Guru saves you hours, if not days, of time and effort spent detecting, debugging, and resolving operational issues, and enables engineers to effectively monitor complex and evolving applications. It helps avoid common oversights and errors in monitoring, such as missing alarms, which cause application downtime. DevOps Guru is built to achieve the following goals:

  1. Easy Setup and Ease of Use
  2. Quick Resolution
  3. Reduction in Downtime
  4. Prevention of Revenue Loss
  5. Keeping Your Business Running

When operational issues occur, DevOps Guru saves debugging time by fetching relevant and specific information from a large number of data sources. DevOps Guru generates Operational Insights to alert management of the issue, with a summary of related anomalies, contextual information about why and when the issue occurred, and recommendations on how to remediate it. This reduces application downtime and the Mean Time to Recovery (MTTR). In some cases, it prevents downtime entirely by alerting hours or days before the downtime would occur.

Features & Benefits:

Automatically detect operational issues: DevOps Guru continuously analyzes streams of disparate data and monitors thousands of metrics to establish normal bounds for application behavior. It discovers and classifies operational data such as application metrics, logs, events, and traces in your account, automatically identifies deviations from normal activity, and surfaces high-severity issues to quickly alert you to downtime.

Resolve issues quickly with ML-powered insights: DevOps Guru helps reduce your issue resolution time and assists in root cause identification by correlating anomalies in metrics with operational events. When an operational issue occurs, it generates insights with a summary of related anomalies, contextual information about the issue, and, when possible, actionable recommendations for remediation.

Easily scale and maintain availability: As you migrate and adopt new AWS services, DevOps Guru automatically adapts to changing behavior and evolving system architecture. With DevOps Guru, businesses save time and effort otherwise spent on monitoring applications and manually updating static rules and alarms. In just a few clicks, DevOps Guru starts an in-depth analysis of your AWS application.

Reduce noise and alarm fatigue: DevOps Guru helps developers and IT operators overcome alarm fatigue by automatically correlating and grouping related anomalies to reduce alarm noise, and by surfacing the most critical alerts. With DevOps Guru, you no longer need to manage multiple monitoring tools, which means a greater focus on root cause and resolution.

DevOps Guru Flow of Work:

Selecting coverage: DevOps Guru can be set up in one of two ways

  1. To analyze all AWS resources in the current AWS account in the given Region.
  2. To analyze all AWS resources in the specified CloudFormation stacks (up to 500) in the given Region.
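
The stack-based option can be sketched in Python as a small helper that prepares the request. This is a minimal illustration, not an official setup procedure: the helper name `build_resource_collection` and the stack names are hypothetical, and the request shape is modeled on the `update_resource_collection` operation of the boto3 `devops-guru` client, which should be checked against the current API reference before use. It covers only the second option.

```python
from typing import Dict, List

MAX_STACKS = 500  # limit stated in the coverage options above


def build_resource_collection(stack_names: List[str]) -> Dict:
    """Build a request body that adds CloudFormation stacks to the analyzed set."""
    if not stack_names:
        raise ValueError("at least one stack name is required")
    if len(stack_names) > MAX_STACKS:
        raise ValueError(f"at most {MAX_STACKS} stacks can be specified")
    return {
        "Action": "ADD",
        "ResourceCollection": {"CloudFormation": {"StackNames": list(stack_names)}},
    }


# Hypothetical stack names, for illustration only.
request = build_resource_collection(["checkout-service", "orders-api"])
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("devops-guru").update_resource_collection(**request)
```

Building the request separately from the API call keeps the stack limit check testable without AWS credentials.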

Ingesting data from data sources: DevOps Guru ingests data from the following data sources:

  1. Amazon CloudWatch
  2. Amazon CloudWatch Events
  3. AWS Config
  4. AWS CloudTrail
  5. AWS X-Ray
  6. AWS CodeDeploy
  7. AWS CodePipeline
  8. AWS Systems Manager

DevOps Guru needs an IAM role with permissions to all the above services to ingest necessary data and generate insights.

Data Analysis: DevOps Guru starts to automatically ingest and analyze metrics like latency, error rates, and request rates for all resources to establish normal operating bounds. Then it uses a pre-trained machine learning model to identify deviations from the established baseline. 
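
The idea of establishing normal operating bounds and flagging deviations can be illustrated with a deliberately simple statistical sketch. This is not the pre-trained model DevOps Guru uses, whose internals are not described here. It swaps in a basic mean plus or minus `k` standard deviations rule, and the function names and latency values are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Tuple


def normal_bounds(history: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (low, high) bounds as mean +/- k sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def find_anomalies(history: List[float], recent: List[float], k: float = 3.0) -> List[Tuple[int, float]]:
    """Return (index, value) pairs from `recent` that fall outside the baseline bounds."""
    low, high = normal_bounds(history, k)
    return [(i, v) for i, v in enumerate(recent) if v < low or v > high]


# Hypothetical latency samples in milliseconds.
latency_ms_history = [100, 102, 98, 101, 99, 103, 97, 100]
latency_ms_recent = [101, 99, 180, 100]

anomalies = find_anomalies(latency_ms_history, latency_ms_recent)
print(anomalies)  # [(2, 180)]
```

Here the baseline has a mean of 100 ms and a sample standard deviation of 2 ms, so the bounds are 94 ms to 106 ms and only the 180 ms sample is flagged.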

Data Enrichment: When DevOps Guru identifies anomalous application behavior (like increased latency, error rates, or resource constraints) that could cause potential outages or service disruptions, it alerts operators with issue details. These details include the resources involved, the issue timeline, and other related events to help operators quickly understand the potential impact and likely causes of the issue. It also provides options for remediation or mitigation. 

Developers can then use those suggestions from DevOps Guru to reduce time to resolution when issues arise and improve application availability and reliability.

Integrations: DevOps Guru can be used as a standalone service, and it also integrates with partner applications from PagerDuty and Atlassian, along with AWS Systems Manager OpsCenter.

DevOps Guru Dashboard:

DevOps Guru provides a centralized dashboard where you can see all the information about your monitored resources and ongoing alarms. It even prioritizes the issues that need to be addressed first to reduce the impact on business.

It gives businesses a System Health Summary consisting of all the metrics analyzed, impacted stacks, ongoing reactive and proactive insights, and a System Health Overview that shows the health of all monitored resources.

DevOps Guru provides two types of insights: Reactive Insights and Proactive Insights. You can narrow these down with the available filters, such as status, severity, resource name, or time range, to find the insights you are looking for.
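
The same kind of filtering can be sketched in Python over a list of insight records. This is an illustration under assumptions: the field names `Name`, `Severity`, and `Status` and their values are modeled on the insight summaries returned by the DevOps Guru API, and the sample records and the `filter_insights` helper are hypothetical.

```python
from typing import Dict, List, Optional

# Lower rank sorts first, so HIGH severity insights appear at the top.
SEVERITY_RANK = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def filter_insights(
    insights: List[Dict[str, str]],
    status: Optional[str] = None,
    severity: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Keep insights matching the given status and severity, most severe first."""
    selected = [
        item
        for item in insights
        if (status is None or item["Status"] == status)
        and (severity is None or item["Severity"] == severity)
    ]
    return sorted(selected, key=lambda item: SEVERITY_RANK[item["Severity"]])


# Hypothetical insight records, for illustration only.
insights = [
    {"Name": "DynamoDB throttling", "Severity": "MEDIUM", "Status": "ONGOING"},
    {"Name": "API Gateway 5xx errors", "Severity": "HIGH", "Status": "ONGOING"},
    {"Name": "SQS queue backlog", "Severity": "LOW", "Status": "CLOSED"},
]

ongoing = filter_insights(insights, status="ONGOING")
print([item["Name"] for item in ongoing])
```

Sorting by severity after filtering mirrors the dashboard behavior of putting the most pressing issues first.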

Reactive Insights:

Reactive Insights display any ongoing issues, along with issues that occurred in the selected time period. You can click on an insight name and quickly carry out the proposed recommendations to resolve current issues. A Reactive Insight contains all the information operators need to resolve the issue. Aggregated Metrics show the anomalies arising from different resources on a timeline, and the same data can be viewed graphically in the Graphed Anomalies section. The insight also includes Relevant Events and Recommendations, which can help you understand the cause of the issue and resolve it with the right actions immediately.

[Image: natgateway]

Proactive Insights:

The Proactive Insights section provides you with information about issues that might impact the health of your applications in the future. DevOps Guru surfaces these potential issues hours, or sometimes even days, before they might occur. For example, if a server's memory utilization is being monitored, DevOps Guru can analyze the usage patterns and warn in advance of potential downtime caused by reaching the memory utilization limit.

[Image: memory utilization]
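
The memory utilization example can be made concrete with a simple trend extrapolation. This is only a sketch of the general idea of warning ahead of time. It uses an ordinary least-squares line fit, which is an assumption chosen for illustration and not the forecasting method DevOps Guru applies. The function name, the hourly sampling interval, and the sample values are hypothetical.

```python
from typing import List, Optional


def hours_until_limit(samples: List[float], limit: float = 100.0) -> Optional[float]:
    """Estimate hours until a linear trend through hourly samples reaches `limit`.

    `samples` are utilization percentages, one per hour, oldest first.
    Returns None when there is too little data or utilization is not rising.
    """
    n = len(samples)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    # Least-squares slope and intercept of utilization against hour index.
    slope_num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    slope_den = sum((x - x_mean) ** 2 for x in range(n))
    slope = slope_num / slope_den
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    crossing_hour = (limit - intercept) / slope
    # Express the result relative to the most recent sample.
    return max(0.0, crossing_hour - (n - 1))


# Hypothetical hourly memory utilization readings, in percent.
memory_pct = [60, 62, 64, 66, 68]
remaining = hours_until_limit(memory_pct, limit=90.0)
print(remaining)  # 11.0
```

With utilization rising 2 points per hour from 68%, the fitted line reaches the 90% limit 11 hours after the latest sample, which is the kind of lead time a proactive warning depends on.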


  1. SNS: DevOps Guru integrates with Amazon SNS so that you are notified when an insight is generated. In addition, you can use the same SNS topic to integrate with Opsgenie and send notifications to the right team.
  2. AWS Systems Manager Integration: This enables DevOps Guru to create an OpsItem in OpsCenter for each insight.
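
The SNS integration described above can be sketched as a helper that prepares the notification channel request. This is a hedged illustration: the helper name `build_notification_channel` and the topic ARN are hypothetical, the ARN check is simplified to the standard `aws` partition, and the request shape is modeled on the `add_notification_channel` operation of the boto3 `devops-guru` client, which should be verified against the current API reference.

```python
from typing import Dict


def build_notification_channel(topic_arn: str) -> Dict:
    """Build a request body that registers an SNS topic for insight notifications."""
    # Simplified check: only the standard "aws" partition is accepted here.
    if not topic_arn.startswith("arn:aws:sns:"):
        raise ValueError("expected an SNS topic ARN")
    return {"Config": {"Sns": {"TopicArn": topic_arn}}}


# Hypothetical topic ARN, for illustration only.
channel = build_notification_channel(
    "arn:aws:sns:us-east-1:123456789012:devops-guru-insights"
)
print(channel)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("devops-guru").add_notification_channel(**channel)
```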

Supported AWS Services:

  AWS service : Resource
  API Gateway : API Path/Route
  Application ELB : LoadBalancer
  CloudFront : Distribution
  DynamoDB Streams : Stream
  DynamoDB : Table
  EC2 (ASG) : Instance*
  ECS : Service
  EKS : Service
  Elastic Beanstalk : Environment
  ElastiCache : Node
  Elasticsearch : Node
  ELB : LoadBalancer
  Kinesis : Stream
  NATGateway (VPC) : NatGateway
  Network ELB : LoadBalancer
  RDS : DBInstance
  Redshift : Cluster, Node
  Route 53 : HostedZone
  SageMaker : InvocationEndpoint
  SNS : Topic
  SQS : Queue
  Step Functions : Activity, StateMachine
  SWF : Workflow, Task

DevOps Guru Use Cases:

DevOps Guru is useful for all kinds of cloud customers. It can improve application availability and reduce time to resolution of incidents, preventing loss of revenue and keeping your business running.

Migrating customers who are overwhelmed by the scale and scope of setting up operations for their current applications will benefit from the out-of-the-box coverage. It requires no upfront configuration.

Long-term users of AWS who are working to improve application availability and manage continuous innovation will benefit from DevOps Guru’s ability to continuously scale and auto-calibrate.

Any type of customer can benefit from the Quick Resolution and Noise Reduction the system provides.

Efficacy: (Results of 2020 Beta Testing)

DevOps Guru captured 80% of critical operational issues out of the box, with no configuration required. Of these, it reduced resolution time in almost 70% of the operational incidents compared with traditional methods.


Encryption of data-at-rest: DevOps Guru uses the data retention policies of Amazon S3, DynamoDB, and Kinesis. Data stored in Kinesis can be retained for up to one year, depending on the policies set. Data stored in Amazon S3 and DynamoDB is stored for one year. Stored data is encrypted using the data-at-rest encryption capabilities of Amazon S3, DynamoDB, and Kinesis.

Encryption of data-in-transit: All communication between customers and DevOps Guru, and between DevOps Guru and its downstream dependencies, is protected using TLS and authenticated using the Signature Version 4 signing process.

In short, DevOps Guru is one of the most powerful tools available to make sure your entire infrastructure runs as well as possible. Are you interested in learning more about how to keep your applications running as smoothly as possible? Ibexlabs has helped businesses around the world do exactly that. Contact us today!