Oct 15, 2025
Monitoring services: A Deep Dive into AWS Resources & Best Practices to Adopt
Monitoring cloud environments is a million-dollar business. But you don't necessarily have to buy a third-party service to monitor AWS environments. AWS has a number of offerings that allow you to build robust monitoring of your environment, mainly in the form of Amazon CloudWatch and related services.
For site-reliability engineers (SREs), monitoring is a fundamental concept that underpins much of their work. Monitoring gives you a view into how your cloud environment and your applications are behaving.
In this blog post we will dive deep into which monitoring services AWS offers and how you can manage them using Terraform. We will cover best practices around monitoring on AWS, and see how Anyshift and Annie can help us make sense of all our monitoring data.
What are monitoring services on AWS?
At a high level, relevant monitoring services on AWS are:
Amazon CloudWatch: this is the main service for logs, metrics, alerts, dashboards, etc.
AWS CloudTrail: this is the audit logging service for all activity taking place in your AWS environment.
AWS Config: this is a resource configuration tracking service that allows you to set up rules around how resources can be configured. Think of AWS Config as a policy-as-code tool, even if it does not use traditional policy-as-code languages.
There are also a few additional services under the monitoring umbrella that will not be covered in this blog post. These are Amazon GuardDuty for real-time threat detection in your AWS environment, AWS X-Ray for tracing, Amazon Managed Grafana and Amazon Managed Service for Prometheus. The last two offer a managed experience for the popular monitoring combination of Grafana and Prometheus, which is part of many cloud-native architectures today.
Monitoring services are used to understand, audit, and protect your AWS environment. Most monitoring is built on metrics and logs. Metrics are numeric data points collected at a given interval, e.g. the CPU utilization percentage of an EC2 instance every minute. Logs consist of text-based, often structured (e.g. JSON), data that can contain arbitrary information from AWS services and your applications.
Successful monitoring is more about making sense of the data you are collecting, rather than collecting as much data as possible. In a sense, monitoring is more an art than a science. Analyzing and understanding the metrics and logs coming from your AWS services and applications require skill and experience. But even with years of experience it can be difficult to detect certain signals in your data. This is one area where an AI SRE agent can greatly improve the chances of detecting hidden details in your data.
Managing monitoring services on AWS using Terraform
In this section we will go through how to configure some of the most common monitoring services on AWS using Terraform.
Managing CloudWatch using Terraform
There are various types of resources to manage under the CloudWatch service. In this section we will concentrate on the most common types of resources that you will need for most (or even all) applications you provision: log groups and alarms.
A CloudWatch log group is a collection of related logs from an application or service. A log group contains one or many log streams, where the actual logs are recorded in order. You can use CloudWatch Log Insights to query your logs for information you want to find.
It is easy to create a new log group using Terraform, but note that you often do not need to create log groups explicitly, because they are created automatically by the service or application producing the logs (assuming it has permissions to create log groups).
To get full control over managing your log groups, you should configure them yourself. This also allows you to be explicit around the IAM permissions for the log groups that you create. You create a log group using the aws_cloudwatch_log_group resource type:
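A minimal sketch; the log group name and retention period are assumptions you should adapt to your application:

```hcl
# A log group for a hypothetical Lambda function named "my-function".
# Lambda writes to /aws/lambda/<function name> by default.
resource "aws_cloudwatch_log_group" "example" {
  name              = "/aws/lambda/my-function"
  retention_in_days = 30
}
```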
An important argument is retention_in_days. This argument allows you to store the logs only for a required time period. Setting this to a number of days that is appropriate for your applications is a good FinOps practice, instead of blindly retaining all the logs for an indefinite amount of time.
You can use the log group from whatever application or service you want to log to this group. Here is an example of using this log group with a Lambda function:
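A hedged sketch, assuming a recent AWS provider version that supports the logging_config block on Lambda functions; the filename, handler, runtime, and IAM role are placeholders:

```hcl
# A Lambda function that writes its logs to the log group defined above.
resource "aws_lambda_function" "example" {
  function_name = "my-function"
  filename      = "lambda.zip" # placeholder deployment package
  handler       = "index.handler"
  runtime       = "nodejs20.x"
  role          = aws_iam_role.lambda.arn # assumed to exist elsewhere

  logging_config {
    log_format = "Text"
    log_group  = aws_cloudwatch_log_group.example.name
  }
}
```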
Applications must have an IAM role that allows them to create log streams in the log group and to put logs into these log streams. These permissions are built into the standard IAM policies for AWS Lambda functions. There are also other CloudWatch specific policies you can use.
The other major part of CloudWatch is alarms. You should set alarms for metrics that indicate issues in your applications or infrastructure. Knowing which alarms to create is more an art than a science, so it might be difficult to create them all upfront. Over time, as you learn about managing your infrastructure, you will know which alarms to add.
There are two types of alarm resources:
Metric alarm: an alarm monitoring the numeric value of a given metric name. The numeric value is often modified in some way, e.g. you might be interested in the average value of the metric over the past five minutes or the maximum value during the past hour.
Composite alarm: an alarm based on two or more other alarms. The underlying alarms are typically metric alarms like the ones described above.
You create a basic metric alarm using the aws_cloudwatch_metric_alarm resource type:
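A minimal sketch of an alarm on EC2 CPU utilization; the instance reference, threshold, and SNS topic are assumptions:

```hcl
# Alarm when average CPU utilization is at or above 80% for two periods.
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "ec2-cpu-high"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80

  dimensions = {
    InstanceId = aws_instance.example.id # assumed to exist elsewhere
  }

  # notify an SNS topic (assumed to exist) when the alarm fires or recovers
  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
```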
To understand how to configure an alarm you need to have a grasp of the following concepts:
Comparison operator: this is how you should compare the measured metric value to a specified threshold. In the example above we used GreaterThanOrEqualToThreshold. The name of the comparison operator clearly indicates what it means. Other common comparison operators are e.g. GreaterThanThreshold, LessThanThreshold, and LessThanOrEqualToThreshold.
Namespace and metric name: each AWS service publishes metrics to a namespace (e.g. the EC2 service publishes metrics to the AWS/EC2 namespace). Each metric has a name (e.g. CPUUtilization) in this namespace.
Threshold: the value of the metric at which you want the alarm to trigger.
Statistic: this is how metric values are measured. You could compute the Average value of the metric over the period, or look at the Minimum or Maximum value over this period. Other valid statistics are Sum and SampleCount.
Period: a time period over which the measured metric is evaluated.
Alarm actions: what should happen when this alarm is triggered? This allows you to build automation for automatic remediation. Two common alarm actions are invoking Lambda functions and sending a message to an SNS topic.
There are more concepts you will need when fully managing alarms using Terraform, but for brevity we omit them here.
A composite alarm is used to act based on two or more other alarms. An example of a composite alarm:
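A hedged sketch; the memory alarm referenced here is hypothetical and would be defined similarly to the CPU alarm above:

```hcl
# Fires only when both the CPU alarm and a memory alarm are in ALARM state.
resource "aws_cloudwatch_composite_alarm" "cpu_and_memory" {
  alarm_name = "ec2-cpu-and-memory-high"

  alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.cpu_high.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.memory_high.alarm_name})"

  # configure actions once here instead of on each underlying alarm
  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
```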
The main argument of interest is alarm_rule. Here you write an expression that determines when the composite alarm is triggered. The expression contains references to other alarms. You can use boolean operators such as AND, OR, and NOT together with functions that check the state of a given alarm, such as ALARM, OK, or INSUFFICIENT_DATA.
A good use case for composite alarms is when a single alarm isn't necessarily enough to indicate a problem, but multiple alarms firing together are a clear indication that something must be done. Another good use case is to minimize the number of places where you configure alarm actions. In the example above we only need to configure the alarm and OK actions once, in the composite alarm, instead of in each separate alarm.
Managing CloudTrail using Terraform
If you have a large AWS environment you should be using the AWS Organizations service. A good practice is to enable CloudTrail logging for your organization and all accounts in it. The CloudTrail logs should be stored in a central S3 bucket that is inaccessible to most teams, except for your security and compliance teams.
You should ideally also replicate the CloudTrail logs to a second bucket in a different AWS region, and possibly in a different AWS account. This ensures you have access to audit logs in case of an issue in the primary AWS region.
In the following code example we will focus on enabling CloudTrail for a single account.
The cheapest option for storing CloudTrail logs is Amazon S3. There is also an option to ingest the logs into CloudWatch, but note that this will be much more expensive, especially if you have a lot of activity in your AWS account. The upside of exporting the logs to CloudWatch is that it is much easier to query the logs for information using CloudWatch Logs Insights.
To store the logs in an S3 bucket, first create the bucket:
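A minimal sketch; the bucket name is a placeholder and must be globally unique:

```hcl
# Bucket where CloudTrail will deliver its log files.
resource "aws_s3_bucket" "cloudtrail" {
  bucket = "my-org-cloudtrail-logs" # placeholder name
}
```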
We will need to configure a bucket policy to allow CloudTrail to store logs and perform other necessary actions on the data in this bucket. Create the bucket policy and attach it to the bucket:
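A hedged sketch of the policy CloudTrail needs, together with the trail itself; the trail name is an assumption, and the statements follow the standard pattern of letting CloudTrail check the bucket ACL and write log objects:

```hcl
data "aws_caller_identity" "current" {}

# Allow CloudTrail to check the bucket ACL and write log files.
data "aws_iam_policy_document" "cloudtrail" {
  statement {
    sid       = "AWSCloudTrailAclCheck"
    effect    = "Allow"
    actions   = ["s3:GetBucketAcl"]
    resources = [aws_s3_bucket.cloudtrail.arn]

    principals {
      type        = "Service"
      identifiers = ["cloudtrail.amazonaws.com"]
    }
  }

  statement {
    sid     = "AWSCloudTrailWrite"
    effect  = "Allow"
    actions = ["s3:PutObject"]
    resources = [
      "${aws_s3_bucket.cloudtrail.arn}/AWSLogs/${data.aws_caller_identity.current.account_id}/*",
    ]

    principals {
      type        = "Service"
      identifiers = ["cloudtrail.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "s3:x-amz-acl"
      values   = ["bucket-owner-full-control"]
    }
  }
}

resource "aws_s3_bucket_policy" "cloudtrail" {
  bucket = aws_s3_bucket.cloudtrail.id
  policy = data.aws_iam_policy_document.cloudtrail.json
}

resource "aws_cloudtrail" "main" {
  name                          = "main-trail"
  s3_bucket_name                = aws_s3_bucket.cloudtrail.id
  include_global_service_events = true
  is_multi_region_trail         = true
  enable_log_file_validation    = true # supports log integrity validation

  depends_on = [aws_s3_bucket_policy.cloudtrail]
}
```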
Setting include_global_service_events to true means we will get events from services that are considered global (e.g. IAM).
Once these resources are provisioned, all future actions in your AWS account will be recorded and stored in the S3 bucket.
Managing Config using Terraform
AWS Config is built around different rules that you apply to the resources in your AWS environment. Each rule checks a given configuration on one or more resource types.
There are AWS managed rules and custom rules. In this blog post we will focus on AWS managed rules. Custom rules involve writing an AWS Lambda function where you define the logic of the rule. Custom rules can be complex to implement.
There are a large number of managed rules in Config available for use.
To start working with Config you must create a configuration recorder resource:
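A minimal sketch; the recorder name is an assumption, and the IAM role is created in the next example:

```hcl
# Records configuration changes for all supported resource types.
resource "aws_config_configuration_recorder" "main" {
  name     = "main"
  role_arn = aws_iam_role.config.arn

  recording_group {
    all_supported                 = true
    include_global_resource_types = true
  }
}
```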
This resource requires an IAM role that Config can assume to manage the configuration recorder and related Config resources. Create a new IAM role with the required assume-role policy:
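A hedged sketch; attaching the AWS managed AWS_ConfigRole policy is one common approach, and the role name is a placeholder:

```hcl
# Trust policy letting the Config service assume the role.
data "aws_iam_policy_document" "config_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["config.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "config" {
  name               = "aws-config-recorder" # placeholder name
  assume_role_policy = data.aws_iam_policy_document.config_assume.json
}

resource "aws_iam_role_policy_attachment" "config" {
  role       = aws_iam_role.config.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWS_ConfigRole"
}
```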
The configuration recorder resource supports a number of configurations that can be of interest. If you want to exclude certain types of resources from being managed by Config, you can specify these settings in the recording_group block:
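A hedged sketch, replacing the recording_group from the previous recorder example; the excluded resource type is an illustrative assumption:

```hcl
resource "aws_config_configuration_recorder" "main" {
  name     = "main"
  role_arn = aws_iam_role.config.arn

  recording_group {
    all_supported = false

    # record everything except the listed resource types
    exclusion_by_resource_types {
      resource_types = ["AWS::EC2::NetworkInterface"]
    }

    recording_strategy {
      use_only = "EXCLUSION_BY_RESOURCE_TYPES"
    }
  }
}
```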
There are additional settings you can configure to limit how often Config records certain resource types, if required. This can be important if you want to limit Config's cost: in the Config service you pay for the number of configuration items recorded and for the rule evaluations you run.
With the configuration recorder in place you can start provisioning the rules you are interested in enforcing in your environment. An example of using an AWS managed rule for requiring S3 buckets to have versioning enabled looks like this:
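A minimal sketch; S3_BUCKET_VERSIONING_ENABLED is the identifier of the AWS managed rule for this check:

```hcl
resource "aws_config_config_rule" "s3_versioning" {
  name = "s3-bucket-versioning-enabled"

  source {
    owner             = "AWS"
    source_identifier = "S3_BUCKET_VERSIONING_ENABLED"
  }

  # rules can only be created once the recorder exists
  depends_on = [aws_config_configuration_recorder.main]
}
```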
You can find a list of AWS managed rules in the documentation.
We have seen only a small subset of the available monitoring resources in the Terraform provider for AWS. But by configuring CloudWatch logs and alarms together with both AWS CloudTrail and AWS Config, you will have a great start for monitoring your AWS environment.
Best practices for monitoring services
There are many best practices around monitoring services on AWS. Some of these are highlighted below.
Don't forget monitoring costs in your FinOps analysis
Of all the costs related to monitoring, log ingestion tends to be the most surprising and one of the most expensive parts. You pay per GB of log data ingested into CloudWatch from all your AWS services and custom applications. Log storage tends to be a much cheaper cost item.
Keep a close eye on the cost of CloudWatch, and specifically the log ingestion costs. If your spend on log ingestion increases, it could be the result of logging more than what is needed. A common issue is using the wrong log level in your application, e.g. emitting debug logs in your production environment. This can quickly escalate the costs.
Identify your Key Performance Indicators (KPIs)
Not all the data collected in CloudWatch will be of value to you. You need to apply your knowledge and experience of your applications and services to identify what data is important. This usually boils down to a number of KPIs that matter most. These are the metrics and logs you should primarily focus on in your alarms and monitoring dashboards.
What your KPIs are will differ depending on your context. Typical examples include application response times, number of queries per second in a database, the number of clicks on a website, the CPU and memory utilization of your Kubernetes cluster worker nodes, and more.
Protect your audit logs in CloudTrail
The audit logs you collect with CloudTrail are usually not intended to be viewed by anyone other than your security teams, platform teams, and your auditors. This data should not be broadly available.
Using AWS Organizations you can dedicate a single AWS account to collect all CloudTrail logs from all of your accounts. No one other than those listed above should have access to this account.
On top of this you should replicate the CloudTrail logs to a different bucket in a different region, both as a backup and to make sure the logs are available in case of an issue with the primary region.
CloudTrail logs also support log integrity validation, which can be used to discover if someone has tampered with the CloudTrail logs.
Focus on managed AWS Config rules
AWS Config offers a lot of flexibility through its custom rules. However, with great flexibility comes great management overhead. You should focus on using the managed rules provided by AWS to scale your usage of Config without incurring additional management overhead.
Don't rely solely on human reaction
CloudWatch alarms allow you to build automation to act when a metric is outside its usual values. You can trigger a Lambda function or any other type of automation platform that you are using. You should not rely solely on humans reacting to your alarms, because this leads to slower responses and more error-prone actions.
You can take it one step further to also use AI for reasoning around the metrics and logs that you are collecting. Perhaps there are issues in your data that are just waiting to be discovered. It is difficult to detect a signal in your data if you don't know what you are looking for.
Set up centralized logging for improved insights
In a distributed microservice environment, application logs are less useful if they are stored in isolation per application. Requests from end users often cause a number of internal calls between services. To get full insight into what is happening in your environment, you need a solution that can query logs from all applications involved.
To achieve this you can share CloudWatch Logs across accounts and use CloudWatch Logs Insights to query the relevant log groups.
If you are using a third-party monitoring platform you should export the CloudWatch logs to this solution. Remember that you will have a cost for ingesting logs into CloudWatch, and another cost for exporting the data out from AWS (unless your third-party monitoring platform is hosted on AWS as well).
Only enable the monitoring that you require
There are a number of configuration options you can use to limit what data is monitored for your AWS resources. For instance, RDS instances and EC2 virtual machines allow you to enable detailed monitoring. Only enable this where you have a use case for the data. More monitoring data most often just means increased cost, without any benefit.
If you require logs to be exported from your EC2 instances, install the CloudWatch agent on these machines and set up log export to CloudWatch. However, do not set this up on every instance unless you specifically need it.
For CloudTrail, there are additional features you can enable. One of these is CloudTrail Insights, which analyzes your usual activity patterns to detect actions that go outside the norm. You pay for the number of events that CloudTrail Insights analyzes, which can turn out to be a high cost if you have a lot of activity in your accounts. If this feature is not strictly necessary for your environment, do not enable it.
Finally, reduce the retention of your log groups to avoid storing logs for longer than required. Most log data loses its value within hours or days. You can process the logs to extract any insights you need (like how much log data was recorded during a given period) and then discard the data if it is no longer needed. Even if log storage is not the most expensive cost item (log ingestion is), it will slowly grow the more data you keep.
Terraform and Anyshift for monitoring on AWS
If you are using Datadog as a third-party monitoring tool instead of the native AWS services (e.g. CloudWatch), you can connect your Anyshift environment to it and ask questions related to the monitoring data sent to Datadog.
Currently, native AWS monitoring services are not supported on Anyshift. However, since Anyshift has insight into your cloud environment and how you have configured it using Terraform, you can get great insights into what you can do to improve your monitoring setup and the costs associated with it.
As mentioned earlier, the costs associated with logging in CloudWatch can come as a surprise. What happens if we ask Annie what we can do to improve our CloudWatch costs?

We get a breakdown of our current CloudWatch environment with the top log groups and alarms. Following this we learn about the top cost optimization strategies that we can implement to reduce the costs.
When working with infrastructure as code you should ideally have all your resources defined as code. What can Annie tell us about the infrastructure as code coverage of our CloudWatch environment?

We learn what we need to do to bring these resources under management. Doing this improves the context Annie has, enabling even more accurate suggestions about your AWS environment.
Visit the documentation for more on how Anyshift can help you understand your context better.
Conclusions
The monitoring services on AWS are a great start for any application running on AWS. You should evaluate whether they fulfill your needs for application monitoring before you reach for third-party tools.
The primary monitoring service on AWS is CloudWatch. CloudWatch has tools for building dashboards, setting up alarms, collecting and analyzing logs, and more. Two related monitoring services on AWS are CloudTrail and Config. CloudTrail collects audit logs from all your AWS accounts so you can see who did what and when it happened. Config allows you to set up policies for your resources and get a report on resources that do not comply with these policies. You can also use Config to block users from provisioning resources that do not fulfill the policies (or rules).
Terraform can be used to manage every aspect of the monitoring services on AWS. This allows you to standardize templates for dashboards, alarms, Config rules, and more, and easily apply them at scale in your AWS environment.