Troubleshooting AWS SSM Agent Connection Failures Triggering EC2 Monitoring Alerts

Quick Fix Summary

TL;DR

Restart the SSM Agent service on the affected EC2 instance.

The SSM Agent is not communicating with the AWS SSM service, causing health checks to fail and triggering CloudWatch alarms.

Diagnosis & Causes

Missing or incorrect IAM instance profile permissions.

SSM Agent service is stopped or crashed on the instance.

Network connectivity issues (e.g., security groups, NACLs, VPC endpoints).

Recovery Steps

Step 1: Verify SSM Agent Status and Connectivity

Check if the SSM Agent process is running and can reach the SSM service endpoints.

bash

# Check SSM Agent service status
sudo systemctl status amazon-ssm-agent
# Check for the agent process (the [a] keeps grep from matching its own command line)
ps aux | grep -i '[a]mazon-ssm-agent'
# Test connectivity to the SSM endpoint (replace us-east-1 with your region)
nc -zv ssm.us-east-1.amazonaws.com 443
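
The agent also talks to the EC2 Messages and SSM Messages endpoints, so a single `nc` test can pass while the agent is still cut off. The following is a minimal sketch, not an official tool: the function names `ssm_endpoints` and `check_ssm_endpoints` are made up for this example, and it assumes `bash` and the coreutils `timeout` command are available on the instance.

```bash
# Print the three regional hostnames the SSM Agent connects to.
ssm_endpoints() {
  local region="$1"
  printf '%s\n' \
    "ssm.${region}.amazonaws.com" \
    "ec2messages.${region}.amazonaws.com" \
    "ssmmessages.${region}.amazonaws.com"
}

# Attempt a TCP connection to port 443 on each endpoint and report the result.
# Uses bash's /dev/tcp redirection, so nc does not need to be installed.
check_ssm_endpoints() {
  local region="$1" host
  for host in $(ssm_endpoints "$region"); do
    if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/443" 2>/dev/null; then
      echo "OK    ${host}:443"
    else
      echo "FAIL  ${host}:443"
    fi
  done
}
```

Run it on the instance as `check_ssm_endpoints us-east-1`, replacing the region as needed. A `FAIL` line points at the network checks in Step 4 rather than at the agent itself.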

Step 2: Restart and Re-register the SSM Agent

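Restart the agent service. On an EC2 instance the agent registers automatically with the credentials from the instance profile, so a restart is normally enough to refresh a stale registration. The `-register` option is intended for hybrid-activation (non-EC2) machines and requires an activation code and ID.

bash

# Restart the SSM Agent service
sudo systemctl restart amazon-ssm-agent
# Confirm the service is active again
sudo systemctl is-active amazon-ssm-agent
# Hybrid-activation machines only: re-register with an activation code and ID
# sudo amazon-ssm-agent -register -code "<activation-code>" -id "<activation-id>" -region "us-east-1"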
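
If the agent still does not connect after a restart, its log usually says why. The helper below is a rough sketch: the name `recent_agent_problems` is invented here, the default path is the usual Linux log location, and the pattern assumes the log level appears as a space-delimited `ERROR` or `WARN` token, which may not hold for every agent version.

```bash
# Print the most recent log lines that carry an ERROR or WARN level marker.
# $1: log file (defaults to the usual Linux agent log path)
# $2: maximum number of lines to print (defaults to 20)
recent_agent_problems() {
  local logfile="${1:-/var/log/amazon/ssm/amazon-ssm-agent.log}"
  local count="${2:-20}"
  grep -E ' (ERROR|WARN) ' "$logfile" | tail -n "$count"
}
```

Run it as `sudo bash -c "$(declare -f recent_agent_problems); recent_agent_problems"` or paste the `grep` pipeline directly into a root shell. Credential or access-denied messages suggest Step 3, and timeouts suggest Step 4.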

Step 3: Validate IAM Instance Profile Permissions

Ensure the EC2 instance has an IAM instance profile attached and that its role includes the `AmazonSSMManagedInstanceCore` managed policy (or equivalent permissions).

bash

# Describe the IAM instance profile attached to the instance (from AWS CLI)
aws ec2 describe-instances --instance-ids i-1234567890abcdef0 --query 'Reservations[0].Instances[0].IamInstanceProfile'
# Check attached policies for the IAM role (replace RoleName)
aws iam list-attached-role-policies --role-name MyEC2SSMRole
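
To turn the policy listing into a yes/no answer, the output can be piped through a small filter. This is only a sketch: `has_ssm_core_policy` is a hypothetical helper name, and it checks for the managed policy by name only, so a role that grants the same permissions through a custom or inline policy will be reported as missing.

```bash
# Succeed if the policy names on stdin (tab- or newline-separated, as produced
# by "--output text") include AmazonSSMManagedInstanceCore.
has_ssm_core_policy() {
  tr '\t' '\n' | grep -qx 'AmazonSSMManagedInstanceCore'
}

# Example (replace MyEC2SSMRole with the role behind the instance profile):
#   aws iam list-attached-role-policies --role-name MyEC2SSMRole \
#     --query 'AttachedPolicies[].PolicyName' --output text \
#     | has_ssm_core_policy && echo "policy attached" || echo "policy missing"
```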

Step 4: Check Network and Security Configuration

Verify that the instance's security group allows outbound HTTPS (TCP 443), that the subnet's network ACL permits that traffic and its return traffic, and that any VPC endpoints in use are correctly configured.

bash

# Describe security groups for the instance
aws ec2 describe-instances --instance-ids i-1234567890abcdef0 --query 'Reservations[0].Instances[0].SecurityGroups'
# Check VPC Endpoint status (if using Interface endpoints)
aws ec2 describe-vpc-endpoints --filters "Name=vpc-endpoint-type,Values=Interface" "Name=service-name,Values=com.amazonaws.us-east-1.ssm"
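
The endpoint query above covers only the `ssm` service. A sketch along the following lines can report which of the three SSM-related endpoint services are absent. The function name `missing_ssm_vpc_endpoints` and the VPC ID in the usage comment are placeholders, not values from your account.

```bash
# Read existing endpoint service names from stdin (whitespace-separated) and
# print each SSM-related service name that is missing for the given region.
missing_ssm_vpc_endpoints() {
  local region="$1" existing svc
  existing="$(tr -s '[:space:]' '\n')"
  for svc in ssm ec2messages ssmmessages; do
    if ! printf '%s\n' "$existing" | grep -qxF "com.amazonaws.${region}.${svc}"; then
      echo "com.amazonaws.${region}.${svc}"
    fi
  done
}

# Example (vpc-0123456789abcdef0 is a placeholder VPC ID):
#   aws ec2 describe-vpc-endpoints \
#     --filters "Name=vpc-id,Values=vpc-0123456789abcdef0" \
#     --query 'VpcEndpoints[].ServiceName' --output text \
#     | missing_ssm_vpc_endpoints us-east-1
```

Empty output means all three services have an endpoint in that VPC. It does not confirm that the endpoints are in the `available` state or attached to the right subnets.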

Step 5: Reinstall the SSM Agent

As a last resort, reinstall the latest version of the SSM Agent.

bash

# For Amazon Linux 2 / RHEL / CentOS on x86_64
sudo yum remove -y amazon-ssm-agent
sudo yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
# Enable the agent at boot, then start it
sudo systemctl enable amazon-ssm-agent
sudo systemctl start amazon-ssm-agent
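
The package URL above is specific to x86_64 instances. For fleets that mix in Graviton (arm64) instances, a sketch like the one below can pick the path from the machine architecture. `ssm_agent_rpm_url` is a hypothetical helper, and the `linux_arm64` path is assumed to follow the same layout as the `linux_amd64` one, so verify it against the current AWS installation instructions before relying on it.

```bash
# Print the RPM download URL for the given machine type
# (defaults to the output of "uname -m").
ssm_agent_rpm_url() {
  local machine="${1:-$(uname -m)}" arch
  case "$machine" in
    x86_64)  arch="linux_amd64" ;;
    aarch64) arch="linux_arm64" ;;
    *) echo "unsupported architecture: ${machine}" >&2; return 1 ;;
  esac
  echo "https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/${arch}/amazon-ssm-agent.rpm"
}

# Example:
#   sudo yum install -y "$(ssm_agent_rpm_url)"
```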

Architect's Pro Tip

"This often happens after an instance is stopped/started or its IAM role is modified. The agent's internal registration can become stale. A restart (Step 2) usually clears the state."

Frequently Asked Questions

How can I prevent this alert in the future?

Alert on the agent's own connection state rather than on generic instance status checks, for example by polling the `PingStatus` field returned by `aws ssm describe-instance-information` and raising an alarm when it is not `Online`. Also ensure your EC2 launch templates and AMIs ship with a current SSM Agent and use an IAM role with the `AmazonSSMManagedInstanceCore` policy.
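
One way to watch agent health directly is the `PingStatus` value that Systems Manager reports for each managed instance. The sketch below shows the idea. The function names `agent_is_online` and `check_agent` are invented for this example, and wiring the result into an alarm or scheduler is left out.

```bash
# Succeed only when the PingStatus value indicates a connected agent.
agent_is_online() {
  [ "$1" = "Online" ]
}

# Look up PingStatus for one instance and report it.
# An instance that Systems Manager does not know about yields "None".
check_agent() {
  local instance_id="$1" status
  status="$(aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=${instance_id}" \
    --query 'InstanceInformationList[0].PingStatus' --output text)"
  if agent_is_online "$status"; then
    echo "${instance_id}: agent online"
  else
    echo "${instance_id}: agent not online (PingStatus=${status})"
    return 1
  fi
}
```

Calling `check_agent i-1234567890abcdef0` from a scheduled job returns a non-zero exit code when the agent is disconnected, which most schedulers and monitoring wrappers can turn into an alert.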

The instance is in a private subnet. What should I check?

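Verify that interface VPC endpoints for SSM (`com.amazonaws.<region>.ssm`), EC2 Messages (`com.amazonaws.<region>.ec2messages`), and SSM Messages (`com.amazonaws.<region>.ssmmessages`) exist in the VPC, with private DNS enabled so that the standard service hostnames resolve to the endpoint network interfaces. Interface endpoints are reached through those network interfaces, not through route table entries. The security group attached to the endpoints must allow inbound TCP 443 from the instance's security group.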
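
When interface endpoints are in use with private DNS, the regional SSM hostnames should resolve to addresses inside the VPC when queried from the instance. The sketch below checks whether an address falls in an RFC 1918 private range. `is_private_ipv4` is a made-up helper, and it assumes the VPC uses RFC 1918 addressing, which is common but not guaranteed.

```bash
# Succeed if the dotted-quad IPv4 address in $1 is in 10.0.0.0/8,
# 172.16.0.0/12, or 192.168.0.0/16.
is_private_ipv4() {
  local a rest b
  a="${1%%.*}"        # first octet
  rest="${1#*.}"
  b="${rest%%.*}"     # second octet
  case "$a" in
    10)  return 0 ;;
    172) [ "$b" -ge 16 ] 2>/dev/null && [ "$b" -le 31 ] ;;
    192) [ "$b" = "168" ] ;;
    *)   return 1 ;;
  esac
}

# Example, run on the instance (replace the region):
#   ip="$(getent hosts ssm.us-east-1.amazonaws.com | awk 'NR==1 {print $1}')"
#   is_private_ipv4 "$ip" && echo "resolves inside the VPC" || echo "resolves to a public address"
```

A public address here suggests that private DNS is disabled on the endpoint or that the endpoint is missing, in which case the instance needs a NAT or other outbound path to reach the service.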

Related AWS Guides

How to Fix AWS AccessDeniedException Error

AWS Application Load Balancer: Fix 503 Service Unavailable due to Target Group Resource Exhaustion

AWS EKS: Fix Intermittent Pod Evictions due to Resource Exhaustion in Multi-Tenant Clusters