ERROR

AWS Systems Manager: Fix SSM Agent Heartbeat Failures Resulting in Intermittent Timeouts

Quick Fix Summary

TL;DR

Restart the SSM Agent service on the affected instance.

The SSM Agent fails to send regular heartbeat signals to the Systems Manager service, causing the instance to appear unreachable and leading to command timeouts.

Diagnosis & Causes

Network connectivity or proxy issues blocking outbound HTTPS (port 443) traffic to SSM endpoints.

Resource constraints (high CPU/memory) on the instance starving the agent process.

Corrupted agent state or outdated agent version with known bugs.

Recovery Steps

Step 1: Verify Agent Status and Connectivity

Check if the agent is running and can reach the SSM service endpoints from the instance.

bash

sudo systemctl status amazon-ssm-agent
sudo /opt/aws/amazon-ssm-agent/bin/amazon-ssm-agent -version
curl -s -o /dev/null -w "%{http_code}" https://ssm.us-east-1.amazonaws.com/ || timeout 3 nc -zv ssm.us-east-1.amazonaws.com 443

Step 2: Restart the SSM Agent Service

Gracefully restart the agent to clear any transient state issues.

bash

sudo systemctl restart amazon-ssm-agent
sleep 10&&sudo systemctl status amazon-ssm-agent

Step 3: Check Instance Resource Utilization

Identify if CPU, memory, or disk I/O pressure is causing agent timeouts.

bash

top -b -n 1 | head -20
free -h
df -h / /var/lib/amazon/ssm

Step 4: Inspect Agent Logs for Errors

Examine the agent logs for authentication, network, or heartbeat-specific errors.

bash

sudo tail -100 /var/log/amazon/ssm/amazon-ssm-agent.log
sudo grep -i "heartbeat\|error\|failed to update" /var/log/amazon/ssm/amazon-ssm-agent.log | tail -50

Step 5: Verify IAM Instance Profile and Permissions

Confirm the instance's IAM role has the necessary SSM permissions.

bash

aws sts get-caller-identity --region us-east-1
aws iam get-instance-profile --instance-profile-name YourInstanceProfileName --query "InstanceProfile.Roles[0].RoleName" --output text

Step 6: Update the SSM Agent to the Latest Version

Install the latest agent version to resolve known bugs.

bash

sudo yum update -y amazon-ssm-agent   # For RHEL/Amazon Linux
sudo snap refresh amazon-ssm-agent --classic   # For Ubuntu Snap
sudo /opt/aws/amazon-ssm-agent/bin/amazon-ssm-agent -version

Step 7: Re-register the Managed Instance (Last Resort)

As a final step, de-register and re-register the instance with Systems Manager. WARNING: This clears all agent state.

bash

sudo systemctl stop amazon-ssm-agent
sudo rm -rf /var/lib/amazon/ssm/registration
sudo systemctl start amazon-ssm-agent

Architect's Pro Tip

"Intermittent timeouts are often caused by network ACLs or security groups that allow outbound HTTPS but then silently drop packets due to rate-limiting or stateful inspection timeouts. Test with a sustained curl loop (`for i in {1..30}; do curl -s -o /dev/null -w "%{time_total}\n" https://ssm.region.amazonaws.com/ && sleep 1; done`) to catch periodic packet loss."

Frequently Asked Questions

The agent restarts but fails again after a few minutes. What next?

This strongly points to a resource exhaustion issue. Check for memory leaks using `ps aux | grep ssm-agent` over time and monitor `/var/log/messages` for OOM killer events. Also, verify the agent isn't stuck processing a very large State Manager association or Inventory collection.

How can I prevent this in my Auto Scaling groups?

Use a latest-generation SSM AMI (e.g., Amazon Linux 2023) which has the newest agent pre-installed. In your launch template, add a User Data script to run `yum update -y amazon-ssm-agent` on first boot. Also, ensure your instance IAM role uses the managed policy `AmazonSSMManagedInstanceCore`.

Related AWS Guides

AccessDeniedException

AWS Systems Manager: Fix SSM Agent Heartbeat Failures Resulting in Intermittent Timeouts

Quick Fix Summary

Diagnosis & Causes

Recovery Steps

Step 1: Verify Agent Status and Connectivity

Step 2: Restart the SSM Agent Service

Step 3: Check Instance Resource Utilization

Step 4: Inspect Agent Logs for Errors

Step 5: Verify IAM Instance Profile and Permissions

Step 6: Update the SSM Agent to the Latest Version

Step 7: Re-register the Managed Instance (Last Resort)

Architect's Pro Tip

Frequently Asked Questions

The agent restarts but fails again after a few minutes. What next?

How can I prevent this in my Auto Scaling groups?

Related AWS Guides

How to Fix AWS AccessDeniedException Error

AWS Application Load Balancer: Fix 503 Service Unavailable due to Target Group Resource Exhaustion

AWS EKS: Fix Intermittent Pod Evictions due to Resource Exhaustion in Multi-Tenant Clusters