AWS Systems Manager: Fix SSM Agent Heartbeat Failures Resulting in Intermittent Timeouts
Quick Fix Summary
TL;DRRestart the SSM Agent service on the affected instance.
The SSM Agent fails to send regular heartbeat signals to the Systems Manager service, causing the instance to appear unreachable and leading to command timeouts.
Diagnosis & Causes
Recovery Steps
Step 1: Verify Agent Status and Connectivity
Check if the agent is running and can reach the SSM service endpoints from the instance.
sudo systemctl status amazon-ssm-agent
sudo /opt/aws/amazon-ssm-agent/bin/amazon-ssm-agent -version
curl -s -o /dev/null -w "%{http_code}" https://ssm.us-east-1.amazonaws.com/ || timeout 3 nc -zv ssm.us-east-1.amazonaws.com 443 Step 2: Restart the SSM Agent Service
Gracefully restart the agent to clear any transient state issues.
sudo systemctl restart amazon-ssm-agent
sleep 10&&sudo systemctl status amazon-ssm-agent Step 3: Check Instance Resource Utilization
Identify if CPU, memory, or disk I/O pressure is causing agent timeouts.
top -b -n 1 | head -20
free -h
df -h / /var/lib/amazon/ssm Step 4: Inspect Agent Logs for Errors
Examine the agent logs for authentication, network, or heartbeat-specific errors.
sudo tail -100 /var/log/amazon/ssm/amazon-ssm-agent.log
sudo grep -i "heartbeat\|error\|failed to update" /var/log/amazon/ssm/amazon-ssm-agent.log | tail -50 Step 5: Verify IAM Instance Profile and Permissions
Confirm the instance's IAM role has the necessary SSM permissions.
aws sts get-caller-identity --region us-east-1
aws iam get-instance-profile --instance-profile-name YourInstanceProfileName --query "InstanceProfile.Roles[0].RoleName" --output text Step 6: Update the SSM Agent to the Latest Version
Install the latest agent version to resolve known bugs.
sudo yum update -y amazon-ssm-agent # For RHEL/Amazon Linux
sudo snap refresh amazon-ssm-agent --classic # For Ubuntu Snap
sudo /opt/aws/amazon-ssm-agent/bin/amazon-ssm-agent -version Step 7: Re-register the Managed Instance (Last Resort)
As a final step, de-register and re-register the instance with Systems Manager. WARNING: This clears all agent state.
sudo systemctl stop amazon-ssm-agent
sudo rm -rf /var/lib/amazon/ssm/registration
sudo systemctl start amazon-ssm-agent Architect's Pro Tip
"Intermittent timeouts are often caused by network ACLs or security groups that allow outbound HTTPS but then silently drop packets due to rate-limiting or stateful inspection timeouts. Test with a sustained curl loop (`for i in {1..30}; do curl -s -o /dev/null -w "%{time_total}\n" https://ssm.region.amazonaws.com/ && sleep 1; done`) to catch periodic packet loss."
Frequently Asked Questions
The agent restarts but fails again after a few minutes. What next?
This strongly points to a resource exhaustion issue. Check for memory leaks using `ps aux | grep ssm-agent` over time and monitor `/var/log/messages` for OOM killer events. Also, verify the agent isn't stuck processing a very large State Manager association or Inventory collection.
How can I prevent this in my Auto Scaling groups?
Use a latest-generation SSM AMI (e.g., Amazon Linux 2023) which has the newest agent pre-installed. In your launch template, add a User Data script to run `yum update -y amazon-ssm-agent` on first boot. Also, ensure your instance IAM role uses the managed policy `AmazonSSMManagedInstanceCore`.