Troubleshooting AWS SSM Agent Connection Failures Triggering EC2 Monitoring Alerts

Quick Fix Summary

TL;DR

Restart the SSM Agent service on the affected EC2 instance.

The SSM Agent is not communicating with the AWS SSM service, causing health checks to fail and triggering CloudWatch alarms.

Diagnosis & Causes

  • Missing or incorrect IAM instance profile permissions.
  • SSM Agent service is stopped or crashed on the instance.
  • Network connectivity issues (e.g., security groups, NACLs, VPC endpoints).

Recovery Steps

    Step 1: Verify SSM Agent Status and Connectivity

    Check if the SSM Agent process is running and can reach the SSM service endpoints.

    bash
    # Check SSM Agent service status
    sudo systemctl status amazon-ssm-agent
    # Check for the agent process (pgrep avoids matching the grep command itself)
    pgrep -af amazon-ssm-agent
    # Test connectivity to SSM endpoints (replace region)
    nc -zv ssm.us-east-1.amazonaws.com 443
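
Because the agent also talks to the `ec2messages` and `ssmmessages` endpoints (see the private-subnet question in the FAQ), it can help to probe all three at once. The following is a minimal sketch, assuming the standard `amazonaws.com` DNS suffix and an `nc` build that supports `-z` and `-w`. The function names are illustrative and not part of any AWS tooling.

```bash
# Print the three service hostnames the agent talks to in a given region.
# Assumes the standard amazonaws.com suffix (other partitions differ).
ssm_endpoints() {
    region="$1"
    for service in ssm ec2messages ssmmessages; do
        echo "${service}.${region}.amazonaws.com"
    done
}

# Probe each hostname on TCP 443 and print OK or FAIL per endpoint.
# Returns non-zero if at least one endpoint cannot be reached.
check_ssm_endpoints() {
    failed=0
    for host in $(ssm_endpoints "$1"); do
        if nc -z -w 5 "$host" 443 >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}
```

Run on the instance as `check_ssm_endpoints us-east-1`, replacing the region with your own.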

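    Step 2: Restart the SSM Agent and Check Its Logs

    Restart the agent service. On an EC2 instance the agent registers automatically using the instance profile credentials (the agent's `-register` option is intended for hybrid activations and expects an activation code and ID), so if the issue persists after a restart, check the agent log for credential or network errors before moving on to the IAM and network checks below.

    bash
    # Restart the SSM Agent service
    sudo systemctl restart amazon-ssm-agent
    # Inspect recent agent log entries if the instance still does not report in
    sudo tail -n 50 /var/log/amazon/ssm/amazon-ssm-agent.log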
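
After a restart, one way to confirm that the instance has reconnected is to poll the `PingStatus` that Systems Manager reports for it. This is a rough sketch, meant to be run from a machine with AWS CLI credentials that allow `ssm:DescribeInstanceInformation`. The function names, attempt count, and delay are assumptions chosen for illustration.

```bash
# Ask Systems Manager how it currently sees the instance.
# Prints Online, ConnectionLost, Inactive, or None if the instance is unknown.
ssm_ping_status() {
    aws ssm describe-instance-information \
        --filters "Key=InstanceIds,Values=$1" \
        --query 'InstanceInformationList[0].PingStatus' \
        --output text
}

# Poll until the instance reports Online, or give up after N attempts.
# Arguments: instance ID, attempts (default 10), delay in seconds (default 15).
wait_for_ssm_online() {
    instance_id="$1"
    attempts="${2:-10}"
    delay="${3:-15}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        status="$(ssm_ping_status "$instance_id")"
        echo "attempt $i: PingStatus=$status"
        if [ "$status" = "Online" ]; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

For example, `wait_for_ssm_online i-1234567890abcdef0 10 15` waits up to roughly two and a half minutes before reporting failure.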

    Step 3: Validate IAM Instance Profile Permissions

    Ensure the IAM role in the instance profile has the `AmazonSSMManagedInstanceCore` managed policy attached, or an equivalent policy granting the same Systems Manager permissions.

    bash
    # Describe the IAM instance profile attached to the instance (from AWS CLI)
    aws ec2 describe-instances --instance-ids i-1234567890abcdef0 --query 'Reservations[0].Instances[0].IamInstanceProfile'
    # Check attached policies for the IAM role (replace RoleName)
    aws iam list-attached-role-policies --role-name MyEC2SSMRole
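
The two commands above can be chained so that the role is resolved from the instance profile rather than typed by hand, since the role name can differ from the profile name. The sketch below assumes the role relies on the AWS managed policy. A role that gets the same permissions from an inline or custom policy would be reported as missing it. All function names are hypothetical.

```bash
# Extract the instance profile name from its ARN
# (the final path component after the last "/").
profile_name_from_arn() {
    echo "${1##*/}"
}

# Resolve the role contained in an instance profile.
role_for_profile() {
    aws iam get-instance-profile \
        --instance-profile-name "$1" \
        --query 'InstanceProfile.Roles[0].RoleName' \
        --output text
}

# Succeed if the role has the AmazonSSMManagedInstanceCore managed policy attached.
role_has_ssm_core() {
    aws iam list-attached-role-policies \
        --role-name "$1" \
        --query 'AttachedPolicies[].PolicyName' \
        --output text | tr '\t' '\n' | grep -qx 'AmazonSSMManagedInstanceCore'
}
```

A typical sequence is to pass the `Arn` returned by `describe-instances` to `profile_name_from_arn`, feed the result to `role_for_profile`, and then test that role with `role_has_ssm_core`.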

    Step 4: Check Network and Security Configuration

    Verify that the instance's security group allows outbound HTTPS (443) traffic and that VPC endpoints (if used) are correctly configured.

    bash
    # Describe security groups for the instance
    aws ec2 describe-instances --instance-ids i-1234567890abcdef0 --query 'Reservations[0].Instances[0].SecurityGroups'
    # Check VPC Endpoint status (if using Interface endpoints)
    aws ec2 describe-vpc-endpoints --filters "Name=vpc-endpoint-type,Values=Interface" "Name=service-name,Values=com.amazonaws.us-east-1.ssm"
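
The endpoint query above covers only the `ssm` service. A small loop can repeat it for `ec2messages` and `ssmmessages` as well. This is a sketch under the assumption that each service has at most one endpoint per VPC. The function name and the `missing` label are illustrative.

```bash
# Report the state of the interface endpoint for each service the agent needs.
# Prints "<service-name> <state>", using "missing" when no endpoint is found.
# Returns non-zero unless all three endpoints are in the "available" state.
check_ssm_vpc_endpoints() {
    region="$1"
    vpc_id="$2"
    missing=0
    for service in ssm ec2messages ssmmessages; do
        name="com.amazonaws.${region}.${service}"
        # Lowercase the state so the comparison does not depend on output casing.
        state="$(aws ec2 describe-vpc-endpoints \
            --region "$region" \
            --filters "Name=vpc-id,Values=${vpc_id}" "Name=service-name,Values=${name}" \
            --query 'VpcEndpoints[0].State' \
            --output text | tr 'A-Z' 'a-z')"
        if [ -z "$state" ] || [ "$state" = "none" ]; then
            state="missing"
        fi
        if [ "$state" != "available" ]; then
            missing=1
        fi
        echo "$name $state"
    done
    return "$missing"
}
```

Call it with the region and the VPC ID of the affected instance, for example `check_ssm_vpc_endpoints us-east-1 vpc-0abc` (the VPC ID here is a placeholder).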

    Step 5: Reinstall the SSM Agent

    As a last resort, reinstall the latest version of the SSM Agent.

    bash
    # For Amazon Linux 2 / RHEL / CentOS on x86_64 (the URL below is the amd64 build)
    sudo yum remove -y amazon-ssm-agent
    sudo yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
    sudo systemctl start amazon-ssm-agent
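
The install URL above targets x86_64 instances. On Arm-based (Graviton) instances the package lives under `linux_arm64` instead. The helper below is a sketch for picking the path from the output of `uname -m`. The function name is hypothetical, and only these two architectures are handled.

```bash
# Map a machine architecture (as printed by `uname -m`) to the agent's RPM URL.
ssm_agent_rpm_url() {
    case "$1" in
        x86_64)  dir="linux_amd64" ;;
        aarch64) dir="linux_arm64" ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
    echo "https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/${dir}/amazon-ssm-agent.rpm"
}

# Usage on the instance:
#   sudo yum install -y "$(ssm_agent_rpm_url "$(uname -m)")"
#   sudo systemctl start amazon-ssm-agent
```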

    Architect's Pro Tip

    "This often happens after an instance is stopped/started or its IAM role is modified. The agent's internal registration can become stale. A restart (Step 2) usually clears the state."

    Frequently Asked Questions

    How can I prevent this alert in the future?

    Alert on SSM Agent connectivity itself rather than on generic instance status checks, for example by periodically checking the `PingStatus` value that `aws ssm describe-instance-information` reports for each managed instance. Also, ensure your EC2 launch templates/AMIs have the latest SSM Agent pre-installed and use an IAM role with the `AmazonSSMManagedInstanceCore` policy.
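
One way to watch agent connectivity directly is to ask Systems Manager which managed instances are not reporting `Online`. The sketch below could be run on a schedule and wired to whatever alerting you already use. The function name is an assumption, and instances that never registered with Systems Manager will not appear in the output at all.

```bash
# List managed instances whose agent is not currently reporting Online.
# Prints one "<instance-id> <ping-status>" pair per line.
list_unhealthy_ssm_instances() {
    aws ssm describe-instance-information \
        --query "InstanceInformationList[?PingStatus!='Online'].[InstanceId,PingStatus]" \
        --output text | awk 'NF == 2 { print $1, $2 }'
}
```

An empty result means every registered instance is currently reachable through the agent.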

    The instance is in a private subnet. What should I check?

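    Verify that interface VPC endpoints for SSM (`com.amazonaws.region.ssm`), EC2 Messages (`com.amazonaws.region.ec2messages`), and SSM Messages (`com.amazonaws.region.ssmmessages`) exist in the VPC, and that private DNS is enabled on them so the regional service hostnames resolve to the endpoints' private IP addresses. Interface endpoints are reached through DNS and their network interfaces, not through route table entries. The security group attached to the endpoints must allow inbound TCP 443 from the instance's security group.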
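
A quick check from the instance is whether the service hostname resolves to an address inside the VPC. The sketch below assumes the VPC uses RFC 1918 address space and that `getent ahostsv4` is available (it is glibc-specific). Both function names are illustrative.

```bash
# Succeed if the argument is an RFC 1918 private IPv4 address.
is_private_ipv4() {
    case "$1" in
        10.*|192.168.*) return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
        *) return 1 ;;
    esac
}

# Resolve a hostname and succeed only if its first IPv4 address is private,
# which is what an interface endpoint with private DNS should produce.
resolves_privately() {
    ip="$(getent ahostsv4 "$1" | awk 'NR == 1 { print $1 }')"
    [ -n "$ip" ] && is_private_ipv4 "$ip"
}
```

For example, `resolves_privately ssm.us-east-1.amazonaws.com` failing on an instance in a private subnet suggests the hostname is still resolving to the public endpoint.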
