CRITICAL

Root Cause Analysis: Why Alibaba Cloud ECS Disk Full Errors Happen

Quick Fix Summary

TL;DR

Run 'sudo du -sh /* 2>/dev/null | sort -rh | head -20' to identify the largest directories, then clean them up or expand the disk.

Alibaba Cloud ECS DiskFull errors occur when a filesystem runs out of free blocks or, less commonly, free inodes. Writes then fail with "No space left on device", which can crash applications. Treat this as a critical infrastructure failure and investigate what is consuming storage immediately.

Diagnosis & Causes

Unmanaged application log file growth

Temporary files not being cleaned automatically

Database transaction logs or binary logs filling up

Container or image storage accumulation

Backup files consuming primary disk space

Recovery Steps

Step 1: Immediate Disk Space Analysis

Identify which directories and files are consuming the most space using standard Linux utilities.

bash

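df -h                                               # block usage per mounted filesystem
df -i                                               # inode usage; 100% here also causes "No space left on device"
sudo du -sh /* 2>/dev/null | sort -rh | head -20    # largest top-level directories
sudo lsof +L1                                       # deleted files still held open by processes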
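Once the largest top-level directory is known, the same `du` pattern can be repeated one level at a time. The following is a minimal sketch, assuming GNU `du` and `sort`; `top_dirs` is a hypothetical helper name, not a standard utility.

```bash
#!/usr/bin/env bash
# top_dirs: hypothetical helper that lists the largest subdirectories one level
# below a directory. Sizes are in KiB; the first output line is the total for
# the directory itself.
top_dirs() {
  local dir="${1:-/}" count="${2:-10}"
  # -x stays on one filesystem, -k reports KiB, --max-depth=1 (GNU du) limits depth
  du -x -k --max-depth=1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example (run as root for directories the current user cannot read):
# top_dirs /var 10
```

Calling it repeatedly on the biggest entry (for example `/var`, then `/var/lib`) narrows the search to the directory that is actually growing.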

Step 2: Clean Common Temporary and Log Files

Safely remove temporary files, old logs, and package cache without breaking system functionality.

bash

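# Shrink the systemd journal to the last 3 days
sudo journalctl --vacuum-time=3d
# Remove only /tmp files untouched for 7+ days; 'rm -rf /tmp/*' can break running services
sudo find /tmp -xdev -type f -mtime +7 -delete
# Clear the package cache (apt-get on Debian/Ubuntu, otherwise yum)
sudo apt-get clean || sudo yum clean all
# Delete log files that have not been written to for more than 7 days
sudo find /var/log -type f -name "*.log" -mtime +7 -delete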
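Because `rm -rf` and `find ... -delete` are irreversible, it can help to preview what a cleanup would remove first. The sketch below assumes GNU `find`; `stale_files` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# stale_files: hypothetical helper that prints regular files under a directory
# whose contents have not been modified for more than N days (default 7).
stale_files() {
  local dir="$1" days="${2:-7}"
  # -xdev keeps find on one filesystem; -mtime +N matches files whose age,
  # counted in whole days, is greater than N
  find "$dir" -xdev -type f -mtime +"$days" -print
}

# Example: review the list before running any command that uses -delete
# stale_files /var/log 7
```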

Step 3: Investigate Application-Specific Storage

Check database logs, container storage, and application caches that often grow unexpectedly.

bash

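sudo du -sh /var/lib/docker/* 2>/dev/null   # Docker images, containers, and volumes
sudo du -sh /var/lib/mysql/* 2>/dev/null    # MySQL data files and binary logs
# List core dumps for review first; add -delete only once the matches look right
sudo find /home -type f -name "core" -size +1M -ls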
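Application storage problems are often caused by a handful of very large files (core dumps, binary logs, container layers). As a rough sketch, assuming GNU `find`, `du`, and `sort`, a hypothetical `large_files` helper could list them largest first:

```bash
#!/usr/bin/env bash
# large_files: hypothetical helper that lists regular files larger than a
# minimum size (default 100M), largest first, with human-readable sizes.
large_files() {
  local dir="$1" min="${2:-100M}"
  # -size +N uses find's size suffixes (k, M, G); du -h prints "size<TAB>path"
  find "$dir" -xdev -type f -size +"$min" -exec du -h {} + 2>/dev/null | sort -rh
}

# Example:
# large_files /var/lib/mysql 500M
```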

Step 4: Configure Alibaba Cloud Monitoring and Auto-Scaling

Set up CloudMonitor alerts and auto-scaling policies to prevent future disk full scenarios.

bash

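# Create CloudMonitor rule for disk usage.
# The comparison operator is quoted so the shell does not treat '>' as a redirect.
# Check parameter names against the CloudMonitor API reference for your CLI version.
aliyun cms PutGroupMetricRule \
  --RuleName disk_usage_alert \
  --Namespace acs_ecs_dashboard \
  --MetricName disk_utilization \
  --Dimensions '[{"instanceId":"YOUR_INSTANCE_ID"}]' \
  --Statistics Average \
  --ComparisonOperator '>=' \
  --Threshold 85 \
  --Period 60 \
  --EvaluationCount 2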
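A local check can complement the CloudMonitor rule, for example from a cron job on the instance itself. The sketch below reuses the 85% threshold from the rule above; `over_threshold` is a hypothetical helper name, and it reads `df -P` style output on stdin so it can be exercised with sample data.

```bash
#!/usr/bin/env bash
# over_threshold: hypothetical helper that prints "mountpoint usage%" for every
# filesystem at or above a usage threshold (default 85).
# Assumption: mount points contain no spaces, so field 6 is the full path.
over_threshold() {
  local limit="${1:-85}"
  awk -v limit="$limit" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)            # strip the trailing percent sign
      if (use + 0 >= limit) print $6 " " use "%"
    }'
}

# Example:
# df -P | over_threshold 85
```

Empty output means no filesystem has crossed the threshold, which makes the helper easy to wire into a mail or webhook notification.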

Step 5: Implement Preventive Log Rotation

Configure logrotate to automatically manage log file growth for system and application logs.

bash

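# Write the logrotate policy for the application (myapp is a placeholder name)
sudo tee /etc/logrotate.d/myapp >/dev/null <<'EOF'
/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 0644 root root
}
EOF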
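When several applications need the same policy, the stanza can be generated rather than typed by hand. This is only a sketch: `logrotate_stanza` is a hypothetical helper, and the directives simply mirror the example above.

```bash
#!/usr/bin/env bash
# logrotate_stanza: hypothetical helper that prints a logrotate stanza for a
# log file pattern, keeping N daily rotations (default 7).
logrotate_stanza() {
  local pattern="$1" keep="${2:-7}"
  cat <<EOF
$pattern {
    daily
    rotate $keep
    compress
    delaycompress
    missingok
    notifempty
    create 0644 root root
}
EOF
}

# Example:
# logrotate_stanza '/var/log/myapp/*.log' 14 | sudo tee /etc/logrotate.d/myapp >/dev/null
```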

Step 6: Expand Disk Capacity (If Needed)

Resize the system disk or add a data disk through Alibaba Cloud Console or API.

bash

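# Resize the cloud disk to 100 GiB via Alibaba Cloud CLI (replace the example disk ID)
aliyun ecs ResizeDisk \
  --DiskId d-1234567890 \
  --NewSize 100
# After the resize completes, grow partition 1 and then the filesystem
sudo growpart /dev/vda 1
sudo resize2fs /dev/vda1    # ext2/3/4 only; for XFS use: sudo xfs_growfs /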
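The filesystem-growing command depends on the filesystem type: `resize2fs` takes the device and works on ext2/3/4, while `xfs_growfs` takes the mount point. A small sketch of that choice follows; `grow_fs_command` is a hypothetical helper that only prints the command so it can be reviewed before running.

```bash
#!/usr/bin/env bash
# grow_fs_command: hypothetical helper that prints the command needed to grow
# a filesystem after its partition has been extended.
grow_fs_command() {
  local fstype="$1" device="$2" mountpoint="$3"
  case "$fstype" in
    ext2|ext3|ext4) echo "resize2fs $device" ;;
    xfs)            echo "xfs_growfs $mountpoint" ;;
    *)              echo "unsupported filesystem: $fstype" >&2; return 1 ;;
  esac
}

# Example (findmnt is part of util-linux):
# grow_fs_command "$(findmnt -n -o FSTYPE /)" "$(findmnt -n -o SOURCE /)" /
```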

Architect's Pro Tip

"Check for deleted files still held open by processes using 'lsof +L1'. A service restart may immediately free significant space without file deletion."

Frequently Asked Questions

Why does 'df' show 100% usage but 'du' shows less total space used?

This usually means deleted files are still held open by running processes, so their blocks are not released. Use 'sudo lsof +L1' to identify those processes, then restart them to reclaim the space.
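To estimate how much space such files hold, the SIZE/OFF column of the `lsof +L1` output can be summed. The sketch below assumes the default `lsof` column layout, where SIZE/OFF is the seventh field; `deleted_open_bytes` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# deleted_open_bytes: hypothetical helper that sums the SIZE/OFF column
# (field 7) of 'lsof +L1' output read from stdin and prints the total in bytes.
# Rows whose seventh field is not a plain number (for example 0t offsets)
# are skipped.
deleted_open_bytes() {
  awk 'NR > 1 && $7 ~ /^[0-9]+$/ { total += $7 } END { print total + 0 }'
}

# Example:
# sudo lsof +L1 | deleted_open_bytes
```

The same file can appear on several rows when more than one descriptor holds it open, so the total is an upper estimate.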

How can I prevent DiskFull errors in Kubernetes pods on ECS?

Set ephemeral-storage requests and limits on each container, set a sizeLimit on emptyDir volumes, or use Alibaba Cloud NAS for persistent storage.

What's the safest way to clean disk space without breaking production systems?

Always analyze with 'du' first, target application logs and temp directories, avoid removing system libraries, and test commands in staging before production.

Related Alibaba Cloud Guides

ACK-NodePool-ErrImagePull


Fix Alibaba Cloud ACK NodePool ErrImagePull After K8s Version Upgrade

Troubleshooting Hybrid Cloud Access: CDN 403 Forbidden Errors When Serving Content from On-Premises OSS Origin

How to Fix Alibaba Cloud InvalidAccessKeyId.NotFound