Root Cause Analysis: Why Alibaba Cloud ECS Disk Full Errors Happen

Quick Fix Summary

TL;DR

Run 'sudo du -sh /* 2>/dev/null | sort -rh | head -20' to identify largest directories, then clean or expand storage.

Alibaba Cloud ECS DiskFull errors occur when the filesystem reaches 100% capacity, blocking write operations and potentially crashing applications. This is a critical infrastructure failure requiring immediate investigation of storage consumption patterns.

Diagnosis & Causes

  • Unmanaged application log file growth
  • Temporary files not being cleaned automatically
  • Database transaction logs or binary logs filling up
  • Container or image storage accumulation
  • Backup files consuming primary disk space
Recovery Steps

    Step 1: Immediate Disk Space Analysis

    Identify which directories and files are consuming the most space using standard Linux utilities.

    bash
    df -h
    sudo du -sh /* 2>/dev/null | sort -rh | head -20
    sudo lsof +L1
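Directory totals do not always show which individual files are responsible. As a minimal sketch, the helper below lists files above a size threshold, largest first. The function name `big_files` and its arguments are hypothetical, not part of any Alibaba Cloud tooling.

```shell
#!/bin/sh
# List regular files larger than MIN_KIB kibibytes under DIR, biggest first.
# Usage: big_files DIR MIN_KIB
big_files() {
    dir="$1"
    min_kib="$2"
    # -xdev keeps find on a single filesystem, so other mounts are not counted.
    # du -k prints "size-in-KiB<TAB>path" for each match.
    find "$dir" -xdev -type f -size +"${min_kib}"k -exec du -k {} + 2>/dev/null |
        sort -rn
}
```

For example, `sudo sh -c '. ./big_files.sh; big_files / 102400 | head -20'` would show the 20 largest files over roughly 100 MiB on the root filesystem, assuming the function is saved in a file named `big_files.sh`.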

    Step 2: Clean Common Temporary and Log Files

    Safely remove temporary files, old logs, and package cache without breaking system functionality.

    bash
    sudo journalctl --vacuum-time=3d
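    # Remove only stale files from /tmp; wiping /tmp/* can break running services
    sudo find /tmp -xdev -type f -mtime +7 -delete
    # Clear whichever package cache this distribution uses
    if command -v apt-get >/dev/null 2>&1; then sudo apt-get clean; else sudo yum clean all; fi
    sudo find /var/log -type f -name "*.log" -mtime +7 -delete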
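Because `find ... -delete` is irreversible, it can help to preview what a cleanup would remove first. The sketch below prints the matching files and their combined size without deleting anything. The name `preview_old_logs` is a hypothetical helper, and the `*.log` pattern is taken from the command above.

```shell
#!/bin/sh
# Print the *.log files older than DAYS days under DIR, plus their total size
# in KiB. Nothing is deleted.
# Usage: preview_old_logs DIR DAYS
preview_old_logs() {
    dir="$1"
    days="$2"
    find "$dir" -xdev -type f -name '*.log' -mtime +"$days" -exec du -k {} + 2>/dev/null |
        awk '{ total += $1; print } END { printf "total_kib=%d\n", total }'
}
```

Running `preview_old_logs /var/log 7` (with sudo if needed) shows the candidates; once the list looks right, the same `find` expression can be rerun with `-delete`.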

    Step 3: Investigate Application-Specific Storage

    Check database logs, container storage, and application caches that often grow unexpectedly.

    bash
    sudo du -sh /var/lib/docker/* 2>/dev/null
    sudo du -sh /var/lib/mysql/* 2>/dev/null
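    # List core dumps first; delete only after confirming they are not needed
    sudo find /home -name "core" -type f -ls
    sudo find /home -name "core" -type f -delete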
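For container hosts, Docker can report its own usage by category, which is often easier to read than raw `du` output on `/var/lib/docker`. The wrapper below is a small sketch that assumes the standard `docker` CLI. The function name `docker_disk_report` is hypothetical.

```shell
#!/bin/sh
# Summarise Docker disk usage if the docker CLI is installed.
docker_disk_report() {
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker not installed; skipping"
        return 0
    fi
    # Shows space used by images, containers, local volumes and build cache.
    docker system df || echo "docker daemon not reachable"
}
```

If the report shows a large reclaimable amount, `docker system prune` removes stopped containers, unused networks, dangling images and build cache. Review what it will delete before confirming on a production host.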

    Step 4: Configure Alibaba Cloud Monitoring and Auto-Scaling

    Set up CloudMonitor alerts and auto-scaling policies to prevent future disk full scenarios.

    bash
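    # Create CloudMonitor rule for disk usage (YOUR_INSTANCE_ID is a placeholder;
    # verify parameter names against the current CloudMonitor API reference)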
    aliyun cms PutGroupMetricRule \
      --RuleName disk_usage_alert \
      --Namespace acs_ecs_dashboard \
      --MetricName disk_utilization \
      --Dimensions '[{"instanceId":"YOUR_INSTANCE_ID"}]' \
      --Statistics Average \
      --ComparisonOperator '>=' \
      --Threshold 85 \
      --Period 60 \
      --EvaluationCount 2
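A local check can serve as a fallback when the CloudMonitor agent is not reporting. The sketch below reads `df -P` output and prints every mount point at or above a percentage limit. The function name `over_threshold` is hypothetical, and the 85 in the usage line mirrors the threshold used in the alert rule above.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print mount points whose usage is at or
# above LIMIT percent. Mount points containing spaces are not handled.
# Usage: df -P | over_threshold LIMIT
over_threshold() {
    limit="$1"
    awk -v limit="$limit" 'NR > 1 {
        use = $5                 # Capacity column, e.g. "95%"
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}
```

A cron entry could run `df -P | over_threshold 85` and send any non-empty output to whatever notification channel is already in place.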

    Step 5: Implement Preventive Log Rotation

    Configure logrotate to automatically manage log file growth for system and application logs.

    bash
    sudo nano /etc/logrotate.d/myapp
    # Paste the following stanza into /etc/logrotate.d/myapp:
    /var/log/myapp/*.log {
        daily
        rotate 7
        compress
        delaycompress
        missingok
        notifempty
        create 644 root root
    }
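A new logrotate stanza can be checked before its first scheduled run. The sketch below wraps logrotate's debug mode (`-d`), which prints what would be rotated and makes no changes. The function name `dry_run_logrotate` is hypothetical, and the `myapp` path in the usage line is the example path from this step.

```shell
#!/bin/sh
# Dry-run a logrotate config file: -d turns on debug mode, so no logs and no
# state file are modified.
# Usage: dry_run_logrotate CONFIG_FILE
dry_run_logrotate() {
    conf="$1"
    if ! command -v logrotate >/dev/null 2>&1; then
        echo "logrotate not installed" >&2
        return 127
    fi
    logrotate -d "$conf"
}
```

Running `dry_run_logrotate /etc/logrotate.d/myapp` as root should report syntax errors and list the files it would rotate. To force a real rotation once the output looks right, `sudo logrotate -f /etc/logrotate.d/myapp` can be used.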

    Step 6: Expand Disk Capacity (If Needed)

    Resize the system disk or add a data disk through Alibaba Cloud Console or API.

    bash
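    # Resize the disk via Alibaba Cloud CLI (d-1234567890 is a placeholder; NewSize is in GiB)
    aliyun ecs ResizeDisk \
      --DiskId d-1234567890 \
      --NewSize 100
    # After the resize completes, grow the partition and then the filesystem
    sudo growpart /dev/vda 1
    # resize2fs handles ext2/3/4 only; on XFS use: sudo xfs_growfs /
    sudo resize2fs /dev/vda1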
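The filesystem-grow command depends on the filesystem type: `resize2fs` takes the partition device, while `xfs_growfs` takes the mount point. The sketch below maps a type to the matching command. The function name `grow_cmd` is hypothetical, and `/dev/vda1` mounted at `/` is assumed from the example above.

```shell
#!/bin/sh
# Print the filesystem-grow command that matches a filesystem type.
# Usage: grow_cmd FSTYPE PARTITION MOUNTPOINT
grow_cmd() {
    case "$1" in
        ext2|ext3|ext4) echo "resize2fs $2" ;;
        xfs)            echo "xfs_growfs $3" ;;
        *)              echo "unsupported filesystem: $1" >&2; return 1 ;;
    esac
}
```

On a live instance, `grow_cmd "$(findmnt -n -o FSTYPE /)" /dev/vda1 /` prints the command to run with sudo after `growpart` has extended the partition.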

    Architect's Pro Tip

    "Check for deleted files still held open by processes using 'lsof +L1'. A service restart may immediately free significant space without file deletion."

    Frequently Asked Questions

    Why does 'df' show 100% usage but 'du' shows less total space used?

    This usually indicates that deleted files are still held open by running processes; their blocks are not freed until the last open file descriptor is closed. Use 'sudo lsof +L1' to identify these processes and restart them to reclaim space.
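Where `lsof` is not installed, the same information is available from `/proc` on Linux. The sketch below lists the file descriptors of one process that still point at deleted files. The function name `deleted_fds` is hypothetical, and inspecting another user's process requires root.

```shell
#!/bin/sh
# List file descriptors of process PID that point at deleted files.
# Linux marks such links in /proc/PID/fd with a " (deleted)" suffix.
# Usage: deleted_fds PID
deleted_fds() {
    pid="$1"
    for fd in /proc/"$pid"/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *" (deleted)") echo "${fd##*/} $target" ;;
        esac
    done
}
```

Each output line is a descriptor number followed by the original path. Restarting or reloading the owning service closes the descriptor and releases the space.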

    How can I prevent DiskFull errors in Kubernetes pods on ECS?

    Set ephemeral-storage requests and limits on each container, cap emptyDir volumes with sizeLimit, or move persistent data to Alibaba Cloud NAS.
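As an illustration of those settings, the sketch below emits a Pod manifest from a shell function so it can be piped to `kubectl`. The pod name, image, mount path and sizes are hypothetical placeholders. The `ephemeral-storage` and `sizeLimit` fields are standard Kubernetes fields.

```shell
#!/bin/sh
# Emit a Pod manifest with ephemeral-storage requests/limits and a capped
# emptyDir volume. Names, image and sizes are placeholders.
pod_manifest() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: app
      image: nginx:stable
      resources:
        requests:
          ephemeral-storage: "1Gi"
        limits:
          ephemeral-storage: "2Gi"
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 500Mi
EOF
}
```

Applying it with `pod_manifest | kubectl apply -f -` creates a pod that the kubelet evicts if it exceeds the ephemeral-storage limit or the emptyDir cap, which keeps one workload from filling the node's disk.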

    What's the safest way to clean disk space without breaking production systems?

    Always analyze with 'du' first, target application logs and temp directories, avoid removing system libraries, and test commands in staging before production.
