CRITICAL

Root Cause Analysis: Why Redis Cluster Fails Over in High Throughput Scenarios

Quick Fix Summary

TL;DR

Increase `cluster-node-timeout` and `repl-timeout` values, then restart the failing master node.

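MISCONF and NOREPLICAS are the client-visible symptoms. MISCONF is returned when the last background save failed and `stop-writes-on-bgsave-error` is enabled; NOREPLICAS is returned when fewer than `min-replicas-to-write` replicas are reachable. Under high write throughput, slow persistence can stall a master long enough that it misses the cluster's timeout window, which triggers a failover. This is often a race between high write throughput, disk I/O latency, and strict cluster timeout configurations.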

Diagnosis & Causes

  • Disk I/O saturation delaying RDB/AOF persistence.
  • Insufficient `cluster-node-timeout` for high write loads.
  • Network latency between master and replicas exceeding `repl-timeout`.
  • Background save (bgsave) blocking the main thread for too long.
  • Memory pressure causing fork() delays during persistence.

Recovery Steps

    Step 1: Diagnose the Failing Node's State

    First, connect to the suspected failing master node and check its logs, memory, and persistence status to confirm the root cause.

    bash
    redis-cli -h <failing-node-ip> -p <port> INFO persistence
    redis-cli -h <failing-node-ip> -p <port> INFO memory
    tail -100 /var/log/redis/redis-server.log | grep -iE "(MISCONF|NOREPLICAS|background saving|fork)"
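
The INFO output above is long, so it can help to pull out only the fields that matter for this diagnosis. The helper below is a minimal sketch: `info_field` is a hypothetical name, and it assumes the usual `key:value` line format with CRLF line endings that `redis-cli INFO` prints. It only defines a function and does not contact any server by itself.

```shell
#!/usr/bin/env bash
# Print the value of one field from `redis-cli INFO` output read on stdin.
# INFO lines look like "key:value" and end with CRLF, so strip the CR first.
# Note: only the text up to the next ':' is printed, which is enough for the
# simple status and counter fields used here.
info_field() {
  local key="$1"
  tr -d '\r' | awk -F: -v k="$key" '$1 == k { print $2 }'
}

# Possible usage against a live node (placeholders as in the step above):
#   redis-cli -h <failing-node-ip> -p <port> INFO persistence | info_field rdb_last_bgsave_status
#   redis-cli -h <failing-node-ip> -p <port> INFO persistence | info_field aof_last_write_status
```

A value other than `ok` for either status field suggests that persistence, rather than the network, is the place to look first.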

    Step 2: Adjust Critical Timeout Configurations

    Increase timeout parameters in the redis.conf file on all cluster nodes to accommodate slower persistence operations under load. This is the primary fix.

    bash
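    # On each Redis node, edit /etc/redis/redis.conf and set:
    # cluster-node-timeout is in milliseconds (30000 = 30 s)
    cluster-node-timeout 30000
    # repl-timeout is in seconds
    repl-timeout 120
    repl-backlog-size 256mb
    # Restart Redis after changes, one node at a time
    sudo systemctl restart redis-server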
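
If restarting a busy master is undesirable, the same three values can usually be applied at runtime with `CONFIG SET` and then persisted with `CONFIG REWRITE`. The sketch below assumes your Redis version accepts these parameters through `CONFIG SET` and that the server was started from a config file (otherwise `CONFIG REWRITE` fails). `apply_timeouts` and the `REDIS_CLI` override variable are hypothetical names introduced for illustration; `REDIS_CLI` is not a redis-cli feature, it only allows a dry run.

```shell
#!/usr/bin/env bash
# Apply the timeout values from the step above to one node at runtime.
# REDIS_CLI is an illustrative hook: set REDIS_CLI=echo to print the commands
# instead of sending them to a server.
apply_timeouts() {
  local host="$1" port="$2"
  local cli="${REDIS_CLI:-redis-cli}"
  # The chain stops at the first command that exits non-zero.
  "$cli" -h "$host" -p "$port" CONFIG SET cluster-node-timeout 30000 &&
  "$cli" -h "$host" -p "$port" CONFIG SET repl-timeout 120 &&
  "$cli" -h "$host" -p "$port" CONFIG SET repl-backlog-size 256mb &&
  "$cli" -h "$host" -p "$port" CONFIG REWRITE
}

# Dry run:   REDIS_CLI=echo apply_timeouts 10.0.0.1 6379
# Real run:  apply_timeouts <node-ip> <port>   (repeat for every cluster node)
```

All nodes should end up with the same `cluster-node-timeout`, so run it against every node rather than only the failing master.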

    Step 3: Optimize Persistence for High Throughput

    Tune RDB and AOF settings to reduce the performance impact of persistence, which is often the blocking operation.

    bash
    # In redis.conf
    save 900 1
    save 300 100
    save 60 10000
    appendfsync everysec
    no-appendfsync-on-rewrite yes
    auto-aof-rewrite-percentage 80
    auto-aof-rewrite-min-size 256mb
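
To see what the two `auto-aof-rewrite-*` settings mean in practice, it helps to work out the file size at which a rewrite (and its fork) would start. The function below is a rough sketch of that arithmetic, assuming the rewrite is triggered once the AOF both exceeds the minimum size and has grown by the configured percentage over its size after the previous rewrite. `aof_rewrite_trigger_bytes` is a hypothetical helper name, and the result is an approximation of the threshold, not an exact reproduction of the server's check.

```shell
#!/usr/bin/env bash
# Approximate AOF size (bytes) at which an automatic rewrite would start.
#   $1 = AOF size after the last rewrite, in bytes
#   $2 = auto-aof-rewrite-percentage
#   $3 = auto-aof-rewrite-min-size, in bytes
aof_rewrite_trigger_bytes() {
  local base="$1" pct="$2" min="$3"
  local by_growth=$(( base * (100 + pct) / 100 ))
  # Both conditions must hold, so the larger of the two thresholds wins.
  if (( by_growth > min )); then
    echo "$by_growth"
  else
    echo "$min"
  fi
}

# 512 MiB base, 80 percent growth, 256 MiB minimum:
#   aof_rewrite_trigger_bytes 536870912 80 268435456   # prints 966367641
```

With a small base file the minimum size dominates; with a large one the percentage does, which is why the percentage largely determines how often a busy node forks for a rewrite.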

    Step 4: Monitor and Alert on Key Metrics

    Implement monitoring to catch timeout risks before they cause a failover. Track persistence duration and replica lag.

    bash
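    # Example Prometheus queries via Redis Exporter (metric names vary by exporter version)
    # Seconds since the last successful RDB save
    time() - redis_rdb_last_save_timestamp_seconds
    # Duration of the last background save, flagged above 30 s
    redis_rdb_last_bgsave_duration_sec > 30
    # Replica lag as a replication-offset difference (bytes, not seconds)
    redis_master_repl_offset - redis_slave_repl_offset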
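
Without a metrics stack, the same replica lag can be read directly from a master's `INFO replication` output. The sketch below assumes the usual line shapes (`master_repl_offset:<n>` and `slaveN:ip=...,port=...,state=...,offset=<n>,lag=<n>`) and IPv4 addresses or hostnames without colons. `replica_lag_bytes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Read `INFO replication` output from a master on stdin and print one line per
# attached replica: "<ip>:<port> <bytes behind the master>".
replica_lag_bytes() {
  tr -d '\r' | awk -F: '
    $1 == "master_repl_offset" { master = $2 }
    $1 ~ /^slave[0-9]+$/       { replicas[$1] = $2 }
    END {
      for (name in replicas) {
        n = split(replicas[name], kv, ",")
        ip = ""; port = ""; offset = 0
        for (i = 1; i <= n; i++) {
          split(kv[i], pair, "=")
          if (pair[1] == "ip")     ip = pair[2]
          if (pair[1] == "port")   port = pair[2]
          if (pair[1] == "offset") offset = pair[2]
        }
        # %.0f keeps large offsets intact across awk implementations.
        printf "%s:%s %.0f\n", ip, port, master - offset
      }
    }'
}

# Possible usage:
#   redis-cli -h <master-ip> -p <port> INFO replication | replica_lag_bytes
```

A lag that keeps growing while a background save is running is a sign that the replica may fall outside the replication backlog and need a full resync.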

    Step 5: Validate Cluster Health Post-Change

    After applying configuration changes, verify the cluster state is stable and all nodes are properly synchronized.

    bash
    redis-cli --cluster check <any-cluster-node-ip>:<port>
    redis-cli -h <node-ip> -p <port> CLUSTER NODES | grep -v "fail"
    # Check for consistent `connected` state and correct role assignments
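
For scripted validation (for example, as a gate after each node restart), the check can be reduced to an exit status. This is a minimal sketch that reads `CLUSTER INFO` output and assumes the standard `cluster_state`, `cluster_slots_pfail`, and `cluster_slots_fail` fields; `cluster_is_healthy` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Read `CLUSTER INFO` output on stdin. Exit status 0 only if the cluster
# reports state "ok" and no slots are in pfail or fail state.
cluster_is_healthy() {
  tr -d '\r' | awk -F: '
    $1 == "cluster_state"       { state = $2 }
    $1 == "cluster_slots_pfail" { pfail = $2 }
    $1 == "cluster_slots_fail"  { fail  = $2 }
    END { exit !(state == "ok" && pfail == 0 && fail == 0) }'
}

# Possible usage:
#   redis-cli -h <node-ip> -p <port> CLUSTER INFO | cluster_is_healthy \
#     && echo "cluster healthy" || echo "cluster needs attention"
```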

    Architect's Pro Tip

    "The race condition often happens during `bgsave`. Monitor `latest_fork_usec` in INFO stats. It is reported in microseconds, so if it is consistently above 1,000,000 (1 second), your `cluster-node-timeout` is likely too low for your hardware."

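The fork-time check from the tip above can be scripted. The sketch below reads `INFO stats` output, converts `latest_fork_usec` from microseconds to milliseconds, and signals through its exit status whether the value is over a threshold. `fork_time_check` is a hypothetical helper name, and the 1000 ms default simply mirrors the figure in the tip.

```shell
#!/usr/bin/env bash
# Read `INFO stats` output on stdin and print the last fork duration in ms.
# Exit status: 0 if at or below the threshold, 1 if above, 2 if the field
# is missing. $1 = threshold in milliseconds (default 1000).
fork_time_check() {
  local threshold_ms="${1:-1000}"
  local usec
  usec="$(tr -d '\r' | awk -F: '$1 == "latest_fork_usec" { print $2 }')"
  if [ -z "$usec" ]; then
    echo "latest_fork_usec not found" >&2
    return 2
  fi
  local ms=$(( usec / 1000 ))
  echo "$ms"
  (( ms <= threshold_ms ))
}

# Possible usage:
#   redis-cli -h <node-ip> -p <port> INFO stats | fork_time_check 1000
```

Because `latest_fork_usec` only reflects the most recent fork, sampling it after each background save gives a better picture than a single reading.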
    Frequently Asked Questions

    Is it safe to increase `cluster-node-timeout` to a very high value?

    No. Excessively high values (e.g., > 60s) can make the cluster slow to react to genuine node failures. The goal is to find a balance that accommodates persistence spikes (e.g., 20-30s) while maintaining reasonable failover times.

    Can this happen with AOF-only persistence?

    Yes. While AOF with `appendfsync everysec` blocks less than an RDB `bgsave`, a slow disk can still delay fsync and stall writes. An AOF rewrite (`BGREWRITEAOF`, or the automatic rewrite) also uses fork(), which can trigger the same timeout issue under memory pressure.

    Should I disable persistence to prevent this?

    Never in production. Disabling persistence trades data durability for availability. The correct solution is to tune timeouts and optimize disk I/O (e.g., using SSDs, separate volumes) to keep persistence within the timeout window.
