Debugging Intermittent 504s: RBAC RoleBinding Mismatch Causing ServiceAccount AuthZ Timeouts

Quick Fix Summary

TL;DR

Check and fix RoleBinding namespace references to match the ServiceAccount's namespace.

Intermittent 504s occur when a ServiceAccount's token is used for API requests but the RoleBinding meant to authorize it sits in a different namespace, or names the ServiceAccount under the wrong namespace, so the Kubernetes API server times out during authorization checks.

Diagnosis & Causes

  • The `namespace` on the RoleBinding's ServiceAccount subject is set incorrectly when the binding references a ClusterRole.
  • ServiceAccount and RoleBinding exist in different namespaces.
Recovery Steps

    Step 1: Verify the Mismatch

    Identify the problematic ServiceAccount and its associated RoleBindings. Look for bindings that reference ClusterRoles without specifying the correct namespace for the subject.

    bash
    # Get all RoleBindings and examine their subjects and roleRef
    kubectl get rolebindings,clusterrolebindings -A -o yaml | grep -A 5 -B 5 "<your-serviceaccount-name>"
    # Inspect the ServiceAccount itself (describe does not list the roles bound to it)
    kubectl describe serviceaccount <sa-name> -n <namespace>
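
The grep above shows context but leaves the namespace comparison to the reader. A minimal sketch of automating that comparison, assuming the bindings are first dumped as a four-column table (binding namespace, binding name, ServiceAccount subject names, ServiceAccount subject namespaces). `list_sa_bindings` and `flag_mismatches` are hypothetical helper names, not kubectl features.

```shell
# Dump every RoleBinding as: NAMESPACE NAME SA_NAMES SA_NAMESPACES.
# Needs cluster access; defined here but not called automatically.
list_sa_bindings() {
  kubectl get rolebindings -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,SA_NAMES:.subjects[?(@.kind=="ServiceAccount")].name,SA_NAMESPACES:.subjects[?(@.kind=="ServiceAccount")].namespace'
}

# Read that table on stdin and report bindings that name the given
# ServiceAccount under a namespace other than the expected one.
#   $1 = ServiceAccount name, $2 = namespace the ServiceAccount lives in
flag_mismatches() {
  awk -v sa="$1" -v ns="$2" '
    NR == 1 { next }                       # skip the header row
    {
      n = split($3, names, ",")            # several subjects are comma-separated
      split($4, spaces, ",")
      for (i = 1; i <= n; i++)
        if (names[i] == sa && spaces[i] != ns)
          print $1 "/" $2 ": subject namespace is " spaces[i] ", expected " ns
    }'
}
```

Typical use would be `list_sa_bindings | flag_mismatches <sa-name> <namespace>`; an empty result means no binding names the ServiceAccount under a different namespace.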

    Step 2: Inspect Specific RoleBinding Configuration

    Examine the YAML of RoleBindings in the application's namespace. The critical issue is a RoleBinding that binds a ClusterRole to a namespaced ServiceAccount but carries an incorrect `namespace` on the ServiceAccount entry in `subjects`. Note that `roleRef` itself has no `namespace` field; it holds only `apiGroup`, `kind`, and `name`.

    bash
    kubectl get rolebinding <binding-name> -n <app-namespace> -o yaml

    Step 3: Correct the RoleBinding

    Update the RoleBinding so it references the ClusterRole correctly. In a namespaced RoleBinding, `roleRef` points to the ClusterRole by name only; the binding's own namespace determines where the permissions apply, and the ServiceAccount's namespace is stated explicitly in `subjects`.

    bash
    # Correct the RoleBinding's `subjects` so each ServiceAccount entry includes the right namespace (`roleRef` cannot be edited in place).
    kubectl edit rolebinding <binding-name> -n <app-namespace>
    # Example correct snippet within the YAML:
    # roleRef:
    #   apiGroup: rbac.authorization.k8s.io
    #   kind: ClusterRole
    #   name: my-cluster-role
    # subjects:
    # - kind: ServiceAccount
    #   name: my-service-account
    #   namespace: <app-namespace>
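
When the binding has to be recreated instead of edited, generating the manifest from parameters keeps the binding namespace and the subject namespace in sync. A sketch, assuming a hypothetical helper `render_rolebinding` and placeholder names; it only prints YAML and does not touch the cluster.

```shell
# Print a RoleBinding manifest that binds a ClusterRole to a ServiceAccount.
#   $1 = binding name, $2 = namespace, $3 = ClusterRole name, $4 = ServiceAccount name
# The same namespace is used for the binding and for the subject.
render_rolebinding() {
  binding="$1"; ns="$2"; cluster_role="$3"; sa="$4"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${binding}
  namespace: ${ns}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ${cluster_role}
subjects:
- kind: ServiceAccount
  name: ${sa}
  namespace: ${ns}
EOF
}
```

The output can be piped to `kubectl apply -f -`. An existing binding with a different `roleRef` must be deleted first, since the API does not allow changing `roleRef` on an existing binding.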

    Step 4: Check for Overly Broad ClusterRoleBindings

    A ClusterRoleBinding granting permissions cluster-wide can cause unexpected behavior but is not the direct cause of a timeout. Verify if a more restrictive, namespaced RoleBinding is needed instead.

    bash
    kubectl get clusterrolebinding -o yaml | grep -A 10 -B 5 "<your-serviceaccount-name>"

    Step 5: Review kube-apiserver Logs

    Search for authorization timeout or denial messages related to the ServiceAccount. This confirms the AuthZ path is the bottleneck.

    bash
    # On the control plane node(s), if kube-apiserver runs as a systemd unit
    sudo journalctl -u kube-apiserver --since "5 minutes ago" | grep -i "timeout\|forbidden\|<serviceaccount-uuid>"
    # Or from the pod logs if using a pod-based API server
    kubectl logs -n kube-system kube-apiserver-<node-name> --since=5m | grep -i "authorization"
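
Where the API server logs the requesting identity, a ServiceAccount appears under its username, `system:serviceaccount:<namespace>:<sa-name>`, which is often easier to search for than a UID. A sketch of a stdin filter, with `filter_sa_log_lines` as a hypothetical helper name; it assumes nothing about the log format beyond the username and the word "timeout" or "forbidden" appearing on the same line.

```shell
# Keep only log lines that mention the ServiceAccount's username and
# also contain "timeout" or "forbidden" (case-insensitive).
#   $1 = namespace, $2 = ServiceAccount name; log text is read from stdin
# The username is matched as a fixed substring, so "my-sa" also matches "my-sa-2".
filter_sa_log_lines() {
  user="system:serviceaccount:$1:$2"
  grep -F "$user" | grep -i -E 'timeout|forbidden'
}
```

Either log command above can be piped into it, for example `kubectl logs -n kube-system kube-apiserver-<node-name> --since=5m | filter_sa_log_lines <namespace> <sa-name>`.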

    Step 6: Validate the Fix

    Impersonate the ServiceAccount and retry an API call that previously failed, to verify that permissions are now granted correctly and without delay.

    bash
    kubectl auth can-i get pods --as=system:serviceaccount:<namespace>:<sa-name> -n <namespace>
    # Simulate an actual call with impersonation and timeout flags
    kubectl get pods -n <namespace> --as=system:serviceaccount:<namespace>:<sa-name> --request-timeout=5s
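
Since the failure is intermittent, a single successful call proves little. A sketch that repeats the impersonated request and counts failures; `sa_user` and `count_failures` are hypothetical helpers, and the `KUBECTL` variable is an assumption added here so the command can be swapped out.

```shell
# Command to run; override KUBECTL to point at a wrapper or a different binary.
KUBECTL="${KUBECTL:-kubectl}"

# Build the impersonation username for a ServiceAccount.
#   $1 = namespace, $2 = ServiceAccount name
sa_user() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# Repeat an impersonated "get pods" and print how many attempts failed
# (non-zero exit, including a request timeout).
#   $1 = namespace, $2 = ServiceAccount name, $3 = number of attempts
count_failures() {
  ns="$1"; sa="$2"; attempts="$3"
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if ! "$KUBECTL" get pods -n "$ns" --as="$(sa_user "$ns" "$sa")" \
         --request-timeout=5s >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  printf '%s\n' "$failures"
}
```

For example, `count_failures <namespace> <sa-name> 20` printing `0` suggests the binding now authorizes the request consistently; any other number means some attempts were still denied or timed out.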

    Architect's Pro Tip

    "This often happens during Helm chart deployments where the `.Release.Namespace` variable is misused in the RoleBinding's `roleRef` or `subjects` block, or when copying RoleBinding manifests between environments without updating namespace references."

    Frequently Asked Questions

    Why are the 504s intermittent and not constant?

    The likely cause is the Kubernetes API server's authorization webhook cache. A denied request may be cached briefly, so subsequent requests hit the cache and fail fast. When the cache entry expires, the next request triggers a full, slow authorization check against the misconfigured RBAC rule, which can time out.

    What's the difference between a RoleBinding and a ClusterRoleBinding in this context?

    A RoleBinding grants permissions within its own namespace, even when it references a ClusterRole. A ClusterRoleBinding grants permissions cluster-wide. The bug occurs when the RoleBinding's namespace, or the namespace named in its `subjects` entry, does not match the namespace the ServiceAccount actually lives in, so the binding never matches the requests that ServiceAccount makes.
