GCP IAM & VPC: Fix 403 Permission Denied for VPC Flow Logs due to Compute Engine OOM
Quick Fix Summary
TL;DR: Restart the Compute Engine instance and verify IAM permissions for the service account.
A 403 error for VPC Flow Logs can occur when a Compute Engine instance runs out of memory (OOM): the kernel's OOM killer can terminate the guest or logging agent, or leave the instance unable to reach the metadata server, which disrupts IAM token acquisition for the logging service account.
Diagnosis & Causes
Recovery Steps
Step 1: Verify Instance Health and Logs
Check the instance's system logs for OOM killer messages and confirm the metadata server is reachable.
gcloud compute instances get-serial-port-output INSTANCE_NAME --zone ZONE
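Serial console output can run to thousands of lines, so filtering it for OOM killer activity helps. The following is a minimal sketch rather than a definitive tool: `count_oom_kills` and `list_oom_victims` are hypothetical helper names, and the pattern assumes the usual Linux kernel wording ("Out of memory: Killed process ..."), which can differ between kernel versions.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of gcloud); both read console text from stdin.

# Print the number of lines reporting that the kernel OOM killer ended a process.
count_oom_kills() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the status clean.
  grep -c -i 'out of memory: kill' || true
}

# Print "PID NAME" for each process the OOM killer ended.
list_oom_victims() {
  grep -i -o -E 'out of memory: kill(ed)? process [0-9]+ \([^)]*\)' |
    sed -E 's/.*process ([0-9]+) \(([^)]*)\)/\1 \2/'
}

# Usage (same placeholders as the command above):
#   gcloud compute instances get-serial-port-output INSTANCE_NAME --zone ZONE 2>/dev/null | list_oom_victims
```

If an agent process shows up among the victims, that supports the OOM explanation; an empty list suggests looking at IAM bindings first.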
gcloud logging read 'resource.type="gce_instance" resource.labels.instance_id="INSTANCE_ID"' --limit=10 --format="json"
Step 2: Check IAM Permissions for Service Account
Verify the service account attached to the instance has the `roles/logging.logWriter` role.
gcloud projects get-iam-policy PROJECT_ID --flatten="bindings[].members" --format="table(bindings.role)" --filter="bindings.members:SERVICE_ACCOUNT_EMAIL"
Step 3: Restart the Compute Engine Instance
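Reset the instance as an immediate recovery step to clear the OOM state and restore the agent and its access to the metadata server. This is a hard reset with no graceful shutdown, so any unsaved in-memory state is lost.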
gcloud compute instances reset INSTANCE_NAME --zone ZONE
Step 4: Grant Required IAM Role
If permissions are missing, grant the Logs Writer role to the instance's service account.
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" --role="roles/logging.logWriter"
Step 5: Resize Instance or Optimize Workload
Prevent recurrence by increasing machine memory or optimizing application memory usage.
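Before choosing a larger machine type, it can help to measure how close the instance actually runs to its memory limit. The sketch below is one way to take a single reading on the instance itself. It assumes a Linux guest whose `/proc/meminfo` includes the `MemAvailable` field; `mem_used_percent` and `check_memory` are hypothetical helper names, and the default of 85 simply mirrors the alert threshold suggested in the Pro Tip.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a one-off memory reading on a Linux guest.

# Print used memory as a whole-number percentage (truncated, not rounded).
# $1: meminfo-style file to read (defaults to /proc/meminfo).
mem_used_percent() {
  awk '/^MemTotal:/     { t = $2 }
       /^MemAvailable:/ { a = $2 }
       END {
         # Fail rather than guess if either field is missing.
         if (t > 0 && a != "") printf "%d\n", (t - a) * 100 / t
         else exit 1
       }' "${1:-/proc/meminfo}"
}

# Compare usage against a threshold.
# $1: threshold percent (default 85), $2: meminfo-style file (default /proc/meminfo).
# Returns 0 when at or below the threshold, 1 when above, 2 if usage is unreadable.
check_memory() {
  local threshold="${1:-85}" file="${2:-/proc/meminfo}" used
  used=$(mem_used_percent "$file") || return 2
  if [ "$used" -gt "$threshold" ]; then
    echo "WARN: memory usage ${used}% exceeds ${threshold}%"
    return 1
  fi
  echo "OK: memory usage ${used}%"
}
```

A single sample says nothing about sustained usage, so a reading like this is best repeated over time (for example from cron) before deciding how much memory to add.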
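The machine type can only be changed while the instance is stopped, so run `gcloud compute instances stop INSTANCE_NAME --zone ZONE` first and `gcloud compute instances start INSTANCE_NAME --zone ZONE` afterward.
gcloud compute instances set-machine-type INSTANCE_NAME --zone ZONE --machine-type NEW_MACHINE_TYPE
Step 6: Verify VPC Flow Logs are Flowing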
Confirm logs are being written to the configured sink after remediation.
gcloud logging read 'logName="projects/PROJECT_ID/logs/compute.googleapis.com%2Fvpc_flows"' --limit=5 --format="table(timestamp,jsonPayload.connection.src_ip)"
Architect's Pro Tip
"This often happens during application deployment spikes or memory leaks. Monitor the instance's memory usage via Cloud Monitoring and set alerts for sustained high usage (>85%) to act before OOM."
Frequently Asked Questions
Why would an OOM crash cause an IAM 403 error?
The OOM killer may terminate the Google Guest Agent or a logging agent, and severe memory pressure can leave processes unable to query the metadata server. Software on the instance depends on the metadata server to fetch and refresh the IAM access token for the instance's service account. Without a valid token, API calls (like writing logs) are denied.
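Whether token acquisition is actually broken can be tested from inside the instance. The sketch below queries the metadata server's token endpoint for the default service account; `check_metadata_token` is a hypothetical helper name, and the 5-second timeout is an arbitrary choice. It only shows whether a token can be obtained, not whether the service account holds the right role.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run on the instance to see whether the metadata server
# hands out an access token for the attached (default) service account.
check_metadata_token() {
  local url="http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"
  local body
  # -f makes curl fail on HTTP errors; the Metadata-Flavor header is required.
  body=$(curl -sf --max-time 5 -H "Metadata-Flavor: Google" "$url") || {
    echo "FAIL: metadata server unreachable or no service account attached"
    return 1
  }
  case "$body" in
    *'"access_token"'*) echo "OK: token available" ;;
    *) echo "FAIL: unexpected response from metadata server"; return 2 ;;
  esac
}
```

If this reports a token while log writes still return 403, the cause is more likely a missing role than the OOM condition.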
Can I just add the IAM role without restarting the instance?
No, not if the OOM condition has crashed the token management process; in that case the permission fix alone won't work. Restart the instance to restore the agent so it can fetch a fresh token, then grant the role if it is still missing.
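To find out whether the role is actually missing before granting it, the Step 2 query can be scripted. This is a minimal sketch: `has_role` is a hypothetical helper name, and it assumes the query is run with `--format="value(bindings.role)"` so that each role is printed on its own line. The match is exact, so it will not recognize broader roles (such as `roles/editor`) that also permit writing logs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the role named in $1 appears on stdin,
# where stdin holds one role per line.
has_role() {
  # -x: match the whole line, -F: treat the role as a literal string, -q: no output.
  grep -q -x -F "$1"
}

# Usage (placeholders as in Steps 2 and 4; grants the role only when absent):
#   if ! gcloud projects get-iam-policy PROJECT_ID \
#          --flatten="bindings[].members" \
#          --format="value(bindings.role)" \
#          --filter="bindings.members:SERVICE_ACCOUNT_EMAIL" |
#        has_role roles/logging.logWriter; then
#     gcloud projects add-iam-policy-binding PROJECT_ID \
#       --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
#       --role="roles/logging.logWriter"
#   fi
```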