Is a burst of 429s always a provider problem?

Not always. Often the cause is local: concurrency or tool depth went up after a feature flag flip. Confirm local queue depth and token accounting before opening a support ticket with the model vendor.

Should I retry 503 the same as 429?

No. 503 usually means transient server-side unavailability, so use a shorter first backoff and fewer total attempts, and alert if the error rate stays above 2% for more than 5 minutes.

AI / Automation April 27, 2026

2026-04-27 OpenClaw, upstream LLM HTTP 429/503, retry budgets, and per-org request shaping on a headless leased cloud Mac (HK / JP / KR / SG / US)

MacXCode Engineering Team April 27, 2026 ~19 min read

When OpenClaw runs 24/7 on a leased Mac mini M4 that is only visible through SSH or occasional VNC, the loudest “rate limit” symptoms almost always look like a broken model: chat replies stutter, tool calls fail mid-trace, and someone asks whether the API key was rotated. In reality, HTTP 429 and HTTP 503 on the outbound LLM provider path are a different class of incident from a 429 you might emit yourself from nginx on the inbound webhook edge, and they are different again from a dead route in egress / DNS / TLS resilience work. This 2026-04-27 guide is for SREs and staff engineers who need a repeatable split between “our” abuse of provider quotas and “their” service degradation. It covers numeric backoff defaults, a per-organization retry budget that your gateway cannot exceed without paging someone, and log fields that structured JSON logging can already index today. If the provider is healthy but the gateway still restarts, move one layer down the stack: launchd KeepAlive and respawn triage (2026-04-28).

Two “429” surfaces: your nginx vs their API gateway

During high-noise days (major model releases, regional outages, or a marketing team’s demo script that spawns 40 concurrent skills) it is easy to conflate three things: (A) a deliberately throttled inbound webhook, (B) an upstream 429/503, and (C) a 401/403 from a mis-keyed or expired credential. A practical triage is to look at the source IP in logs (your loopback to 127.0.0.1:18789 for the process vs the public edge), the user-agent or library tag the provider echoes, and whether retry-after (when present) is a small number of seconds or a matter of minutes. Operators in Hong Kong and Tokyo often need this split because a corporate proxy can 429 a burst while the model vendor is healthy, or the inverse. Tag each log line with edge=inbound|outbound in your log schema and never look back.
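The three-way split above can be sketched as a tiny classifier over the edge tag and the status code. This is a minimal illustration, not OpenClaw's own logic: the function name and the returned labels are hypothetical, and it assumes each log record already carries the edge=inbound|outbound tag.

```python
from typing import Optional


def classify_incident(edge: str, status: int) -> Optional[str]:
    """Map an (edge, status) pair to triage class A, B, or C from the text.

    `edge` is the inbound|outbound log tag. The label strings are
    hypothetical names chosen for this sketch.
    """
    if edge == "inbound" and status == 429:
        return "A_inbound_throttle"   # your own nginx throttling a webhook
    if edge == "outbound" and status in (429, 503):
        return "B_upstream_429_503"   # provider throttle or unavailability
    if edge == "outbound" and status in (401, 403):
        return "C_credential"         # mis-keyed or expired credential
    return None                       # not one of the three conflated cases


print(classify_incident("outbound", 429))
```

Anything that returns None falls outside this triage and needs its own investigation, for example transport failures on the egress path.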

Numeric anchor: if your 5xx or 429 outbound ratio exceeds 2% of all LLM round-trips for more than 5 minutes, open a sev-2 bridge—treat that like a network incident, not “AI quality drift.”
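One way to make the 2% / 5-minute anchor checkable is a trailing-window counter. The sketch below is a simplification under stated assumptions: it pages when the error ratio over the trailing 300 s exceeds 2%, which approximates "above 2% for more than 5 minutes" rather than proving the ratio stayed high the whole time. The class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, Tuple

WINDOW_S = 300.0   # 5 minutes, from the numeric anchor
THRESHOLD = 0.02   # 2% of all LLM round-trips


def is_error(status: int) -> bool:
    """Count 429 and any 5xx as an outbound error."""
    return status == 429 or 500 <= status <= 599


class OutboundErrorWindow:
    """Trailing-window error ratio for outbound LLM calls (illustrative)."""

    def __init__(self, window_s: float = WINDOW_S, threshold: float = THRESHOLD):
        self.window_s = window_s
        self.threshold = threshold
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, ts: float, status: int) -> None:
        """Add one round-trip and drop events older than the window."""
        self._events.append((ts, is_error(status)))
        while self._events and self._events[0][0] <= ts - self.window_s:
            self._events.popleft()

    def ratio(self) -> float:
        if not self._events:
            return 0.0
        errors = sum(1 for _, err in self._events if err)
        return errors / len(self._events)

    def should_page(self) -> bool:
        """True when the trailing ratio exceeds the threshold."""
        return self.ratio() > self.threshold
```

Feed it one `record(ts, status)` per outbound round-trip and check `should_page()` from whatever alerting loop you already run.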

HTTP 401/403 vs 429 vs 503/529: what the gateway should do differently

Status	Meaning in practice	Default operator posture
`401` / `403`	Credential, allow-list, or org mismatch—often immediate and deterministic until config changes	Stop auto-retry beyond 0–1 probe; verify key rotation and launchd-stored secrets.
`429`	Provider-side throttle or token / request bucket hit	Backoff with jitter; reduce in-flight work before blaming “OpenClaw bugs.”
`503` / `529` (varies by vendor)	Overloaded or unavailable region / pool	Shorter first delay, fewer max attempts, possibly route to a secondary model if policy allows (see model allowlist triage).
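The table can be expressed as a lookup that an HTTP client wrapper consults before retrying. This is a sketch only. The 0.4 to 1.2 s first delay for 429 comes from the jitter policy later in this guide; the retry counts and the 503/529 delay range are placeholder assumptions, chosen just to satisfy "shorter first delay, fewer max attempts". All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RetryPosture:
    max_retries: int                     # automatic retries allowed
    first_delay_s: Tuple[float, float]   # jitter range for the first retry
    posture: str                         # what the operator should do


_PROBE_ONLY = RetryPosture(1, (0.0, 0.0), "one probe at most, then verify key rotation and stored secrets")
_THROTTLED = RetryPosture(5, (0.4, 1.2), "back off with jitter and reduce in-flight work")  # 5 is a placeholder
_OVERLOADED = RetryPosture(3, (0.2, 0.6), "short first delay, few attempts, consider a secondary model")  # placeholders
_NO_RETRY = RetryPosture(0, (0.0, 0.0), "do not auto-retry")

POSTURES = {
    401: _PROBE_ONLY,
    403: _PROBE_ONLY,
    429: _THROTTLED,
    503: _OVERLOADED,
    529: _OVERLOADED,
}


def posture_for(status: int) -> RetryPosture:
    """Return the retry posture for a status, defaulting to no retry."""
    return POSTURES.get(status, _NO_RETRY)


print(posture_for(503).posture)
```

Keeping the postures in one table makes it harder for a naive loop to treat 401 like 429.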

Keep your internal HTTP client settings honest: a naive loop that always retries 401 is how you lock an account under provider-side brute-force heuristics. A loop that never backs off on 503 is how a minor vendor blip becomes a self-DDoS from a busy gateway on a Mac with a 10 Gbps NIC, which really can sustain an absurd number of parallel streams.

Concurrency, token windows, and org-level budgets (not per chat)

Most “we got rate-limited” tickets are self-inflicted concurrency: a cron-style schedule from launchd tasks that overlaps with a burst of Discord or Slack tool invocations, each spawning parallel HTTP calls, each larger than a human would type. Fix the shape of traffic first:

Cap simultaneous tool-using turns to a number you can write in a policy, for example 4 in-flight upstream HTTP requests per gateway process (tune to CPU and Neural Engine offloads if you do local model routing, but the HTTP path still needs a cap).
Size per-organization daily token ceilings in business terms (support vs marketing vs R&D), not only a global MAX_TOKENS in one JSON file.
Time-shift scheduled jobs in your cron/launchd layer to avoid a pile-up on UTC hour boundaries when US and EU workdays overlap in SG and JP regions.
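The first two items above can be sketched with an asyncio semaphore and a per-organization counter. This assumes an asyncio-based client, which the guide does not specify; `fake_upstream_call`, `run_capped`, and `OrgTokenBudget` are hypothetical names, and the ceilings in the usage lines are made-up numbers.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

MAX_IN_FLIGHT = 4  # example cap from the policy above; tune per gateway


class OrgTokenBudget:
    """Per-organization daily token ceilings (in-memory sketch, no reset logic)."""

    def __init__(self, daily_ceilings: Dict[str, int]):
        self._ceilings = dict(daily_ceilings)
        self._used = {org: 0 for org in daily_ceilings}

    def try_spend(self, org: str, tokens: int) -> bool:
        """Reserve tokens for an org; refuse unknown orgs and overdrafts."""
        if org not in self._ceilings:
            return False
        if self._used[org] + tokens > self._ceilings[org]:
            return False
        self._used[org] += tokens
        return True


async def fake_upstream_call() -> str:
    """Stand-in for a real upstream HTTP request."""
    await asyncio.sleep(0.01)
    return "ok"


async def run_capped(
    jobs: List[Callable[[], Awaitable[str]]], limit: int = MAX_IN_FLIGHT
) -> Tuple[List[str], int]:
    """Run jobs with at most `limit` in flight; also report the observed peak."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(j) for j in jobs))
    return list(results), peak


budget = OrgTokenBudget({"support": 1000, "marketing": 500})
results, peak = asyncio.run(run_capped([fake_upstream_call] * 10))
print(len(results), "calls completed; peak in-flight:", peak)
```

The observed peak is also the first number in the postmortem trio, so it is worth exporting as a metric rather than only enforcing it.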

Jitter, exponential cap, and wall-time budgets (three concrete numbers)

Adopt a written policy your on-call can paste into a ticket, not a vibe:

First 429 retry after 0.4–1.2 s of jitter, not a hard 0 s hot loop.
Exponential cap at 32–64 s between attempts, even if a vendor says “come back in two minutes” during incidents—escalate humans for longer gaps.
Hard stop the entire job after 180 s of cumulative retry wait for user-visible chat turns, or you train users to think the gateway is “hung” and they spam channels, which creates another 429 layer.

sleep $((2 + RANDOM % 3))
curl -sS -D - "https://api.example/v1/health" -o /dev/null | head -n1
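The three numbers in the policy above can be combined into a delay schedule. This is a minimal sketch, assuming a doubling factor per attempt (the guide says "exponential" without naming a base) and the lower end of the 32 to 64 s cap; `backoff_schedule` is a hypothetical helper, and it only computes delays without sleeping or making requests.

```python
import random
from typing import List, Optional

FIRST_RETRY_JITTER_S = (0.4, 1.2)  # first retry window from the policy
MAX_GAP_S = 32.0                   # lower end of the 32-64 s cap
WALL_BUDGET_S = 180.0              # cumulative retry wait for chat turns


def backoff_schedule(max_attempts: int, rng: Optional[random.Random] = None) -> List[float]:
    """Return retry delays: jittered, doubling, capped, and budget-bounded."""
    rng = rng or random.Random()
    delays: List[float] = []
    waited = 0.0
    for attempt in range(max_attempts):
        base = rng.uniform(*FIRST_RETRY_JITTER_S)      # fresh jitter each attempt
        delay = min(MAX_GAP_S, base * (2 ** attempt))  # exponential growth with a cap
        if waited + delay > WALL_BUDGET_S:
            break                                      # hard stop: budget exhausted
        delays.append(delay)
        waited += delay
    return delays


schedule = backoff_schedule(20, random.Random(7))
print([round(d, 2) for d in schedule], "total:", round(sum(schedule), 1))
```

When the schedule ends early, the caller should fail the chat turn visibly instead of waiting, which is the point of the 180 s hard stop.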

Separate budgets: the webhook 429/503 retry policy and the LLM policy should be different YAML blocks—if you copy-paste, you will amplify duplicates or starve one side.

Runbook: six steps when charts go yellow

1. Freeze releases to the openclaw package unless you are rolling back; config-only changes are OK if peer-reviewed in 2 minutes under incident rules.
2. Classify the edge using the inbound/outbound label in logs; if outbound 429, snapshot current concurrency and per-org limits from your live config, redacting secrets in the copy.
3. Scale down, not up: cut parallel tool runners by 50% and measure whether p95 latency improves. If it does, you were queue-bound, not underpowered.
4. Compare against egress and DNS baselines: if TLS errors rose in the same period, you may be seeing transport failures miscategorized as 503 by clients.
5. Re-run openclaw doctor in the same environment as the daemon, per the doctor/allowlist triage runbook, to catch stale env or model allowlist issues masquerading as throttles.
6. Postmortem the numeric trio: (1) peak in-flight, (2) 429 / total request ratio, (3) median tokens per call. If (3) jumped, your prompt template regressed; that is not the provider’s fault.
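The numeric trio from the last step can be computed straight from parsed log records. The sketch below assumes each record is a dict with hypothetical keys `status`, `in_flight`, and `tokens`; your log schema may name them differently.

```python
from statistics import median
from typing import Dict, Iterable, Optional


def numeric_trio(records: Iterable[Dict[str, int]]) -> Optional[Dict[str, float]]:
    """Compute peak in-flight, 429 ratio, and median tokens per call."""
    rows = list(records)
    if not rows:
        return None  # nothing to summarize for this window
    return {
        "peak_in_flight": max(r["in_flight"] for r in rows),
        "ratio_429": sum(1 for r in rows if r["status"] == 429) / len(rows),
        "median_tokens": median(r["tokens"] for r in rows),
    }


sample = [
    {"status": 200, "in_flight": 2, "tokens": 100},
    {"status": 429, "in_flight": 4, "tokens": 200},
    {"status": 200, "in_flight": 3, "tokens": 300},
    {"status": 200, "in_flight": 1, "tokens": 400},
]
print(numeric_trio(sample))
```

Comparing the trio for the incident window against the same window a week earlier shows which of the three numbers actually moved.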

What to log so Grafana / Loki can answer the question in one query

At minimum, emit: ts, provider, route (completions vs embeddings), attempt, status, request_id if returned, and ms wall time. If you can add org and surface (chat vs headless run), you can separate customer-visible SLOs from batch work. This extends the readiness and probe work you already have for loopback: nothing stops you from adding a synthetic, low-QPS “canary” completion in maintenance windows, as long as the token line item is pre-approved. Cross-link incidents with stale module upgrades because a Node HTTP stack bug can look like random 5xx if versions disagree across processes.
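A log line with the minimum fields might be built as below. This is an illustrative sketch, not OpenClaw's actual log format: `llm_log_line` is a hypothetical helper, the `edge` key follows the inbound|outbound tagging advice earlier in this guide, and optional fields are simply left out when unknown.

```python
import json
import time
from typing import Optional


def llm_log_line(
    provider: str,
    route: str,            # "completions" or "embeddings"
    attempt: int,
    status: int,
    ms: float,             # wall time of this attempt in milliseconds
    request_id: Optional[str] = None,
    org: Optional[str] = None,
    surface: Optional[str] = None,   # "chat" or "headless"
    ts: Optional[float] = None,
) -> str:
    """Serialize one outbound LLM attempt as a single JSON log line."""
    record = {
        "ts": time.time() if ts is None else ts,
        "edge": "outbound",
        "provider": provider,
        "route": route,
        "attempt": attempt,
        "status": status,
        "ms": ms,
    }
    # Only include optional fields that were actually provided.
    for key, value in (("request_id", request_id), ("org", org), ("surface", surface)):
        if value is not None:
            record[key] = value
    return json.dumps(record, sort_keys=True)


print(llm_log_line("example-provider", "completions", 2, 429, 812.5, org="support", ts=1.0))
```

One line per attempt, rather than one per request, is what lets a single query separate first-try failures from retry storms.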

FAQ: throttles, fairness, and operator trust

Question	Answer in one breath
Is “buy more API quota” the default fix?	Sometimes yes for sustained business growth, but shape traffic first—a misconfigured parallel for-each in a custom skill can burn a fresh quota in hours.
Do we need a second physical Mac in-region?	If CPU is low but HTTP saturation is high, a second node is often cheaper than a larger vendor spend—see regional plans and split orgs or environments.

Related runbooks on the same headless node

Pair this with egress DNS/TLS (transport truth), inbound webhooks and retry semantics (a different 429), and the subagent channel triage article when the symptom is “stuck in Discord,” not a literal HTTP 429. For stable env wiring, keep using launchd and API key hygiene.

Why Mac mini M4 in HK / JP / KR / SG / US still makes sense for bursty model traffic

OpenClaw mixes burst CPU (tool orchestration, occasional local embeddings) with all-day NIC use (streaming HTTP/2, repeated TLS handshakes, background heartbeats from bridges). A physical Mac mini M4 keeps those workloads on a predictable stack: no hypervisor lying about timer quality for backoff algorithms, and enough unified memory to hold verbose JSON log buffers without the OOM flaps you get on 8 GB Linux slices. Colocating the gateway in the same metro as your mobile QA and Xcode build fleet also means one vendor relationship for SSH, disk, and support hours—simplify first, then scale to two nodes with evidence.

