AI / Automation May 6, 2026

2026-05-06 OpenClaw dual gateway: ports, OPENCLAW_STATE_DIR, and nginx upstream rollout on a leased Apple Silicon cloud Mac (HK / JP / KR / SG / US)

MacXCode Engineering Team May 6, 2026

Some teams want two OpenClaw gateways on the same leased Mac mini M4: a canary build beside production, or a staging model allowlist isolated from customer webhooks. Doing this wrong—duplicate ports, shared ~/.openclaw, nginx weights that flip mid-request—creates 502 storms that look like model outages but are pure plumbing. This 2026-05-06 guide sequences non-colliding listen ports, distinct OPENCLAW_STATE_DIR trees, nginx upstream blocks with slow_start, and health probes so you can drain traffic without SSH heroics across Hong Kong, Tokyo, Seoul, Singapore, and the United States. It builds on nginx + webhook ingress, health probes, launchd env precedence, and file-tool discipline when agents read overlapping repos.

When dual gateways are worth the complexity

Prefer a second Mac when budgets allow—see pricing for adding a witness node. If finance insists on one host, dual gateways still help when:

  • Model allowlists differ between regulated and experimental traffic.
  • Webhook signing secrets rotate on different cadences for staging vs prod.
  • CPU pinning is not available, but you still want a blast-radius firewall between canary and prod sessions.
Numbers: keep canary weight at 5–12% of requests for the first 48 hours; raise in 10% steps only when error budgets stay green.
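
As a rough illustration of the stepping rule above, the sketch below lists the canary weights a rollout would pass through. The name `ramp_schedule` and its defaults are hypothetical, and the 48-hour hold and error-budget check between steps remain an operator decision that this code does not model.

```python
def ramp_schedule(start_pct: int = 10, step_pct: int = 10,
                  ceiling_pct: int = 100) -> list[int]:
    """Return the canary weights (in percent) visited during a rollout."""
    # The article recommends a 5-12% slice for the first 48 hours.
    if not 5 <= start_pct <= 12:
        raise ValueError("initial canary weight should stay within 5-12%")
    if step_pct <= 0:
        raise ValueError("step must be positive")
    weights = [start_pct]
    # Raise in fixed steps, clamping the last step at the ceiling.
    while weights[-1] < ceiling_pct:
        weights.append(min(weights[-1] + step_pct, ceiling_pct))
    return weights


if __name__ == "__main__":
    print(ramp_schedule(10, 10, 50))
```

Each returned value is a candidate nginx weight; advance to the next one only while error budgets stay green.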

Topology choices: side-by-side vs primary/standby

Side-by-side active/active works when clients can randomize or you control nginx weighting. Primary/standby suits teams that need deterministic debugging—only one gateway owns write-heavy tools against a shared git mirror. Document the choice in the same README where you store launchctl labels.

| Pattern | Pros | Cons |
| --- | --- | --- |
| Active / active | Throughput; soak testing under real concurrency | Requires idempotent webhooks and deduped side effects |
| Primary / standby | Simpler logs; easier rollback | Wastes idle CPU unless standby runs synthetic probes |

Ports, bind addresses, and launchd labels

Never rely on a one-off lsof check to pick a free port; pin each port in the launchd plist and in nginx. Example pattern (adapt to your OpenClaw version):

export OPENCLAW_GATEWAY_PORT_PRIMARY=18789
export OPENCLAW_GATEWAY_PORT_CANARY=18890

Bind to 127.0.0.1 for both; let nginx own 0.0.0.0:443. If you previously bound the canary to ::1 while nginx used IPv4-only upstreams, you will chase phantom connection resets—standardize address families per egress resilience guidance.
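
A small pre-flight check can catch duplicate port assignments and stale listeners before launchd starts either job. This is a minimal sketch using only the standard library; the helper names `is_listening` and `check_ports` are hypothetical, and it probes IPv4 loopback only, matching the 127.0.0.1 recommendation above.

```python
import socket


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def check_ports(ports: dict[str, int]) -> dict[str, bool]:
    """Reject duplicate assignments, then report which ports already have a listener."""
    if len(set(ports.values())) != len(ports):
        raise ValueError(f"duplicate ports in {ports}")
    return {name: is_listening(port) for name, port in ports.items()}


if __name__ == "__main__":
    # Port numbers taken from the example exports above.
    print(check_ports({"primary": 18789, "canary": 18890}))
```

Before the first start, both values should be False; after launchd brings the gateways up, both should be True.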

State directories: why sharing ~/.openclaw is forbidden

Each gateway needs its own OPENCLAW_STATE_DIR pointing at a dedicated path (e.g., /var/lib/openclaw-prod vs /var/lib/openclaw-canary). A shared home causes session-file races and poisoned caches when one gateway rotates API keys. Running both launchd jobs under the same UID is acceptable only if the directories stay separate; never share the SQLite or session roots.
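
The separation rule can be checked mechanically before either gateway starts. The sketch below is one possible guard, with a hypothetical helper name `validate_state_dirs`; it treats identical and nested paths as collisions and does not inspect permissions or directory contents.

```python
from pathlib import Path


def validate_state_dirs(dirs: dict[str, str]) -> None:
    """Raise ValueError if any two state dirs are identical or nested."""
    # resolve() normalises symlinks and relative segments; paths need not exist yet.
    resolved = {name: Path(p).expanduser().resolve() for name, p in dirs.items()}
    names = list(resolved)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pa, pb = resolved[a], resolved[b]
            if pa == pb or pa in pb.parents or pb in pa.parents:
                raise ValueError(f"{a} ({pa}) and {b} ({pb}) overlap")


if __name__ == "__main__":
    # Example paths from the paragraph above.
    validate_state_dirs({
        "prod": "/var/lib/openclaw-prod",
        "canary": "/var/lib/openclaw-canary",
    })
    print("state dirs are disjoint")
```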

Backup tip: snapshot both state dirs before nginx weight changes—rollback is rsync plus launchctl kickstart, not a reinstall story.

nginx upstream blocks: weights, slow_start, max_fails

Copy your working server block from the reverse proxy guide, then duplicate upstream definitions:

upstream openclaw_primary {
    server 127.0.0.1:18789 max_fails=3 fail_timeout=20s;
}

upstream openclaw_canary {
    # slow_start is available only in NGINX Plus; drop it on open-source nginx.
    server 127.0.0.1:18890 max_fails=1 fail_timeout=10s slow_start=30s;
}

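Use split_clients, or weighted servers inside a single upstream group, consistent with your risk appetite. Note that nginx ignores max_fails, fail_timeout, and slow_start when an upstream group contains only one server, so with the single-server groups above those parameters take effect only if you merge both servers into one weighted group. Keep proxy_read_timeout aligned with the longest model round-trip you permit; splitting gateways does not shrink LLM latency.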
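
To keep the canary percentage in one reviewed place, the split_clients block can be generated rather than hand-edited. The sketch below renders such a block as text. The variable name `$openclaw_upstream`, the hash key `${remote_addr}`, and the helper name `render_split_clients` are assumptions for illustration; the upstream names match the example above.

```python
def render_split_clients(canary_pct: float, key: str = "${remote_addr}") -> str:
    """Render an nginx split_clients block sending canary_pct% of keys to the canary."""
    if not 0 <= canary_pct < 100:
        raise ValueError("canary_pct must be in [0, 100)")
    lines = [f'split_clients "{key}" $openclaw_upstream {{']
    if canary_pct > 0:
        # A 0% canary is expressed by omitting the slice entirely.
        lines.append(f"    {canary_pct:g}%  openclaw_canary;")
    lines.append("    *  openclaw_primary;")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_split_clients(10))
```

The rendered block belongs in the http context, and the location would then use `proxy_pass http://$openclaw_upstream;`. Hashing on the client address keeps a given client on one gateway; a per-request key would spread each client across both, which may or may not suit your webhooks.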

When you reload nginx, remember worker_connections and keepalive pools are per-worker: doubling upstreams without raising limits can cap throughput during burst webhook replays. Capture stub_status metrics (where permitted) or parse access logs for upstream response times—if p95 jumps more than 35% immediately after reload, you likely starved workers rather than misconfigured OpenClaw. For TLS-heavy workloads, align cipher suites between staging and prod so you do not accidentally enable a slower path only on the canary chain.
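
The 35% rule of thumb can be checked against upstream response times parsed from access logs. The following is a minimal sketch that assumes you have already extracted two lists of timings (before and after the reload); log parsing is out of scope, and the function names are hypothetical.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def reload_regressed(before: list[float], after: list[float],
                     max_jump: float = 0.35) -> bool:
    """True when p95 after the reload exceeds p95 before it by more than max_jump."""
    base = p95(before)
    return base > 0 and (p95(after) - base) / base > max_jump


if __name__ == "__main__":
    print(reload_regressed([0.10] * 50, [0.15] * 50))
```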

Keep an explicit nginx rollback tarball beside your OpenClaw state snapshots. Operators who snapshot only the Node trees and not nginx frequently rediscover that a two-line typo in map $http_x_forwarded_proto takes down both gateways at once: nginx refuses to start with a config it cannot parse, so the next restart removes ingress for both. Run nginx -t before every reload.
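
One way to produce that rollback artifact is a timestamped tarball covering the nginx config and both state directories. This is a sketch with a hypothetical helper `snapshot`; the paths you pass in depend on your install and are not prescribed by the article.

```python
import tarfile
import time
from pathlib import Path


def snapshot(paths: list[str], dest_dir: str, label: str = "rollback") -> Path:
    """Write a gzip tarball containing every existing path in `paths`."""
    stamp = time.strftime("%Y%m%dT%H%M%S")
    out = Path(dest_dir) / f"{label}-{stamp}.tar.gz"
    with tarfile.open(out, "w:gz") as tar:
        for p in map(Path, paths):
            if p.exists():
                # Entries are stored under their basename, so basenames must be unique.
                tar.add(p, arcname=p.name)
    return out
```

Call it with your nginx config directory plus both OPENCLAW_STATE_DIR paths before any weight change, and keep the resulting file somewhere outside those directories.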

Health probes: prove both gateways before traffic shifts

Run the same curl matrix from readiness probes against both ports locally, then through nginx after reload. Log HTTP status, upstream connect time, and TLS handshake duration separately so you can tell nginx regressions from OpenClaw regressions.
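
The local half of that matrix can be scripted so both gateways are probed identically. The sketch below records HTTP status, connect time, and total time per port over plain HTTP on loopback; the path `/healthz` is a placeholder assumption (substitute whatever your readiness probe uses), and TLS handshake timing through nginx is not covered here.

```python
import http.client
import time


def probe(port: int, path: str = "/healthz", host: str = "127.0.0.1",
          timeout: float = 3.0) -> dict:
    """GET http://host:port/path and report status plus connect/total time in ms."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    start = time.perf_counter()
    try:
        conn.connect()
        connected = time.perf_counter()
        conn.request("GET", path)
        status = conn.getresponse().status
        done = time.perf_counter()
        return {
            "port": port,
            "status": status,
            "connect_ms": (connected - start) * 1000,
            "total_ms": (done - start) * 1000,
        }
    except (OSError, http.client.HTTPException) as exc:
        # Connection refused, timeouts, and malformed responses all land here.
        return {"port": port, "status": None, "error": str(exc)}
    finally:
        conn.close()


if __name__ == "__main__":
    for gateway_port in (18789, 18890):  # ports from the example above
        print(probe(gateway_port))
```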

Eight-step rollout checklist

  1. Snapshot plists, nginx, and both state dirs.
  2. Install second gateway binaries or npm pins identical to primary.
  3. Validate openclaw doctor on each tree independently.
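  4. Start the canary with a 0% traffic share in nginx (for example, no split_clients slice yet; nginx does not accept weight=0 on a server) so it receives nothing until ready.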
  5. Smoke synthetic webhook replay from staging.
  6. Raise weight to canary slice (5–12%).
  7. Observe error budget for 48h.
  8. Promote or rollback with documented weight steps.

Timeouts, webhook retries, and double-delivery

Dual gateways amplify duplicate delivery risk if upstream providers retry aggressively. Ensure your handler code honors idempotency keys and stores dedupe markers under the correct state dir. Align proxy_read_timeout with your provider’s TCP idle limits—see webhook delivery + signatures for retry tables.
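
A dedupe marker can be as simple as an exclusively created file keyed by the idempotency key. The sketch below is one possible approach, not OpenClaw's own mechanism; the `dedupe` subdirectory and the helper name `claim_delivery` are hypothetical.

```python
import hashlib
import os
from pathlib import Path


def claim_delivery(state_dir: str, idempotency_key: str) -> bool:
    """Return True the first time a key is seen under this state dir, False on replays."""
    marker_dir = Path(state_dir) / "dedupe"  # hypothetical subdirectory name
    marker_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so arbitrary provider strings become safe file names.
    digest = hashlib.sha256(idempotency_key.encode("utf-8")).hexdigest()
    try:
        # O_EXCL makes creation atomic: exactly one caller wins for a given key.
        fd = os.open(marker_dir / digest, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

Because each gateway has its own state directory, markers written this way only catch replays that land on the same gateway. In an active/active split, a retry routed to the other gateway needs either sticky routing by key or a dedupe store both gateways can reach.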

| SLO signal | Threshold | Action |
| --- | --- | --- |
| Upstream connect p95 | > 120 ms to localhost | Pause weight shift; inspect launchd throttling |
| 5xx ratio | > 0.5% over 15 minutes | Drain canary weight to zero |
| Queue backlog | > 200 pending jobs | Add a second Mac or reduce concurrency; do not stack a third gateway |

SLO table: what “healthy dual gateways” means

Treat dual gateways as a mini service mesh: measure each hop independently. If canary violates SLO while primary is green, fix canary config—do not tune nginx blindly.
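
The SLO thresholds from this guide (connect p95 above 120 ms, 5xx ratio above 0.5% over 15 minutes, backlog above 200 jobs) can be encoded so that each gateway is evaluated on its own numbers. This is a sketch; the `GatewayMetrics` type and `slo_actions` function are hypothetical, and collecting the metrics is left to your monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class GatewayMetrics:
    connect_p95_ms: float   # upstream connect p95 to localhost
    error_5xx_ratio: float  # 5xx share over the last 15 minutes (0.005 == 0.5%)
    queue_backlog: int      # pending jobs


def slo_actions(m: GatewayMetrics) -> list[str]:
    """Map one gateway's metrics to the actions listed in the SLO table."""
    actions = []
    if m.connect_p95_ms > 120:
        actions.append("pause weight shift; inspect launchd throttling")
    if m.error_5xx_ratio > 0.005:
        actions.append("drain canary weight to zero")
    if m.queue_backlog > 200:
        actions.append("add a second Mac or reduce concurrency")
    return actions


if __name__ == "__main__":
    print(slo_actions(GatewayMetrics(connect_p95_ms=150, error_5xx_ratio=0.001, queue_backlog=10)))
```

Run it once per gateway; an empty list for primary and a non-empty list for canary points at canary configuration rather than nginx.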

Publish an internal dashboard with four lines per gateway: process RSS, CPU%, count of in-flight HTTP requests from nginx logs, and count of failed upstream connects. When canary RSS diverges more than 18% from primary at the same traffic weight, you usually have a memory leak in a plugin or mismatched Node semver; pause the rollout and compare npm ls --depth=0 trees from each OPENCLAW_STATE_DIR. Teams that skip this comparison often burn a weekend chasing an "LLM quality regression" that was actually a single duplicated skill bundle loading twice on the canary gateway.
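
The 18% divergence check is simple arithmetic, sketched below with hypothetical function names. It assumes both RSS samples were taken at the same traffic weight, as the paragraph above requires; comparing a 10% canary against a 90% primary would not be meaningful.

```python
def rss_divergence(primary_rss_mb: float, canary_rss_mb: float) -> float:
    """Relative RSS gap of canary versus primary (0.18 == 18%)."""
    if primary_rss_mb <= 0:
        raise ValueError("primary RSS must be positive")
    return abs(canary_rss_mb - primary_rss_mb) / primary_rss_mb


def should_pause_rollout(primary_rss_mb: float, canary_rss_mb: float,
                         threshold: float = 0.18) -> bool:
    """True when the canary's RSS differs from primary by more than the threshold."""
    return rss_divergence(primary_rss_mb, canary_rss_mb) > threshold


if __name__ == "__main__":
    print(should_pause_rollout(500.0, 600.0))
```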

Finally, rehearse total nginx failure quarterly: stop nginx intentionally, confirm launchd still keeps both gateways alive on loopback, and verify your incident bot pages the right humans. The goal is confidence that dual gateways increase availability—not that they double your pager noise.

FAQ: TLS, DNS, and secrets

| Question | Practical answer (2026-05-06) |
| --- | --- |
| Separate TLS certs per gateway? | Usually no; terminate once at nginx and keep the gateways loopback-only. |
| Same API keys for both? | Avoid; rotate independently to limit blast radius, mirroring the pattern from upgrade rollback. |

Why Mac mini M4 bare metal still simplifies dual-gateway chaos

Dual gateways stress CPU, disk, and launchd bookkeeping simultaneously. A leased Mac mini M4 with 1–2 TB NVMe on MacXCode nodes gives predictable loopback latency for nginx→OpenClaw hops and enough unified memory to run two Node graphs without stealing cores from your nightly xcodebuild neighbor lanes. That predictability is what makes weight-based rollouts measurable instead of mystical. Pair the hardware story with regional pricing when ops asks for a dedicated second node instead of a third gateway on an already busy host, and with help docs for SSH break-glass when VNC is required to approve a new TLS trust anchor.
