AI / Automation April 25, 2026

2026-04-25 OpenClaw doctor, CLI Config & Model Allowlist Triage After Upgrades on Headless Leased Cloud Mac

MacXCode Engineering Team April 25, 2026 ~17 min read

On a leased Mac mini M4 in HK / JP / KR / SG / US you usually work SSH-first and reach for VNC only when a GUI is unavoidable. That headless mode is what the OpenClaw project optimizes for, and it has a characteristic failure mode: a long-lived gateway process, a JSON-shaped config, and a rapid npm release cadence where “nothing changed in my git diff” can still break production because a dependency bumped validation rules, moved defaults, or tightened a model allowlist. This 2026-04-25 runbook is a triage path, not a rehash of the egress and DNS guide or the onboarding steps. It covers, in order: openclaw doctor (and the optional --fix), the split between CLI-managed keys and hand-edited files, how to prove which ~/.openclaw a LaunchAgent actually read, and how to avoid the classic “works in an interactive ssh -t shell, fails under launchd” trap. Keep npm + double restart in your rollback story for when doctor flags stale native modules or duplicated gateways.

Why upgrades surface “ghost” issues on headless Macs

Three different environments are in play: your laptop (rich PATH, zshrc, HTTPS_PROXY from a VPN you forgot), the CI-style SSH session your automation uses, and the launchd environment, which is the only one that matters to the daemon. After an npm -g upgrade or a tagged OpenClaw release, schema validation may start rejecting a key you have carried since April, the gateway may write a second PID file, or a tool call that used to be implicit may need an explicit approval or capability that only shows up in verbose logs. Treat a readiness check as a fast filter for the listener on 127.0.0.1:18789, not as a certificate that the whole personality config is still coherent: readiness is necessary but not sufficient for “chat works in every channel.”
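One way to make the gap between those environments visible is to diff them. The sketch below is illustrative only: diff_env and parse_env_lines are hypothetical helpers, not part of OpenClaw, and the list of watched keys is an assumption you should adapt. It takes two KEY=VALUE dumps (for example, env output from your SSH session and from whatever wrapper starts the daemon) and reports the watched keys whose values differ.

```python
def parse_env_lines(text):
    """Parse KEY=VALUE lines (such as the output of `env`) into a dict.

    Lines without '=' are ignored; multi-line values are not supported.
    """
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key:
            env[key] = value
    return env


def diff_env(shell_env, daemon_env,
             keys=("PATH", "HTTPS_PROXY", "HTTP_PROXY", "NODE_OPTIONS")):
    """Return {key: (shell_value, daemon_value)} for watched keys that differ.

    The default key list is an assumption, not an OpenClaw requirement.
    A missing key is reported as None.
    """
    drift = {}
    for key in keys:
        shell_value = shell_env.get(key)
        daemon_value = daemon_env.get(key)
        if shell_value != daemon_value:
            drift[key] = (shell_value, daemon_value)
    return drift
```

An empty result means the watched keys match in both environments. Any entry with None on the daemon side is a variable your shell has and launchd does not.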

Run openclaw doctor (and --fix) in the first 90 seconds

Start with a non-destructive pass, then a fix pass you can roll back. First, openclaw status --all prints versions, PIDs, listening ports, and a compact matrix of the moving parts. Second, openclaw doctor validates the schema, detects stale or conflicting keys, and (depending on the release) nudges you toward supported paths. Third, openclaw doctor --fix applies when the CLI can safely remove obvious debris, such as a stale ~/.openclaw/gateway.pid after a crash or a deprecated sub-key that the migrator can rewrite. Document the exact order you use in your internal wiki: teams that run doctor after a manual kill -9 get different output than those who run openclaw gateway stop first. If doctor reports a plugin load failure, correlate it with the WebSocket 1006 class of errors reported across the ecosystem. Do not stop at the first “another gateway is listening” string without checking for legacy clawdbot processes from older installs, a pitfall the community has repeatedly documented.
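To keep that order identical across operators, the sequence can be wrapped in a small script. This is a minimal sketch, not an official tool. It uses only the three commands named above, and it assumes that openclaw doctor exits non-zero when it finds problems, which you should verify against your installed version. The run_triage name and the injectable runner parameter are hypothetical conveniences for testing.

```python
import subprocess

# Read-only steps, in the order described above.
TRIAGE_STEPS = [
    ["openclaw", "status", "--all"],
    ["openclaw", "doctor"],
]
FIX_STEP = ["openclaw", "doctor", "--fix"]


def run_triage(apply_fix=False, runner=subprocess.run):
    """Run the read-only steps, then the fix step only if requested and needed.

    Assumption: a non-zero exit code from `openclaw doctor` means it found
    problems. Returns a list of (command_string, returncode) tuples.
    """
    results = []
    doctor_failed = False
    for cmd in TRIAGE_STEPS:
        proc = runner(cmd, capture_output=True, text=True)
        results.append((" ".join(cmd), proc.returncode))
        if cmd == ["openclaw", "doctor"] and proc.returncode != 0:
            doctor_failed = True
    if apply_fix and doctor_failed:
        proc = runner(FIX_STEP, capture_output=True, text=True)
        results.append((" ".join(FIX_STEP), proc.returncode))
    return results
```

Calling run_triage() performs only the non-destructive pass. Passing apply_fix=True adds the fix pass, and only when the plain doctor run reported a failure.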

Single source of truth: CLI config vs on-disk JSON

Prefer openclaw config set <key> <value> and openclaw config get <key> (the exact surface varies by version) to mutate configuration. Many teams break their config by hand-editing ~/.openclaw/openclaw.json and leaving an extra comma or a key that the UI once wrote but the CLI no longer serializes. After any manual edit, run doctor again. Hot reload covers many agent / model / prompt values, but gateway.port and gateway.bind still deserve a full openclaw gateway restart per the upstream hot-reload contract, mirroring what you would do for nginx + webhook when only one half of a split stack changed. Keep secrets in ~/.openclaw/.env and pass them to the daemon through the same EnvironmentVariables stanza you already use for API keys in launchd, not through a cat >> ~/.zshrc you ran once during onboarding.
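A cheap pre-check before handing a hand-edited file to doctor is to confirm that it still parses. The sketch below assumes the file is read as strict JSON. If your OpenClaw version accepts a more relaxed syntax, this check is stricter than necessary and doctor remains the authority. check_config_text is a hypothetical helper, and it says nothing about whether the keys themselves are valid.

```python
import json


def check_config_text(text):
    """Return None if text parses as a strict JSON object, else an error string.

    This only catches syntax problems (such as a trailing comma); schema
    validation is still the job of `openclaw doctor`.
    """
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    if not isinstance(parsed, dict):
        return "top-level value is not a JSON object"
    return None
```

Read the file with open(path).read() and pass the text in. A non-None result points at the line and column to fix before you restart anything.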

Model allowlist: “not allowed” is often policy, not outage

When a provider renames a model string, or your skills reference anthropic/claude-3-5-sonnet-latest while the gateway only allows a pinned list, you get failures that look like credential problems. Inspect the effective allowlist with openclaw config get agents.defaults.models (or the closest equivalent your version exposes) and make the strings identical to the catalog you intend: case, vendor prefix, and slot (primary / fallback) all matter. Pair this with the egress article: if the error arrives after TLS and HTTP/2 are established, it is a model routing issue; if the connection fails before the TCP or TLS handshake completes, you are in DNS/TLS land. Cron or sub-agent workloads must not inherit a thinner allowlist from a separate profile you forgot to merge. Log the resolved model for each scheduled job, not the human label you typed in Slack.
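Because the match has to be exact, it helps to name the kind of mismatch rather than just report a miss. The following sketch assumes you have already flattened the allowlist into a plain list of strings. The real shape of agents.defaults.models may differ by version, and explain_allowlist_miss is a hypothetical helper rather than an OpenClaw API.

```python
def explain_allowlist_miss(requested, allowlist):
    """Classify why `requested` is or is not in `allowlist` (exact string match)."""
    if requested in allowlist:
        return "allowed"
    # Same string apart from letter case?
    lowered = {entry.lower(): entry for entry in allowlist}
    if requested.lower() in lowered:
        return f"case mismatch: allowlist has {lowered[requested.lower()]!r}"
    # Same model name but a different or missing vendor prefix ("vendor/model")?
    bare = requested.split("/", 1)[-1]
    for entry in allowlist:
        if entry.split("/", 1)[-1] == bare:
            return f"vendor prefix mismatch: allowlist has {entry!r}"
    return "not in allowlist"
```

Running this against the resolved model string for each scheduled job separates a typo-class mismatch from a genuine policy gap.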

Gateway, PID file, and “I already killed it”

Two overlapping listeners show up when a user LaunchAgent restarts while a manual foreground process still holds the port, or when an upgrade leaves a child Node process that reparents to launchd. Run lsof -iTCP:18789 -sTCP:LISTEN and ps -ax | grep -i openclaw as the account that owns the plist, not as root; those two outputs are your source of truth. If you must force-kill, follow the sequenced stop the docs recommend, then re-run doctor before starting again. This pairs with the gateway upgrade/rollback matrix: a bad rollout is easier to see when the prior PID file, node version, and the DEVELOPER_DIR you use for sidecar native builds are all captured in a ticket template.
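The lsof check is easy to automate once its output is captured as text. The sketch below relies on the standard lsof column layout, where COMMAND is the first column and PID the second. The helper names are hypothetical. A single process that listens on both IPv4 and IPv6 appears on two rows but is counted once.

```python
def listener_pids(lsof_output):
    """Map PID -> command name from `lsof -iTCP:<port> -sTCP:LISTEN` output."""
    owners = {}
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (its second column is "PID").
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        owners[int(fields[1])] = fields[0]
    return owners


def has_duplicate_listeners(lsof_output):
    """True when more than one distinct process is listening on the port."""
    return len(listener_pids(lsof_output)) > 1
```

Feed it the captured stdout of the lsof command above. Attaching the returned PID map to the ticket gives the rollback reviewer the same view you had.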

Merge-blocking rule: a green openclaw doctor in CI after a staging upgrade on a throwaway leased host, plus a saved status --all log, is better than a screenshot of a chat window that only proves your mobile client can see Telegram.

Symptom / layer table (config vs service vs network)

| Symptom | Layer | Stabilize |
| --- | --- | --- |
| “Model not in allowlist” right after a minor bump | Config policy | Expand allowlist, pin strings, re-run a single chat in debug |
| Doctor OK, 403 on provider despite fresh API key in shell | launchd env / egress | Print env from the running PID; then run egress TLS checks |
| 1006 on WebSocket after plugin add | Node plugin graph | Re-run npm install, read gateway stderr; compare with subagent note |
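The three rows above can also be encoded as a first-pass lookup for log lines. This is a deliberately naive sketch: the substrings are assumptions about what your logs contain, not strings OpenClaw is known to emit, and a match on "403" or "1006" can be a false positive (a port number, for instance). Treat the result as a hint about where to look first.

```python
# (substring to look for, layer, first stabilizing action), mirroring the table.
TRIAGE_RULES = [
    ("allowlist", "Config policy",
     "Expand allowlist, pin strings, re-run a single chat in debug"),
    ("403", "launchd env / egress",
     "Print env from the running PID; then run egress TLS checks"),
    ("1006", "Node plugin graph",
     "Re-run npm install, read gateway stderr"),
]


def triage(message):
    """Return (layer, action) for the first matching rule, or None."""
    lowered = message.lower()
    for needle, layer, action in TRIAGE_RULES:
        if needle in lowered:
            return layer, action
    return None
```

Rules are checked in order, so put the most specific substrings first if you extend the list.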

Chain the operational spine: ClawHub skills for what you are allowed to automate, structured logging so a mis-model looks different from a mis-route, and Tailscale when the split between 100.64/ and public hostnames is part of the design. If you are still in bootstrap mode, the onboard + daemon article is the prerequisite.

FAQ: doctor on shared builders

| Question | Practical answer |
| --- | --- |
| Should I run doctor before every deploy? | At minimum after npm changes and gateway binary bumps; it is fast compared to a broken weekend chat bridge. |
| Is manual JSON ever acceptable? | When merging upstream examples, immediately follow with doctor and a chat smoke test, not a blind restart. |
| What if the region in HK lags US? | Timeboxed differences are normal; policy or egress differences are not. Split dashboards by lease label, then scale via regional plans. |

Why bare-metal M4 in five regions still wins

Diagnostics and gateway workloads care about jitter and repeatable I/O more than peak GFLOPS. A Mac mini M4 on MacXCode gives you a single-tenant NVMe to retain verbose JSON logs, persistent SSH access, and the option to snap a second node when your team splits provider allow-lists by country. If 2026 is the year you pair OpenClaw with real shipping pipelines, the same node family can also host the Xcode test+coverage gate in the other half of the platform runbooks—treat the lease as capacity you can measure with p95, not a black box you reboot until green.
