2026-04-28 OpenClaw, launchd KeepAlive, exit codes, ThrottleInterval, and respawn / crash-loop triage on a headless leased cloud Mac (HK / JP / KR / SG / US)
Most OpenClaw production installs on a Mac mini M4 sit behind SSH only, with a reverse proxy on 127.0.0.1 and JSON logs you tail through a structured logging pipeline. When the gateway process is managed by launchd, one failure mode you will eventually hit is not a mysterious LLM quirk but rapid job restart after exit: sometimes benign (you asked launchd to always run), sometimes catastrophic (bad config, missing PATH, or a crash before the first bind). This 2026-04-28 deep dive bridges provider HTTP pressure and webhook backpressure with the local process supervisor: how KeepAlive and ThrottleInterval interact, how to read first-failure lines in StandardError when respawns are noisy, and when to unload the LaunchAgent to reproduce under an interactive shell. It complements the secrets story in environment + launchd and the upgrade flow in openclaw doctor + allowlist.
On a headless cloud Mac, launchd is your init system
Unlike on a laptop, nobody is watching a Terminal when npm or openclaw gateway exits. The durable pattern is: install under a dedicated service account, log to rotated files, and let launchd own the lifecycle. A common mistake in 2026 is to compare macOS to Linux systemd one-for-one: keys such as KeepAlive and RunAtLoad are powerful but need explicit throttling, or you can accidentally turn a 2-second typo into thousands of execve events per hour, filling ~/Library/Logs and your remote logging shipper. Treat process exit as a first-class signal, not as “it crashed again, shrug.”
- User LaunchAgents: ~/Library/LaunchAgents for per-user long-running services; most OpenClaw setups land here for least privilege.
- System LaunchDaemons: require admin write and broader blast radius; avoid unless you truly need a root context.
KeepAlive: which flavor did you turn on?
In a launchd property list, KeepAlive can be a boolean or a dictionary of conditions (exit status, network state, a path that must exist, and so on). A boolean true is the sledgehammer: restart whenever the process exits for any reason, which is what you want for a network daemon you expect to run 24/7, but is wrong for a one-shot migration. The dictionary form is more nuanced: SuccessfulExit ties restarts to the exit status, PathState to whether a path exists, and NetworkState to network availability on macOS versions that still implement it. When debugging “why is this restarting?”, the first question is: did I ask launchd to keep it alive, or is the app itself exiting, given that exit ≠ crash? A clean exit 0 under KeepAlive true still triggers a restart, which is exactly the behavior you see when a wrapper script exits after the child is detached incorrectly.
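The two flavors can be sketched with Python's standard plistlib module. This is a minimal illustration, not the project's shipped plist: the label com.example.openclaw and the binary path /usr/local/bin/openclaw are placeholders, and SuccessfulExit is the KeepAlive dictionary key described in launchd.plist(5).

```python
import plistlib


def build_job(label, program_args, keep_alive):
    """Return a launchd job definition as a plain dict."""
    return {
        "Label": label,
        "ProgramArguments": list(program_args),
        "RunAtLoad": True,
        "KeepAlive": keep_alive,
    }


# Hypothetical label and binary path, for illustration only.
ARGS = ["/usr/local/bin/openclaw", "gateway"]

# Flavor 1: boolean true, restart after every exit, including exit 0.
always = build_job("com.example.openclaw", ARGS, True)

# Flavor 2: dictionary form, restart only after an unsuccessful exit.
on_failure = build_job("com.example.openclaw", ARGS, {"SuccessfulExit": False})

xml_always = plistlib.dumps(always).decode("utf-8")
xml_on_failure = plistlib.dumps(on_failure).decode("utf-8")
print(xml_on_failure)
```

With the second flavor, a wrapper that exits 0 would no longer be relaunched, which makes an accidental clean exit visible as "service is down" rather than as a silent restart loop.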
If restarts line up with ThrottleInterval to the second, the supervisor is the hammer; if the job flaps with no steady cadence, suspect unhandled promise rejections or a port bind race, not the LLM.
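One way to make that cadence check less of a judgment call is to compare the gaps between launch timestamps with the configured interval. The sketch below assumes you have already extracted launch times (in seconds) from your own logs; the function name and the 0.5-second tolerance are illustrative choices, not launchd behavior.

```python
def classify_cadence(start_times, throttle_interval, tolerance=0.5):
    """Classify launch timestamps (seconds) against ThrottleInterval.

    Returns "throttle-paced" when every gap between launches is within
    `tolerance` seconds of throttle_interval, "irregular" otherwise, and
    "insufficient-data" when fewer than three launches are available.
    """
    if len(start_times) < 3:
        return "insufficient-data"
    ordered = sorted(start_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if all(abs(gap - throttle_interval) <= tolerance for gap in gaps):
        return "throttle-paced"
    return "irregular"


print(classify_cadence([0, 10, 20.2, 30.1], throttle_interval=10))
print(classify_cadence([0, 3, 47, 52], throttle_interval=10))
```

A "throttle-paced" result points at a process that dies almost immediately on every launch; "irregular" means it runs for a varying time first, which fits the race and rejection hypotheses better.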
StandardOut / StandardError paths: make them real files
Headless SSH sessions are not TTYs; stdio to “nowhere” is how you lose the first 200 ms of a crash. Point StandardOutPath and StandardErrorPath at a directory the service user can write, rotate with macOS newsyslog or a simple daily cron job, and never rely on the interactive shell’s scrollback. When correlating with egress and TLS incidents, a single timestamp-aligned slice across stdout, nginx access, and the gateway JSON is worth more than a pretty Grafana chart.
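A timestamp-aligned slice can be sketched in a few lines of Python. The assumption here, which may not match your log formats, is that every line starts with an ISO-8601 timestamp followed by a space and that all sources use the same timezone convention; the source names are whatever you pass in.

```python
from datetime import datetime, timedelta


def parse_line(source, line):
    """Split 'ISO-8601-timestamp rest' into (datetime, source, rest)."""
    stamp, _, rest = line.rstrip("\n").partition(" ")
    return datetime.fromisoformat(stamp), source, rest


def aligned_slice(streams, center, window_seconds=5):
    """Merge lines from several logs that fall within +/- window of center.

    `streams` maps a source name to an iterable of lines. Lines whose first
    token is not a parseable timestamp are skipped.
    """
    lo = center - timedelta(seconds=window_seconds)
    hi = center + timedelta(seconds=window_seconds)
    picked = []
    for source, lines in streams.items():
        for line in lines:
            try:
                entry = parse_line(source, line)
            except ValueError:
                continue  # not a timestamped line
            if lo <= entry[0] <= hi:
                picked.append(entry)
    return sorted(picked, key=lambda entry: entry[0])
```

Passing open file handles for the stderr file, the nginx access log, and the gateway JSON log as the `streams` values gives one ordered view of the seconds around an incident.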
ThrottleInterval: the brake pedal on relaunch
ThrottleInterval sets the minimum number of seconds between successive launches of a job when launchd is driving respawns; a job that exits sooner waits out the remainder before it is relaunched. The launchd default is 10 seconds. Keep 10 if you need calm logs during initial bring-up, and drop to 1–2 in steady state when you want the service to recover from transient file-store locks without waking operators. Pair it with launchctl print gui/$(id -u)/com.example.openclaw (user domain, adjust the label) to read state and last exit. Document that number in your on-call runbook: “if Throttle=10, expect at most one restart every 10s in worst case,” which is how you set expectations when someone asks for MTTR < 1 min for software but not for infrastructure.
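The worst-case arithmetic behind that on-call note is simple enough to write down. This sketch assumes the job exits immediately on every launch and that launchd spawns it at most once per ThrottleInterval; the bytes-per-crash figure is something you would measure from your own stderr, not a known constant.

```python
def max_launches(window_seconds, throttle_interval):
    """Upper bound on relaunches in a window for a job that exits instantly."""
    if throttle_interval <= 0:
        raise ValueError("throttle_interval must be positive")
    return int(window_seconds // throttle_interval)


def worst_case_log_bytes(window_seconds, throttle_interval, bytes_per_crash):
    """Rough ceiling on stderr growth if every launch writes one stack."""
    return max_launches(window_seconds, throttle_interval) * bytes_per_crash


print(max_launches(3600, 10))  # launches per hour at ThrottleInterval 10
print(max_launches(3600, 2))   # launches per hour at ThrottleInterval 2
```

At ThrottleInterval 10 the bound is 360 launches per hour; at 2 it is 1800, which is where the "thousands of execve events" warning comes from.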
openclaw doctor may pass while launchd is still in a crash loop, because doctor is an interactive, short-lived binary with a full login environment, whereas your plist might omit PATH for node or a global npm prefix. Always diff env | sort from the service context against the environment doctor ran in.
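A small helper can make that environment diff repeatable. The sketch below assumes you have captured two `env` dumps as text, one from an interactive SSH session and one from the service context (for example by having a wrapper write `env` to a file); the function names and the list of watched variables are illustrative.

```python
def parse_env(dump):
    """Parse `env` output (KEY=value per line) into a dict.

    Multi-line values are not handled; continuation lines without '=' are ignored.
    """
    env = {}
    for line in dump.splitlines():
        key, sep, value = line.partition("=")
        if sep and key:
            env[key] = value
    return env


def env_gaps(interactive_dump, service_dump, watch=("PATH", "HOME", "NODE_OPTIONS")):
    """Report variables missing from, or different in, the service context."""
    interactive = parse_env(interactive_dump)
    service = parse_env(service_dump)
    service_dirs = service.get("PATH", "").split(":")
    return {
        "missing": sorted(k for k in interactive if k not in service),
        "differing": sorted(
            k for k in watch
            if k in interactive and k in service and interactive[k] != service[k]
        ),
        "path_dirs_lost": [
            d for d in interactive.get("PATH", "").split(":")
            if d and d not in service_dirs
        ],
    }
```

The `path_dirs_lost` entry is usually the interesting one: a directory that holds node in the login shell but is absent from the plist PATH explains a 127 exit without any further digging.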
Exit codes you will actually read in tickets
| Exit / symptom | Likely OpenClaw cause | What to do first |
|---|---|---|
| 1 (generic) + one stack in stderr | Config parse, missing OPENCLAW_ var, or bad cwd | Run the same ProgramArguments in SSH with set -a; source your.env; set +a per secrets pattern. |
| 2 (misuse) from shell wrapper | Option parsing or Mac-specific path quoting | Log raw argv in a wrapper, avoid globbing, confirm Node is not Rosetta if you want arm64 (see SSH shell profiles). |
| 126 / 127 | Not executable, or node not in plist PATH | Absolute path to the binary, which -a node in service context. |
| Signal 9 / exit 137 (Linux parlance) — on macOS often plain kill | OOM, manual kill, or a watchdog (rare) — on M4, check memory of LLM + gateway together | Look at memory pressure in Console or reduce concurrency per provider shaping. |
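The table can be turned into a lookup that a wrapper or alerting script calls with the last exit status. This is a sketch under two assumptions: statuses above 128 follow the shell convention of 128 plus the signal number, and negative values follow Python's subprocess convention. The hint strings paraphrase the table and are not output from any OpenClaw tool.

```python
import signal


def describe_exit(status):
    """Map an exit status to a (category, first_step) pair."""
    if status < 0:
        sig = -status          # subprocess style: -9 means SIGKILL
    elif status > 128:
        sig = status - 128     # shell style: 137 means signal 9
    else:
        sig = None
    if sig is not None:
        try:
            name = signal.Signals(sig).name
        except ValueError:
            name = f"signal {sig}"
        return (f"killed by {name}", "check memory pressure and who sent the signal")
    table = {
        0: ("clean exit", "check for a wrapper that exits without exec"),
        1: ("generic failure", "rerun ProgramArguments over SSH with the same env"),
        2: ("usage error", "log raw argv in the wrapper and check quoting"),
        126: ("not executable", "check permissions on the binary"),
        127: ("command not found", "use an absolute path or fix PATH in the plist"),
    }
    return table.get(status, ("unclassified", "read the first stderr line"))


print(describe_exit(127))
```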
Crash-loop triage: six questions before you change the model
- Is the port in use? Another stray gateway or a zombie listener can make bind fail; see nginx reverse proxy and port alignment.
- Is this exit 0 in a loop? A wrapper that exits 0 can still trigger KeepAlive if you did not exec into the child.
- Is stderr identical each time? Identical one-line error → deterministic config. Changing stack → race or network flake.
- What changed last? Tie to git tag or npm shrinkwrap; release discipline culture helps even for non-App-Store services.
- Is upstream returning 401 while your retry loop never stops? You may need a circuit breaker, not more KeepAlive pressure; see 429/503 in LLM rate limits.
- Have you throttled the hammer? If not, your first “fix” is often just ThrottleInterval to read logs, then a real code/config fix.
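The first question in the list, whether the port is already taken, can be answered without starting the gateway. The probe below is a minimal sketch using only the standard socket module; it checks 127.0.0.1 by default because that is where the reverse proxy targets the gateway in this setup, and the port number is whatever your own config uses.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP listener could bind host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True
```

The probe deliberately does not set SO_REUSEADDR, so a port with lingering connections may read as busy for a short time after the previous process exits; treat a single False as a prompt to look for a stray listener, not as proof.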
Run log stream --style compact --predicate 'process == "openclaw"' only after you have reduced the restart rate, or the noise will hide the first signal line you needed.
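Once the restart rate is down, the "is stderr identical each time?" question can be checked mechanically by comparing the first non-empty stderr line of each run. The sketch assumes you can split the stderr file into per-run chunks yourself; the normalization pattern (timestamps, pid=NNN, hex addresses) is a guess at what varies between runs and will need tuning for real output.

```python
import re
from collections import Counter

# Parts of a line that change between otherwise identical failures.
_VOLATILE = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(\.\d+)?Z?"  # timestamps
    r"|\bpid[ =]\d+\b"                                    # process ids
    r"|0x[0-9a-fA-F]+"                                    # addresses
)


def fingerprint(line):
    """Strip volatile tokens so repeated failures compare equal."""
    return _VOLATILE.sub("", line).strip()


def first_failures(per_run_stderr):
    """Count the normalized first non-empty stderr line of each run."""
    firsts = []
    for text in per_run_stderr:
        for line in text.splitlines():
            if line.strip():
                firsts.append(fingerprint(line))
                break
    return Counter(firsts)


def looks_deterministic(per_run_stderr):
    """True when every run that produced output failed with the same first line."""
    return len(first_failures(per_run_stderr)) == 1
```

A single fingerprint across runs points at config; several distinct ones point at a race or a network flake.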
Plist operational checklist (copy into your Confluence / Notion)
- Label: unique reverse-DNS, never reuse another team’s com. prefix in the same UID.
- ProgramArguments[0]: full path, no “nice” surprises from login shells.
- WorkingDirectory: repo root for config rel paths; matches what openclaw doctor assumes.
- EnvironmentVariables: HOME, PATH, NODE_OPTIONS (if you must); align with the secrets doc.
- KeepAlive + ThrottleInterval: do not set aggressive restart and a tiny stderr limit without rotation.
plutil -lint ~/Library/LaunchAgents/com.yourorg.openclaw.plist
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.yourorg.openclaw.plist
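plutil -lint only verifies that the file is a well-formed plist; it does not know about the checklist above. A small linter can cover that gap. The sketch below reads a plist with plistlib and applies the checklist items as simple rules; the rules and messages are this article's conventions expressed as code, not anything launchd enforces.

```python
import plistlib
import posixpath


def lint_job(job):
    """Return a list of checklist violations for a launchd job dict."""
    problems = []
    if job.get("Label", "").count(".") < 2:
        problems.append("Label: use a reverse-DNS name such as com.yourorg.openclaw")
    args = job.get("ProgramArguments") or []
    if not args or not posixpath.isabs(args[0]):
        problems.append("ProgramArguments[0]: must be an absolute path")
    if not job.get("WorkingDirectory"):
        problems.append("WorkingDirectory: not set")
    env = job.get("EnvironmentVariables") or {}
    for key in ("HOME", "PATH"):
        if key not in env:
            problems.append(f"EnvironmentVariables: {key} missing")
    if job.get("KeepAlive") and "ThrottleInterval" not in job:
        problems.append("KeepAlive set without an explicit ThrottleInterval")
    for key in ("StandardOutPath", "StandardErrorPath"):
        if not posixpath.isabs(job.get(key, "")):
            problems.append(f"{key}: point at an absolute, writable file")
    return problems


def lint_file(path):
    """Load a plist from disk and lint it."""
    with open(path, "rb") as handle:
        return lint_job(plistlib.load(handle))
```

Running `lint_file` on the plist before `launchctl bootstrap` keeps the checklist from depending on someone remembering the wiki page.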
Seven ops steps when the page fires “OpenClaw down in SG”
- Check edge health (nginx): a 502 usually means nginx could not reach the gateway at all (process down or refusing connections), while a 504 means it connected but timed out waiting for a response; cross-check DNS / TLS if outbound fails.
- Check launchd state: launchctl print for the label; read last exit and throttle windows.
- Stabilize logs: raise ThrottleInterval if still flapping.
- Interactive repro under the same UID without launchd.
- Config diff and secret rotation with two-person review.
- Roll forward to a known-good npm version, not a mystery patch on prod.
- Post-incident: one paragraph in the ticket covering root cause, detection gap, and guard (e.g. exit-code alert if non-zero in a 1-minute window).
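The guard mentioned in the last step, an alert on non-zero exits inside a one-minute window, can be sketched as a small sliding-window counter. How exit events reach it (a wrapper script, a log parser) is left open, and the threshold parameter is an added assumption so that a single transient failure does not have to page anyone; set it to 1 to match the wording above exactly.

```python
from collections import deque


class ExitAlert:
    """Fire when `threshold` non-zero exits land inside `window` seconds."""

    def __init__(self, threshold=1, window=60.0):
        self.threshold = threshold
        self.window = window
        self._failures = deque()

    def record(self, timestamp, exit_code):
        """Record one exit; return True if the alert should fire."""
        if exit_code != 0:
            self._failures.append(timestamp)
        # Drop failures that have aged out of the window.
        while self._failures and timestamp - self._failures[0] > self.window:
            self._failures.popleft()
        return len(self._failures) >= self.threshold
```

Timestamps are plain seconds and must be fed in non-decreasing order; clean exits are still recorded so that old failures age out even when the job is otherwise healthy.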
FAQ: launchd vs the OpenClaw gateway itself
| Question | 2026-04-28 answer |
|---|---|
| Should I set ProcessType to Interactive? | Usually no for a server daemon; Interactive changes scheduling priorities. Prefer standard background behavior unless Apple docs for your macOS build recommend otherwise for your workload class. |
| Is caffeinate required? | On a leased 24/7 Mac mini M4 in a data hall, sleep is typically disabled at the power policy level; if you are on a laptop-style host (not MacXCode’s case), you still do not “fix” flapping with caffeinate if the true bug is a crash on resume. |
Why Apple Silicon Mac mini M4 still shows up in this runbook
launchd is only as honest as the hardware beneath it. A bare-metal Mac mini M4 in Hong Kong, Tokyo, Seoul, Singapore, or the United States with 1–2 TB and predictable CPU scheduling makes “did we OOM the gateway with too many tool calls?” a measurable question, not a vibe. The same MPS-friendly cores that help with local ML experiments also give you headroom to run structured log forwarders, reverse proxies, and the OpenClaw stack on one billable unit, which is why teams pair these articles with regional node pricing and access guidance when they outgrow a laptop tied to a single human owner.