What does ThrottleInterval do?

It sets the minimum delay, in seconds, between relaunch attempts after a job exits, which dampens tight crash loops and log amplification.

Should I use RunAtLoad and KeepAlive together?

Yes, that is common for long-running daemons. RunAtLoad starts the job as soon as it is loaded (at login or bootstrap time), and KeepAlive can restart it after an unexpected exit. Pair the two with ThrottleInterval to avoid unbounded flapping.

AI / Automation April 28, 2026

2026-04-28 OpenClaw, launchd `KeepAlive`, exit codes, `ThrottleInterval`, and respawn / crash-loop triage on a headless leased cloud Mac (HK / JP / KR / SG / US)

MacXCode Engineering Team April 28, 2026 ~20 min read

Most OpenClaw production installs on a Mac mini M4 sit behind SSH only, with a reverse proxy on 127.0.0.1 and JSON logs you tail through structured logging. When launchd manages the gateway process, one failure mode you will eventually hit is not a mysterious LLM quirk but rapid job restart after exit. Sometimes it is benign (you asked launchd to always run the job); sometimes it is catastrophic (bad config, missing PATH, or a crash before the first bind). This 2026-04-28 deep dive connects provider HTTP pressure and webhook backpressure to the local process supervisor: how KeepAlive and ThrottleInterval interact, how to find the first-failure line in StandardError when respawns are noisy, and when to unload the LaunchAgent and reproduce the failure in an interactive shell. It complements the secrets story in environment + launchd and the upgrade flow in openclaw doctor + allowlist.

On a headless cloud Mac, launchd is your init system

Unlike a laptop, you do not have a user staring at the Terminal when npm or openclaw gateway exits. The durable pattern is: install under a dedicated service account, log to rotated files, and let launchd own lifecycle. A common mistake in 2026 is to compare macOS to Linux systemd one-for-one: keys such as KeepAlive and RunAtLoad are powerful but require explicit throttling, or you can accidentally turn a 2-second typo into thousands of execve events per hour—filling ~/Library/Logs and your remote logging shipper. Treat process exit as a first-class signal, not as “it crashed again, shrug.”

User LaunchAgents — ~/Library/LaunchAgents for per-user long-running services; most OpenClaw setups land here for least privilege.
System LaunchDaemons — require admin write and broader blast radius; avoid unless you truly need a root context.

KeepAlive: which flavor did you turn on?

In a launchd property list, KeepAlive can be a boolean or a dictionary of conditions (exit status, path state, and so on). A boolean true is the sledgehammer: launchd restarts the job whenever the process exits, for any reason. That is what you want for a network daemon you expect to run 24/7, and wrong for a one-shot migration. The dictionary form is more nuanced, for example SuccessfulExit or PathState (a path that must exist). NetworkState also appears in older examples, but the launchd.plist(5) man page on recent macOS releases marks it as no longer implemented, so check `man launchd.plist` on your build before relying on it. When debugging “why is this restarting?”, first ask two things: did I tell launchd to keep the job alive, and is the app exiting on its own (an exit is not necessarily a crash)? A clean exit 0 under KeepAlive true still triggers a restart, which is exactly the behavior you see when a wrapper script exits after detaching its child incorrectly.
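The two flavors can be sketched as plist fragments. This is a minimal sketch, not a recommended configuration: the function name `keepalive_fragment` is hypothetical, and `SuccessfulExit` is the launchd.plist(5) dictionary key that, when false, asks launchd to relaunch only after an unsuccessful exit.

```shell
#!/bin/sh
# Print a KeepAlive plist fragment for one of two modes.
#   always  -> boolean true: relaunch after every exit, including exit 0
#   on-fail -> dictionary form: relaunch only after an unsuccessful exit
keepalive_fragment() {
  case "$1" in
    always)
      printf '%s\n' '<key>KeepAlive</key>' '<true/>'
      ;;
    on-fail)
      printf '%s\n' \
        '<key>KeepAlive</key>' \
        '<dict>' \
        '  <key>SuccessfulExit</key>' \
        '  <false/>' \
        '</dict>'
      ;;
    *)
      echo "usage: keepalive_fragment always|on-fail" >&2
      return 2
      ;;
  esac
}
```

With the `on-fail` form, a wrapper that exits 0 no longer causes a relaunch, which makes the "clean exit in a loop" case easier to tell apart from a real crash.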

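Heuristic (2026-04-28): if the service restarts on a steady cadence that matches your ThrottleInterval almost to the second (or a fixed interval slightly above it), the supervisor is the hammer: the job dies almost immediately and launchd relaunches it as fast as the throttle allows. If it flaps with no steady cadence, suspect unhandled promise rejections or a port bind race, not the LLM.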
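The cadence heuristic can be checked mechanically. The sketch below assumes you have already extracted restart times as epoch seconds, one per line (how you extract them depends on your log format); `restart_cadence` is a hypothetical helper name and the 1-second tolerance is an arbitrary choice.

```shell
#!/bin/sh
# Classify restart cadence from epoch-second timestamps on stdin.
# Prints "steady" when every gap between consecutive restarts is within
# 1 second of the given ThrottleInterval, "irregular" otherwise.
restart_cadence() {
  awk -v t="$1" '
    BEGIN { gaps = 0; irregular = 0 }
    NR > 1 {
      d = ($1 - prev) - t
      if (d < 0) d = -d
      if (d > 1) irregular = 1
      gaps++
    }
    { prev = $1 }
    END {
      if (gaps == 0)      print "insufficient-data"
      else if (irregular) print "irregular"
      else                print "steady"
    }'
}
```

For example, `printf '%s\n' 100 110 120 | restart_cadence 10` reports a steady cadence, which points at launchd respawning a job that dies right after launch.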

StandardOut / StandardError paths: make them real files

A launchd job has no TTY; stdio pointed at “nowhere” is how you lose the first 200 ms of a crash. Point StandardOutPath and StandardErrorPath at files in a directory the service user can write, rotate them with macOS newsyslog or a simple daily cron job, and never rely on an interactive shell’s scrollback. When correlating with egress and TLS incidents, a single timestamp-aligned slice across stdout, the nginx access log, and the gateway JSON is worth more than a pretty Grafana chart.

ThrottleInterval: the brake pedal on relaunch

ThrottleInterval is the minimum number of seconds between launches of a job that launchd is respawning; the documented default is 10. Use 10 when you need calm logs during initial bring-up, and 1–2 in steady state when you want the service to recover from transient file-store locks without waking operators. Pair it with `launchctl print gui/$(id -u)/com.example.openclaw` (user domain; adjust the label) to read the job state and last exit status. Document the number in your on-call runbook: “if ThrottleInterval=10, expect at most one restart every 10 s in the worst case.” That is how you set expectations when someone asks for MTTR < 1 min for software but not for infrastructure.
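The on-call expectation is simple arithmetic, sketched here as a helper. `max_restarts` is a hypothetical name, and the bound assumes the job exits immediately every time and that launchd applies the configured interval exactly.

```shell
#!/bin/sh
# Upper bound on launchd-driven relaunches within a window.
#   $1 = ThrottleInterval in seconds, $2 = window length in seconds
max_restarts() {
  throttle=$1
  window=$2
  if [ "$throttle" -le 0 ]; then
    echo "throttle must be positive" >&2
    return 2
  fi
  echo $(( window / throttle ))
}

max_restarts 10 3600   # prints 360
max_restarts 2 3600    # prints 1800
```

The second call shows why a 1–2 second throttle belongs in steady state only: a deterministic config error at that setting still produces well over a thousand launches per hour.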

openclaw doctor may pass while launchd is still in a crash loop, because doctor is an interactive, short-lived binary that runs with a full login environment, while your plist might omit the PATH entry for node or a global npm prefix. Always diff `env | sort` from the shell where doctor passed against the environment the job actually receives under launchd.
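One way to make that comparison repeatable is to diff two saved `env` dumps by key. This is a sketch under assumptions: `env_compare` is a hypothetical helper, both files hold one `KEY=value` per line (multi-line values are not handled), and the service-side dump was captured separately, for instance by temporarily pointing the job's ProgramArguments at `/usr/bin/env` so the output lands in StandardOutPath.

```shell
#!/bin/sh
# Compare two env dumps: $1 = interactive shell, $2 = launchd service.
# Reports keys that exist on one side only and keys whose values differ.
env_compare() {
  awk -F= '
    { key = $1; val = substr($0, length(key) + 2) }
    NR == FNR { a[key] = val; next }          # first file: remember values
    {
      seen[key] = 1
      if (!(key in a))        print "only-in-service " key
      else if (a[key] != val) print "differs " key
    }
    END { for (k in a) if (!(k in seen)) print "only-in-interactive " k }
  ' "$1" "$2" | sort
}
```

A `differs PATH` line is the usual smoking gun: the key exists under launchd, but without the directory that holds `node`.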

Exit codes you will actually read in tickets

| Exit / symptom | Likely OpenClaw cause | What to do first |
| --- | --- | --- |
| `1` (generic) + one stack in stderr | Config parse error, missing `OPENCLAW_` var, or bad `cwd` | Run the same `ProgramArguments` over SSH with `set -a; source your.env; set +a`, per the secrets pattern. |
| `2` (misuse) from shell wrapper | Option parsing or Mac-specific path quoting | Log raw argv in the wrapper, avoid globbing, and confirm Node is not running under Rosetta if you want arm64 (see SSH shell profiles). |
| `126` / `127` | Not executable (`126`), or `node` not on the plist `PATH` (`127`) | Use an absolute path to the binary; run `which -a node` in the service context. |
| Signal 9 / exit `137` (128 + 9, as a shell reports it); on macOS often a plain kill | OOM, manual kill, or (rarely) a watchdog; on M4, check the combined memory of LLM + gateway | Look at `memory pressure` in Console, or reduce concurrency per provider shaping. |
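The table can be turned into a first-pass triage helper for tickets. This is a sketch: `classify_exit` is a hypothetical name, the messages are shorthand for the rows above, and the 128 + N rule is the shell convention for reporting death by signal N, not something launchd-specific.

```shell
#!/bin/sh
# Map an exit status to a first triage hint.
classify_exit() {
  code=$1
  case "$code" in
    0)   echo "clean exit; KeepAlive true still relaunches" ;;
    1)   echo "generic failure; check config, env vars, cwd" ;;
    2)   echo "usage error; check argument parsing and quoting" ;;
    126) echo "found but not executable; check permissions" ;;
    127) echo "command not found; check PATH or use an absolute path" ;;
    *)
      # Statuses above 128 conventionally mean "killed by signal (status - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(( code - 128 ))"
      else
        echo "application-specific exit $code"
      fi
      ;;
  esac
}
```

For instance, `classify_exit 137` reports signal 9, which is the cue to look at memory before touching the config.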

Crash-loop triage: six questions before you change the model

Is the port in use? Another stray gateway or a zombie listener can make bind fail; see nginx reverse proxy and port alignment.
Is this exit 0 in a loop? A wrapper that exits 0 can still trigger KeepAlive if you did not exec into the child.
Is stderr identical each time? Identical one-line error → deterministic config. Changing stack → race or network flake.
What changed last? Tie to git tag or npm shrinkwrap; release discipline culture helps even for non-App-Store services.
Is upstream returning 401 while your retry loop never stops? You may need a circuit breaker, not more KeepAlive pressure—see 429/503 in LLM rate limits.
Have you throttled the hammer? If not, your first “fix” is often just ThrottleInterval to read logs, then a real code/config fix.
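The "exit 0 in a loop" question usually comes down to whether the wrapper uses `exec`. The sketch below generates a wrapper that loads an env file and then replaces itself with the real program, so the PID launchd tracks is the gateway's own and the exit status is the program's. `write_exec_wrapper` and all paths are hypothetical, and the quoting is only safe for simple paths without embedded double quotes.

```shell
#!/bin/sh
# Write a wrapper script: export vars from an env file, then exec the program.
#   $1 = wrapper path, $2 = env file, remaining args = program and arguments
write_exec_wrapper() {
  wrapper=$1
  env_file=$2
  shift 2
  {
    echo '#!/bin/sh'
    echo 'set -a'                       # export everything the env file sets
    printf '. "%s"\n' "$env_file"
    echo 'set +a'
    # exec replaces the shell: no fork, no lingering parent that exits 0.
    printf 'exec'
    for arg in "$@"; do printf ' "%s"' "$arg"; done
    printf ' "$@"\n'
  } > "$wrapper"
  chmod +x "$wrapper"
}
```

A wrapper that instead runs the program in the background and returns looks like a clean exit to launchd, and KeepAlive true will start a second copy while the first still holds the port.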

Log hygiene: in crash loops, prefer log stream --style compact --predicate 'process == "openclaw"' only after you have reduced the restart rate, or the noise will hide the first signal line you needed.

Plist operational checklist (copy into your Confluence / Notion)

Label — unique reverse-DNS, never reuse another team’s com. prefix in the same UID.
ProgramArguments[0] — full path, no “nice” surprises from login shells.
WorkingDirectory — repo root for config rel paths; matches what openclaw doctor assumes.
EnvironmentVariables — HOME, PATH, NODE_OPTIONS (if you must) — align with the secrets doc.
KeepAlive + ThrottleInterval — do not set aggressive restart and a tiny stderr limit without rotation.

plutil -lint ~/Library/LaunchAgents/com.yourorg.openclaw.plist
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.yourorg.openclaw.plist
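A plist that satisfies the checklist might look like the skeleton this function writes. Treat every value as a placeholder: the binary path `/opt/homebrew/bin/openclaw`, the `openclaw` working directory, the log file names, and the function name `write_agent_plist` are all assumptions for illustration, and the log directory has to exist before the job is bootstrapped.

```shell
#!/bin/sh
# Write a LaunchAgent plist skeleton covering the checklist keys.
#   $1 = label, $2 = service account home, $3 = destination plist path
write_agent_plist() {
  label=$1
  svc_home=$2
  out=$3
  cat > "$out" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>${label}</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/openclaw</string>
    <string>gateway</string>
  </array>
  <key>WorkingDirectory</key>
  <string>${svc_home}/openclaw</string>
  <key>EnvironmentVariables</key>
  <dict>
    <key>HOME</key>
    <string>${svc_home}</string>
    <key>PATH</key>
    <string>/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
  </dict>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>ThrottleInterval</key>
  <integer>10</integer>
  <key>StandardOutPath</key>
  <string>${svc_home}/Library/Logs/openclaw/stdout.log</string>
  <key>StandardErrorPath</key>
  <string>${svc_home}/Library/Logs/openclaw/stderr.log</string>
</dict>
</plist>
EOF
  # Lint only where plutil exists (macOS); elsewhere just write the file.
  if command -v plutil >/dev/null 2>&1; then
    plutil -lint "$out"
  fi
}
```

Starting with ThrottleInterval at 10 keeps bring-up logs readable; lower it only once the job survives its first bind.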

Seven ops steps when the page fires “OpenClaw down in SG”

Check edge health (nginx) — 502 vs 504 localizes to upstream vs the gateway; cross-check DNS / TLS if outbound fails.
Check launchd state — launchctl print for the label; read last exit and throttle windows.
Stabilize logs — ThrottleInterval if still flapping.
Interactive repro under the same UID without launchd.
Config diff and secret rotation with two-person review.
Roll forward to a known-good npm version, not a mystery patch on prod.
Post-incident — one paragraph in the ticket: root cause, detection gap, and guard (e.g. exit-code alert if non-zero in a 1-minute window).
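The guard mentioned in the last step can be prototyped in a few lines. This sketch assumes you already record one line per exit as `epoch_seconds exit_code` (for example from a wrapper or a log parser; that recording step is not shown), and `nonzero_exit_alert` is a hypothetical helper name.

```shell
#!/bin/sh
# Read "epoch exit_code" records on stdin and flag non-zero exits that fall
# inside the trailing window.
#   $1 = current time in epoch seconds, $2 = window in seconds (default 60)
nonzero_exit_alert() {
  awk -v now="$1" -v win="${2:-60}" '
    BEGIN { n = 0 }
    $1 > now - win && $1 <= now && $2 != 0 { n++ }
    END { if (n > 0) print "ALERT " n; else print "ok" }'
}
```

A call such as `nonzero_exit_alert "$(date +%s)" 60 < exits.log` fits in a cron job or a log shipper hook; `exits.log` is a placeholder file name.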

FAQ: launchd vs the OpenClaw gateway itself

| Question | 2026-04-28 answer |
| --- | --- |
| Should I set `ProcessType` to Interactive? | Usually no for a server daemon; Interactive changes scheduling priorities. Prefer standard background behavior unless Apple docs for your macOS build recommend otherwise for your workload class. |
| Is `caffeinate` required? | Usually not. On a leased 24/7 Mac mini M4 in a data hall, sleep is typically disabled at the power policy level. On a laptop-style host (not MacXCode’s case), `caffeinate` still does not fix flapping if the true bug is a crash on resume. |

Why Apple Silicon Mac mini M4 still shows up in this runbook

launchd is only as honest as the hardware beneath it. A bare-metal Mac mini M4 in Hong Kong, Tokyo, Seoul, Singapore, or the United States with 1–2 TB of storage and predictable CPU scheduling makes “did we OOM the gateway with too many tool calls?” a measurable question, not a vibe. The same MPS-friendly cores that help with local ML experiments also give you headroom to run structured log forwarders, reverse proxies, and the OpenClaw stack on one billable unit, which is why teams pair these articles with regional node pricing and access guidance when they outgrow a laptop tied to a single human owner.

