Is TCP open on 18789 enough for readiness?

No—readiness should validate application semantics: HTTP status, JSON schema, or a gateway subcommand that exercises model routing configuration.

How often should synthetic probes run?

Every 30–120 seconds for liveness; deeper checks that call upstream LLMs should be less frequent to avoid burning tokens—use a canary job every 15 minutes instead.

What log path should operators tail first?

Start with the gateway logs under ~/.openclaw/logs, following the structured logging guidance; when a reverse proxy fronts the gateway, correlate them with the nginx access logs.

AI / Automation April 14, 2026

2026 OpenClaw Health Probes & Readiness Checks on Production Leased Cloud Mac

MacXCode Engineering Team April 14, 2026 ~11 min read

Running OpenClaw 24/7 on a leased Mac mini M4 in Hong Kong, Japan, Korea, Singapore, or the United States means your gateway on 127.0.0.1:18789 becomes part of production infrastructure. Kubernetes users already speak liveness vs readiness; macOS + launchd shops need the same discipline without a kubelet. This 2026 guide covers which signals to graph, a probe comparison table, a six-step runbook, and alert thresholds that avoid both silent failure and pager fatigue. Pair it with the guides on gateway troubleshooting, structured logging, nginx ingress for webhooks, and Tailscale mesh access when failures span network and process layers.

Why “Process Running” Is Not a Health Check

launchd can show the job as running, with a last exit status of 0, while the gateway is wedged: a stale TLS context, flapping DNS for the model provider, or partial config writes under ~/.openclaw. Good probes exercise the same code paths that customer traffic hits (HTTP handlers, auth middleware, and optional downstream model pings) without hammering paid APIs.

Liveness answers “should we restart the gateway?” — cheap, every 60 seconds.
Readiness answers “should the load balancer send traffic?” — stricter, may include dependency checks.
Canary sends a synthetic user message every 15 minutes to catch subtle regressions; budget tokens explicitly.

Golden rule: never point external monitors directly at 18789 on the public internet—terminate TLS on nginx or keep checks inside your tailnet per Tailscale ACLs.
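
One way to back the golden rule with a check is to confirm that nothing listens on 18789 outside loopback. The sketch below parses `lsof` output; the function name is illustrative, and it assumes the usual `lsof` column layout in which the bound address is the second-to-last field.

```shell
# Reads `lsof -nP -iTCP:<port> -sTCP:LISTEN` output on stdin and succeeds only
# if at least one listener exists and every listener is bound to loopback.
listeners_loopback_only() {
  awk '
    NF < 2 { next }                          # ignore blank or malformed lines
    NR == 1 && $1 == "COMMAND" { next }      # skip the lsof header row
    {
      addr = $(NF - 1)                       # e.g. 127.0.0.1:18789 or *:18789
      seen = 1
      if (addr !~ /^127\./ && addr !~ /^\[::1\]/) bad = 1
    }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}
```

Usage: lsof -nP -iTCP:18789 -sTCP:LISTEN | listeners_loopback_only || echo "gateway reachable beyond loopback"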

Signals Operators Should Graph Before On-Call Week

Minimum viable dashboards for MacXCode customers running agents in production:

Request rate + p95 latency from nginx $request_time when a reverse proxy fronts the gateway.
Error ratio: 5xx responses divided by total requests; alert when it exceeds 2% for 5 minutes, excluding known maintenance windows.
CPU sustained above 85% for 10 minutes: on small instances this often precedes thermal throttling; the M4 rarely hits thermal limits, but bursty embedding workloads can still spike CPU.
Disk free below 12 GB on the root APFS volume: log rotation under ~/.openclaw/logs stalls when the filesystem is tight.
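
Two of these signals can be approximated on the host without a metrics stack. The sketch below assumes nginx's default "combined" log format, where the status code is the ninth whitespace-separated field; the function names are illustrative, and a custom log_format would need a different field index.

```shell
# Percentage of 5xx responses among access-log lines read from stdin.
# Assumes the "combined" format, where the HTTP status is field 9.
error_ratio_pct() {
  awk '
    $9 ~ /^[0-9][0-9][0-9]$/ { total++; if ($9 >= 500) errors++ }
    END {
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", 100 * errors / total
    }
  '
}

# Succeeds when the percentage in $1 exceeds the threshold in $2 (default 2).
ratio_exceeds() {
  awk -v r="$1" -v t="${2:-2}" 'BEGIN { exit ((r + 0 > t + 0) ? 0 : 1) }'
}

# Free space in whole GB on the filesystem that holds $1 (default: /).
disk_free_gb() {
  df -Pk "${1:-/}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}
```

Feeding error_ratio_pct only the last five minutes of log lines is left to the caller, since how logs are windowed depends on the rotation setup.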

Probe Types: What Each Proves

Probe	Proves	Cost / risk
TCP connect to `127.0.0.1:18789`	Socket accept loop alive	Low signal; misses auth failures
HTTP GET `/health` (path per build)	HTTP stack + config load	Preferred baseline liveness
Authenticated synthetic chat	Model routing + credentials	Consumes tokens; run as canary
Disk inode + free space	Log rotation health	Cheap host-level guardrail
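
The gap between the TCP row and the HTTP row is easier to see in code. This sketch checks both the status code and the response body; the `{"status":"ok"}` body is a hypothetical schema, since the real health path and payload vary per build.

```shell
# Classify a health response. $1 = HTTP status code, $2 = response body.
# The {"status":"ok"} body is a hypothetical schema; match what your build returns.
classify_health() {
  local code="$1" body="$2"
  case "$code" in
    200) ;;
    000) echo "unreachable"; return 2 ;;
    *)   echo "bad-status:$code"; return 1 ;;
  esac
  if printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'; then
    echo "healthy"
  else
    echo "bad-body"
    return 1
  fi
}

# Fetch the endpoint and classify the result. With -w '%{http_code}', curl
# prints 000 when it received no HTTP response at all.
probe_health() {
  local url="${1:-http://127.0.0.1:18789/health}" body_file code rc
  body_file="$(mktemp)" || return 2
  code="$(curl -sS --max-time 3 -o "$body_file" -w '%{http_code}' "$url" 2>/dev/null)"
  classify_health "${code:-000}" "$(cat "$body_file")"
  rc=$?
  rm -f "$body_file"
  return "$rc"
}
```

A bare TCP connect would pass in all three failure branches that return non-zero here, which is why it ranks as low signal.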

Six-Step Runbook: From Nothing to PagerDuty-Ready

1. Baseline: capture `openclaw gateway status` output after a clean boot; store it in git.
2. Author probe script: curl with --fail and a 3-second --max-time; exit non-zero on failure.
3. launchd plist: StartInterval 60; ThrottleInterval to avoid restart storms; send stdout and stderr to one log file.
4. Correlation IDs: echo an ISO 8601 timestamp per check into the logs for cross-search with nginx.
5. Wire alerts: three consecutive failures page on-call; a single failure posts to Slack only.
6. Game day: run a quarterly drill that kills the gateway intentionally, and measure MTTR against a 15-minute SLO.

curl -fsS --max-time 3 http://127.0.0.1:18789/health || exit 1
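
For the launchd step, a job definition along these lines could run the probe script every minute. This is a sketch: the label, script path, log path, and the ThrottleInterval value of 30 are placeholders, not OpenClaw defaults. The keys themselves (StartInterval, ThrottleInterval, StandardOutPath, StandardErrorPath) are standard launchd.plist keys.

```shell
# Print a launchd job definition that runs the probe script every 60 seconds.
# $1 = label, $2 = probe script path, $3 = log file. All defaults are placeholders.
write_probe_plist() {
  local label="${1:-com.example.openclaw-probe}"
  local script="${2:-/usr/local/bin/openclaw-probe.sh}"
  local log="${3:-/tmp/openclaw-probe.log}"
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>${label}</string>
  <key>ProgramArguments</key>
  <array><string>/bin/sh</string><string>${script}</string></array>
  <key>StartInterval</key><integer>60</integer>
  <key>ThrottleInterval</key><integer>30</integer>
  <key>StandardOutPath</key><string>${log}</string>
  <key>StandardErrorPath</key><string>${log}</string>
</dict>
</plist>
EOF
}
```

Redirect the output into ~/Library/LaunchAgents/ under the service user, validate it with plutil -lint, then load it with launchctl. Use absolute paths in the plist, because launchd does not expand ~.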

How Probes Compose with Nginx and Tailscale

When nginx terminates TLS, run liveness against the internal URL to isolate edge misconfig from gateway bugs. For tailnet-only deployments, run synthetics from a tagged probe device in Tailscale so ACL changes do not silently disable monitors.
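
Running the same check against both URLs lets a script say which layer to look at first. A minimal sketch, with an illustrative function name and a placeholder edge hostname:

```shell
# Map two probe exit codes (0 = pass) to the layer most likely at fault.
# $1 = result of the internal check, $2 = result of the check through nginx.
localize_failure() {
  local internal_rc="$1" edge_rc="$2"
  if [ "$internal_rc" -eq 0 ] && [ "$edge_rc" -eq 0 ]; then
    echo "ok"
  elif [ "$internal_rc" -eq 0 ]; then
    echo "edge"          # gateway answers locally: suspect nginx, TLS, DNS, or ACLs
  elif [ "$edge_rc" -eq 0 ]; then
    echo "inconsistent"  # edge passes while internal fails: check the probe itself
  else
    echo "gateway"       # both fail: start with the gateway process
  fi
}

# Example wiring (gateway.example.com is a placeholder):
#   curl -fsS --max-time 3 http://127.0.0.1:18789/health >/dev/null; internal=$?
#   curl -fsS --max-time 3 https://gateway.example.com/health >/dev/null; edge=$?
#   localize_failure "$internal" "$edge"
```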

Alerting Thresholds That Avoid Noise

Condition	Suggested window	Severity
3 consecutive probe failures	~3 minutes if interval 60s	Page on-call
p95 latency > 800 ms internal hop	10 minutes sustained	Warning ticket
Canary LLM failure	1 failure	Slack + auto-open bridge issue
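
The consecutive-failure rule needs state that survives between launchd runs. One way to sketch it is a counter file that resets on success; the function name, state-file approach, and log line layout are illustrative, and the actual Slack or pager call is left to the caller.

```shell
# Record one probe result and print a timestamped line naming the action.
# $1 = probe exit code, $2 = state file holding the consecutive-failure count.
record_probe_result() {
  local rc="$1" state="$2" fails=0 action="none"
  [ -f "$state" ] && fails="$(cat "$state")"
  if [ "$rc" -eq 0 ]; then
    fails=0
  else
    fails=$((fails + 1))
  fi
  printf '%s\n' "$fails" > "$state"

  if [ "$fails" -ge 3 ]; then
    action="page"      # three or more consecutive failures
  elif [ "$fails" -ge 1 ]; then
    action="slack"     # first or second failure
  fi
  # A UTC ISO 8601 timestamp keeps the line searchable next to nginx logs.
  printf '%s probe rc=%s consecutive_failures=%s action=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$rc" "$fails" "$action"
}
```

Because the action stays at "page" for every failure after the third, the pager integration should deduplicate on its side.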

Token budget: cap canary prompts at 400 completion tokens and use the cheapest model profile that still exercises routing—save flagship models for real users.
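
To make the budget explicit, the worst-case daily spend follows from the canary interval and the completion cap. A small worked calculation, with an illustrative function name; prompt tokens are typically billed on top and are not counted here.

```shell
# Upper bound on completion tokens per day for a canary.
# $1 = minutes between runs, $2 = completion-token cap per run.
canary_daily_completion_tokens() {
  local interval_min="$1" cap="$2"
  echo $(( (24 * 60 / interval_min) * cap ))
}

canary_daily_completion_tokens 15 400   # 96 runs x 400 tokens: prints 38400
```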

After probes stay green, harden how API keys reach the daemon—see OpenClaw environment variables & secrets on headless leased Mac (2026-04-15) for launchd-safe precedence, OPENCLAW_STATE_DIR splits, and rotation without dropping nginx ingress.

FAQ: Probes on macOS Cloud Macs

Question	Answer
Should probes run as root?	No—use the same service user that owns `~/.openclaw` to catch permission regressions.
Where do I host secondary observers?	Use another region’s MacXCode node or your existing observability VPC; compare pricing for a small witness instance.
What if logs explode after enabling debug?	Follow the structured logging guidance, and enable debug only during support windows.

Why Mac mini M4 Metal Helps Probe Fidelity

Synthetic checks are useless if the host itself jitters from oversubscription. Bare-metal Mac mini M4 nodes give steady CPU for curl and JSON parsing, predictable NVMe for log appends, and the same Apple Silicon behavior your gateway sees in development. MacXCode's regions across HK / JP / KR / SG / US let you place observers close to user populations while keeping SSH break-glass access documented in the help pages.

Bottom line: treat OpenClaw like any production API. Define SLOs, prove them with probes, and rehearse failures before marketing promises "always on." Add capacity (see pricing) when canaries start flapping weekly.
