2026 OpenClaw Health Probes & Readiness Checks on Production Leased Cloud Mac
Running OpenClaw 24/7 on a leased Mac mini M4 in Hong Kong, Japan, Korea, Singapore, or the United States means your gateway on 127.0.0.1:18789 becomes part of production infrastructure. Kubernetes users already speak liveness vs readiness; macOS + launchd shops need the same discipline without a kubelet. This 2026 guide covers which signals to scrape, a probe comparison table, a six-step runbook, and alert thresholds that avoid both silent failure and pager fatigue. Pair it with the guides on gateway troubleshooting, structured logging, nginx ingress for webhooks, and Tailscale mesh access when failures span network and process layers.
Why “Process Running” Is Not a Health Check
launchd can show the job as running, with a last exit status of 0, while the gateway is wedged: a stale TLS context, flapping DNS for the model provider, or partial config writes under ~/.openclaw. Good probes exercise the same code paths that customer traffic hits (HTTP handlers, auth middleware, and optional downstream model pings) without hammering paid APIs.
- Liveness answers “should we restart the gateway?” — cheap, every 60 seconds.
- Readiness answers “should the load balancer send traffic?” — stricter, may include dependency checks.
- Canary sends a synthetic user message every 15 minutes to catch subtle regressions; budget tokens explicitly.
Do not expose port 18789 on the public internet: terminate TLS on nginx or keep checks inside your tailnet, restricted by Tailscale ACLs.
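The liveness/readiness split can be sketched as two small shell functions. This is a minimal illustration, not OpenClaw's own tooling: the `/health` path varies per build, and `HEALTH_URL`, `STATE_DIR`, and the choice of "state directory is readable" as the readiness dependency are assumptions made for the example. The canary is left out on purpose because it spends tokens.

```shell
# Sketch only: HEALTH_URL and STATE_DIR are illustrative defaults, not OpenClaw settings.
HEALTH_URL="${HEALTH_URL:-http://127.0.0.1:18789/health}"
STATE_DIR="${STATE_DIR:-$HOME/.openclaw}"

# Liveness: is the HTTP stack answering at all? Cheap enough to run every 60 s.
# --fail turns HTTP errors (4xx/5xx) into a non-zero exit code.
liveness() {
  curl -fsS --connect-timeout 3 --max-time 3 -o /dev/null "$HEALTH_URL"
}

# Readiness: liveness plus a local dependency the gateway needs before it
# should receive traffic (assumed here: a readable state directory).
readiness() {
  liveness || return 1
  [ -d "$STATE_DIR" ] && [ -r "$STATE_DIR" ] || return 1
  return 0
}
```

A restart decision would key off `liveness`, while a load balancer or upstream toggle would key off `readiness`, so a missing dependency drains traffic without triggering a restart loop.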
Signals Operators Should Graph Before On-Call Week
Minimum viable dashboards for MacXCode customers running agents in production:
- Request rate + p95 latency from nginx `$request_time` when a reverse proxy fronts the gateway.
- Error ratio: count of `5xx` responses divided by total; alert when > 2% for 5 minutes after excluding known maintenance windows.
- CPU sustained > 85% for 10 minutes: often precedes thermal throttling on small instances; the M4 rarely hits thermal limits, but bursty embedding workloads can spike CPU.
- Disk free < 12 GB on the root APFS volume: log rotation under `~/.openclaw/logs` stalls when the filesystem is tight.
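As a rough illustration of the error-ratio signal, the sketch below computes the 5xx percentage from access-log lines on stdin. It assumes nginx's default `combined` log format, where the status code is the ninth whitespace-separated field; a custom `log_format` needs a different field index. The function names are hypothetical, and selecting the last five minutes of log lines is left to the caller.

```shell
# Print the percentage of 5xx responses among the log lines on stdin.
# Assumes the default "combined" format: $status is field 9.
error_ratio_pct() {
  awk '
    { total++ }
    $9 ~ /^5[0-9][0-9]$/ { errors++ }
    END {
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", (errors / total) * 100
    }
  '
}

# Exit 0 when the 5xx ratio on stdin is strictly above the threshold (in percent).
ratio_exceeds() {
  threshold="$1"
  pct="$(error_ratio_pct)"
  awk -v p="$pct" -v t="$threshold" 'BEGIN { exit !((p + 0) > (t + 0)) }'
}
```

For example, `tail -n 500 access.log | ratio_exceeds 2` would return success only when more than 2% of those lines are 5xx responses; the log path here is a placeholder.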
Probe Types: What Each Proves
| Probe | Proves | Cost / risk |
|---|---|---|
| TCP connect to 127.0.0.1:18789 | Socket accept loop alive | Low signal; misses auth failures |
| HTTP GET /health (path per build) | HTTP stack + config load | Preferred baseline liveness |
| Authenticated synthetic chat | Model routing + credentials | Consumes tokens; run as canary |
| Disk inode + free space | Log rotation health | Cheap host-level guardrail |
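The host-level disk guardrail in the last row might look like the following sketch. It covers free space only; inode counts are omitted because the `df -i` column layout differs between macOS and Linux. The function names are hypothetical, and the 12 GB default simply mirrors the dashboard threshold mentioned earlier.

```shell
# Print free space, in whole GB, for the filesystem that holds the given path.
# df -Pk forces POSIX single-line output in 1024-byte blocks; column 4 is "Available".
# Note: a filesystem name containing spaces would shift the columns.
disk_free_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Exit 0 when at least min_gb is free on the filesystem holding path.
disk_ok() {
  path="${1:-/}"
  min_gb="${2:-12}"
  free="$(disk_free_gb "$path")"
  [ -n "$free" ] && [ "$free" -ge "$min_gb" ]
}
```

Calling `disk_ok / 12 || echo "disk low"` from the same scheduled job as the HTTP probe keeps the guardrail cheap, since it never touches the gateway.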
Six-Step Runbook: From Nothing to PagerDuty-Ready
- Baseline: capture `openclaw gateway status` output after a clean boot; store it in git.
- Author probe script: curl with `--fail` and a 3-second connect timeout; exit non-zero on failure.
- launchd plist: `StartInterval` 60; `ThrottleInterval` to avoid storms; log to a unified file.
- Correlation IDs: echo an ISO8601 timestamp per check into logs for cross-search with nginx.
- Wire alerts: three consecutive failures page; a single failure posts to Slack only.
- Game day: quarterly drill; kill the gateway intentionally and measure MTTR against a 15-minute SLO.
curl -fsS --max-time 3 http://127.0.0.1:18789/health || exit 1
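Steps two to four of the runbook could be combined roughly as follows. This is a sketch under stated assumptions: `PROBE_URL`, `PROBE_LOG`, the `com.example.openclaw-probe` label, the script path, and the log paths are all placeholders rather than OpenClaw defaults, and the `ThrottleInterval` value of 30 is an arbitrary illustration. The plist keys themselves (`StartInterval`, `ThrottleInterval`, `StandardOutPath`, `StandardErrorPath`) are standard launchd keys.

```shell
# Placeholders, not OpenClaw defaults.
PROBE_URL="${PROBE_URL:-http://127.0.0.1:18789/health}"
PROBE_LOG="${PROBE_LOG:-/tmp/openclaw-probe.log}"

# Run one check and append a UTC ISO 8601 line so results can be
# cross-searched against nginx log timestamps.
probe_once() {
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  if curl -fsS --connect-timeout 3 --max-time 3 -o /dev/null "$PROBE_URL" 2>/dev/null; then
    echo "$ts probe=health status=ok" >> "$PROBE_LOG"
    return 0
  fi
  echo "$ts probe=health status=fail" >> "$PROBE_LOG"
  return 1
}

# Write a launchd job definition to the path in $1. The job runs a
# (hypothetical) probe script every 60 seconds and logs to one file.
write_probe_plist() {
  cat > "$1" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.example.openclaw-probe</string>
  <key>ProgramArguments</key>
  <array><string>/usr/local/bin/openclaw-probe.sh</string></array>
  <key>StartInterval</key><integer>60</integer>
  <key>ThrottleInterval</key><integer>30</integer>
  <key>StandardOutPath</key><string>/tmp/openclaw-probe.out</string>
  <key>StandardErrorPath</key><string>/tmp/openclaw-probe.out</string>
</dict>
</plist>
EOF
}
```

On the Mac itself, running `plutil -lint` against the generated file is a cheap way to catch XML mistakes before loading the job with `launchctl`.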
How Probes Compose with Nginx and Tailscale
When nginx terminates TLS, run liveness against the internal URL to isolate edge misconfig from gateway bugs. For tailnet-only deployments, run synthetics from a tagged probe device in Tailscale so ACL changes do not silently disable monitors.
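One way to express that isolation logic is to run both probes and classify the pair of exit codes. The function below is a hypothetical helper for illustration; it assumes the caller has already run an internal probe against the loopback URL and an edge probe through nginx or the tailnet.

```shell
# Map two probe exit codes to the layer that most likely failed.
#   $1 = exit code of the internal probe (loopback URL)
#   $2 = exit code of the edge probe (through nginx or the tailnet)
classify_failure() {
  internal_rc="$1"
  edge_rc="$2"
  if [ "$internal_rc" -ne 0 ]; then
    echo "gateway"   # the process itself is unhealthy; the edge result is moot
  elif [ "$edge_rc" -ne 0 ]; then
    echo "edge"      # gateway answers locally, so suspect nginx, TLS, DNS, or ACLs
  else
    echo "ok"
  fi
}
```

Including the classification in the alert text saves the first few minutes of triage, because the responder knows whether to start with the gateway logs or the proxy configuration.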
Alerting Thresholds That Avoid Noise
| Condition | Suggested window | Severity |
|---|---|---|
| 3 consecutive probe failures | ~3 minutes if interval 60s | Page on-call |
| p95 latency > 800 ms internal hop | 10 minutes sustained | Warning ticket |
| Canary LLM failure | 1 failure | Slack + auto-open bridge issue |
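The first row of the table (Slack on a single failure, page on three in a row) can be sketched with a small counter persisted between runs. `STATE_FILE` and `record_result` are hypothetical names; actually delivering the Slack message or the page, and deduplicating repeated pages, is left to whatever alerting tool is in use.

```shell
# STATE_FILE stores the current run of consecutive failures (placeholder path).
STATE_FILE="${STATE_FILE:-/tmp/openclaw-probe.failcount}"

# record_result ok|fail
# Updates the counter and prints the action to take: none, slack, or page.
record_result() {
  count="$(cat "$STATE_FILE" 2>/dev/null)"
  # Treat a missing, empty, or corrupt state file as zero failures.
  case "$count" in
    ''|*[!0-9]*) count=0 ;;
  esac
  case "$1" in
    ok)   count=0 ;;
    fail) count=$((count + 1)) ;;
    *)    echo "usage: record_result ok|fail" >&2; return 2 ;;
  esac
  echo "$count" > "$STATE_FILE"
  if [ "$count" -ge 3 ]; then
    echo "page"
  elif [ "$count" -ge 1 ]; then
    echo "slack"
  else
    echo "none"
  fi
}
```

Because any success resets the counter, a single flaky check never pages; with a 60-second interval the page fires roughly three minutes into a real outage, matching the suggested window.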
After probes stay green, harden how API keys reach the daemon—see OpenClaw environment variables & secrets on headless leased Mac (2026-04-15) for launchd-safe precedence, OPENCLAW_STATE_DIR splits, and rotation without dropping nginx ingress.
FAQ: Probes on macOS Cloud Macs
| Question | Answer |
|---|---|
| Should probes run as root? | No—use the same service user that owns ~/.openclaw to catch permission regressions. |
| Where do I host secondary observers? | Use another region’s MacXCode node or your existing observability VPC; compare pricing for a small witness instance. |
| What if logs explode after enabling debug? | Follow the structured logging guidance: enable debug only during support windows. |
Why Mac mini M4 Metal Helps Probe Fidelity
Synthetic checks are useless if the host itself jitters from oversubscription. Bare-metal Mac mini M4 nodes give steady CPU for curl+JSON parsing, predictable NVMe for log append, and the same Apple Silicon behavior your gateway sees in development. MacXCode’s regions across HK / JP / KR / SG / US let you place observers close to user populations while keeping SSH break-glass access documented in help.
Bottom line: treat OpenClaw like any production API—define SLOs, prove them with probes, and rehearse failures before marketing promises “always on.” Scale capacity via pricing when canaries start flapping weekly.