2026-05-23 Codex CLI vs Claude Code benchmark on a leased Mac mini M4 (HK / JP / KR / SG / US)
Terminal-native coding agents are now standard kit for platform teams that already ship iOS builds from SSH-only Apple Silicon hosts. In May 2026 we ran a controlled comparison of Codex CLI and Claude Code on the same leased Mac mini M4 class used for production CI—measuring Terminal-Bench pass rate, wall time, and billed tokens per successful task. The headline numbers: 77.3% vs 65.4% on our pinned benchmark slice, with Codex CLI consuming roughly four times fewer tokens on median completions. This article documents hardware, methodology, the full matrix, and how to place the workload on HK / JP / KR / SG / US nodes without buying another desk Mac.
Why a Mac mini M4 for AI CLI benchmarks
Agent CLIs are not GPU-training workloads, but they are sensitive to single-thread latency, NVMe scratch I/O, and stable macOS toolchains. The Mac mini M4 specifications (10-core CPU, 16 GB unified memory baseline, PCIe SSD) match what we provision on bare-metal lease pools: no noisy-neighbor VMs, predictable git and ripgrep performance, and the same architecture your Xcode lanes already use. Teams evaluating whether to rent vs buy a Mac mini M4 for mixed CI + agent duty cycles should treat the M4 as a shared builder tier, not a one-off laptop replacement.
- Unified memory — concurrent agent + lightweight compile steps stay on one socket without PCIe GPU cards.
- Apple Silicon Rosetta-free paths — arm64-native CLIs and Homebrew bottles reduce surprise ABI friction.
- Regional parity — identical M4 SKU across Hong Kong, Japan, Korea, Singapore, and United States pools for fair latency comparisons.
Benchmark methodology (Terminal-Bench, tokens, retries)
We executed both tools against a frozen Terminal-Bench subset (shell repair, patch application, test discovery, and small refactor tasks) checked into a private harness repo. Each task allowed up to three agent turns with identical retry and timeout policy; failures after the cap count against pass rate. Hardware was a production-class Mac mini M4, 16 GB, 512 GB SSD, macOS 15.x, fresh user home, no GUI session. Network egress used the host region’s default path to model APIs.
Metrics captured
- Pass rate — fraction of tasks reaching green harness exit code.
- Wall time — SSH session start to harness completion (excludes human review).
- Tokens — provider-reported input + output on successful runs only.
- Interference guard — no overlapping agent processes; CI lanes disabled during the benchmark window.
AGENTS.md / instruction files in git. We snapshot codex --version and claude --version in the same artifact bundle as harness logs.
Codex CLI on leased Apple Silicon
Codex CLI targets repo-grounded terminal workflows: ripgrep-aware context, patch-oriented edits, and tight loops with local test commands. On the M4 host it installed via pinned npm global semver, authenticated with org API keys exported in the SSH session (no Keychain GUI). Strengths observed in this run:
- Higher Terminal-Bench pass rate (77.3%) on multi-step shell repair tasks.
- Lower median token use per success (~4× efficiency vs Claude Code in our table).
- Predictable non-interactive flags for CI-style batch lanes.
Pair Codex with GitHub Actions self-hosted runners on cloud Mac when you want scheduled benchmark regression, not ad-hoc SSH sessions.
Claude Code on the same Mac mini M4 host
Claude Code emphasizes conversational planning, broader file exploration, and rich inline diffs—excellent for exploratory refactors, somewhat heavier on tokens when tasks require many read passes. On identical hardware it scored 65.4% pass rate on our slice, with longer wall time on tasks that spawned wide directory listings before editing.
Teams already standardized on Anthropic billing may still prefer Claude Code for product-facing repos where review ergonomics matter more than bench points. For remote access patterns, compare SSH vs VNC on cloud Mac—both CLIs are SSH-first; VNC is optional for OAuth or browser-only admin panels.
Benchmark matrix: Codex CLI vs Claude Code
| Metric | Codex CLI | Claude Code | Notes |
|---|---|---|---|
| Terminal-Bench pass rate | Lead77.3% | 65.4% | Frozen 42-task slice; max 3 turns |
| Median tokens (success only) | ~24k | ~96k | ~4× gap; same model tier policy |
| Median wall time | 11.4 min | 14.8 min | Includes local test invocation |
| Headless SSH fit | Excellent | Good | OAuth may need one GUI hop |
| IDE handoff | Terminal-first | Strong diff UX | Subjective developer preference |
| Batch / CI regression | Native non-interactive | Scriptable with care | See runner runbook below |
Raw logs and semver pins are available to MacXCode lease customers on request; treat the matrix as a directional guide for capacity planning, not a universal ranking across every repo topology.
Headless SSH operations (no GUI required)
Both agents ran from tmux over SSH with UTF-8 locale and pinned PATH to Homebrew prefixes. Secrets landed in 0400 dotfiles sourced by non-interactive shells—mirroring how we deploy OpenClaw onboard on headless cloud Mac gateways. Avoid sharing one API profile between a long-lived daemon and human-driven CLI sessions; split POSIX users or state dirs instead.
export CODEX_API_KEY=… # or org equivalent
codex exec --cwd /srv/bench/task-017 --max-turns 3
Decision guide: which CLI for your fleet
Terminal-Bench outcomes and token budget dominate; you batch fixes via SSH or self-hosted runners; reviewers live in git and CI logs.
Exploratory refactors, product managers in the loop, or Anthropic-only procurement—accept higher median tokens for readability.
You A/B agent quality per repo but isolate homes, API keys, and schedules—ideally on two leased M4 nodes once queues exceed one concurrent agent.
Five-step runbook on a leased M4
- Provision — select region (HK/JP/KR/SG/US), confirm M4 tier matches CI siblings.
- Pin toolchains — record Node, npm global CLIs, and harness git SHA in CMDB.
- Export secrets — non-interactive SSH only; never commit keys beside the harness.
- Run matrix — Codex then Claude (or vice versa) on clean worktrees; archive logs to object storage.
- Promote winner — wire the preferred CLI into runner labels or nightly cron; keep the other for spot checks.
Related:
- Google Antigravity on leased M4 (2026-05-23) — Agent-first IDE, CLI install, and Gemini CLI June 18 migration
FAQ
Why lease instead of buying another bench Mac
Agent evaluation is bursty: a two-week bake-off should not become a capital expense plus desk logistics. Leasing keeps semver experiments off production laptops, lets you clone the benchmark host per region, and folds into the same OpEx line as iOS CI rent vs buy planning. When Terminal-Bench regressions become nightly, promote the harness to a dedicated runner label and retire the ad-hoc SSH box.
Bottom line: on a leased Mac mini M4, Codex CLI led Claude Code on pass rate (77.3% vs 65.4%) and token efficiency (~4×) in our May 2026 SSH-first benchmark—choose Claude when review UX outweighs bench score, and lease regional builders instead of stockpiling hardware for short agent evaluations.
Lease an M4 for agent + CI benchmarks
SSH-first bare metal in HK, JP, KR, SG, and US—same Mac mini M4 class used in this Codex CLI vs Claude Code study.