2026-05-23 Codex CLI vs Claude Code benchmark on a leased Mac mini M4 (HK / JP / KR / SG / US)

MacXCode Engineering Team · May 23, 2026 · ~16 min read

Terminal-native coding agents are now standard kit for platform teams that already ship iOS builds from SSH-only Apple Silicon hosts. In May 2026 we ran a controlled comparison of Codex CLI and Claude Code on the same leased Mac mini M4 class used for production CI, measuring Terminal-Bench pass rate, wall time, and billed tokens per successful task. The headline numbers: 77.3% vs 65.4% pass rate on our pinned benchmark slice, with Codex CLI using roughly a quarter of the tokens on median completions (~24k vs ~96k). This article documents hardware, methodology, the full matrix, and how to place the workload on HK / JP / KR / SG / US nodes without buying another desk Mac.

Disclosure: MacXCode is the Mac rental provider referenced in this article. Pricing data is sourced from MacXCode's published rate sheet and Apple's official website.

Why a Mac mini M4 for AI CLI benchmarks

Agent CLIs are not GPU-training workloads, but they are sensitive to single-thread latency, NVMe scratch I/O, and stable macOS toolchains. The Mac mini M4 specifications (10-core CPU, 16 GB unified memory baseline, PCIe SSD) match what we provision on bare-metal lease pools: no noisy-neighbor VMs, predictable git and ripgrep performance, and the same architecture your Xcode lanes already use. Teams evaluating whether to rent vs buy a Mac mini M4 for mixed CI + agent duty cycles should treat the M4 as a shared builder tier, not a one-off laptop replacement.

Unified memory — concurrent agent + lightweight compile steps stay on one socket without PCIe GPU cards.
Apple Silicon Rosetta-free paths — arm64-native CLIs and Homebrew bottles reduce surprise ABI friction.
Regional parity — identical M4 SKU across Hong Kong, Japan, Korea, Singapore, and United States pools for fair latency comparisons.

Benchmark methodology (Terminal-Bench, tokens, retries)

We executed both tools against a frozen Terminal-Bench subset (shell repair, patch application, test discovery, and small refactor tasks) checked into a private harness repo. Each task allowed up to three agent turns with identical retry and timeout policy; failures after the cap count against pass rate. Hardware was a production-class Mac mini M4, 16 GB, 512 GB SSD, macOS 15.x, fresh user home, no GUI session. Network egress used the host region’s default path to model APIs.

Metrics captured

Pass rate — fraction of tasks reaching green harness exit code.
Wall time — SSH session start to harness completion (excludes human review).
Tokens — provider-reported input + output on successful runs only.
Interference guard — no overlapping agent processes; CI lanes disabled during the benchmark window.
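
The pass-rate and token metrics above can be recomputed from a per-task results log. The sketch below assumes a hypothetical tab-separated file with three columns (task ID, harness exit code, provider-reported tokens); that layout is an illustration, not the format our harness emits.

```sh
# Summarize a results log. Assumed (hypothetical) TSV columns:
#   task_id <TAB> harness_exit_code <TAB> billed_tokens
# A task counts as passed when the harness exit code is 0.

pass_rate() {
    # Percentage of rows whose exit code is 0, to one decimal place.
    awk -F '\t' 'NF >= 2 { total++; if ($2 == 0) ok++ }
        END { if (total) printf "%.1f\n", 100 * ok / total; else print "n/a" }' "$1"
}

median_success_tokens() {
    # Median of the token column over successful rows only.
    awk -F '\t' '$2 == 0 { print $3 }' "$1" | sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) { print "n/a"; exit }
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# Example: pass_rate results.tsv; median_success_tokens results.tsv
```

Restricting the median to successful rows mirrors the "successful runs only" token definition; failed rows still count in the pass-rate denominator.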

Reproducibility: Pin CLI semver, model IDs, and AGENTS.md / instruction files in git. We snapshot codex --version and claude --version in the same artifact bundle as harness logs.
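
One way to capture those version strings next to the harness logs is a small helper like the one below. It is a sketch under assumptions: the function name, the versions.txt file name, and the example bundle path are all hypothetical, and it only relies on each tool accepting --version.

```sh
# Write the version string of each named CLI into <bundle_dir>/versions.txt.
# Tools missing from PATH are recorded as "not found" instead of aborting.
snapshot_versions() {
    bundle_dir=$1
    shift
    mkdir -p "$bundle_dir" || return 1
    : > "$bundle_dir/versions.txt"
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            # Keep only the first line so multi-line banners stay compact.
            printf '%s\t%s\n' "$tool" "$("$tool" --version 2>&1 | head -n 1)"
        else
            printf '%s\tnot found\n' "$tool"
        fi >> "$bundle_dir/versions.txt"
    done
}

# Example (hypothetical bundle path):
#   snapshot_versions ./artifacts/run-001 codex claude node npm
```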

Codex CLI on leased Apple Silicon

Codex CLI targets repo-grounded terminal workflows: ripgrep-aware context, patch-oriented edits, and tight loops with local test commands. On the M4 host we installed it as a global npm package pinned to an exact version and authenticated with org API keys exported in the SSH session (no Keychain GUI). Strengths observed in this run:

Higher Terminal-Bench pass rate (77.3%) on multi-step shell repair tasks.
Lower median token use per success (~24k vs ~96k, roughly a quarter of Claude Code's usage in our table).
Predictable non-interactive flags for CI-style batch lanes.

Pair Codex with GitHub Actions self-hosted runners on cloud Mac when you want scheduled benchmark regression, not ad-hoc SSH sessions.

Claude Code on the same Mac mini M4 host

Claude Code emphasizes conversational planning, broader file exploration, and rich inline diffs—excellent for exploratory refactors, somewhat heavier on tokens when tasks require many read passes. On identical hardware it scored 65.4% pass rate on our slice, with longer wall time on tasks that spawned wide directory listings before editing.

Teams already standardized on Anthropic billing may still prefer Claude Code for product-facing repos where review ergonomics matter more than bench points. For remote access patterns, compare SSH vs VNC on cloud Mac—both CLIs are SSH-first; VNC is optional for OAuth or browser-only admin panels.

Benchmark matrix: Codex CLI vs Claude Code

Metric	Codex CLI	Claude Code	Notes
Terminal-Bench pass rate	77.3%	65.4%	Frozen 42-task slice; max 3 turns
Median tokens (success only)	~24k	~96k	~4× gap; same model tier policy
Median wall time	11.4 min	14.8 min	Includes local test invocation
Headless SSH fit	Excellent	Good	OAuth may need one GUI hop
IDE handoff	Terminal-first	Strong diff UX	Subjective developer preference
Batch / CI regression	Native non-interactive	Scriptable with care	See runner runbook below

Raw logs and semver pins are available to MacXCode lease customers on request; treat the matrix as a directional guide for capacity planning, not a universal ranking across every repo topology.
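
For capacity planning, the two headline ratios can be derived directly from the matrix medians. The helper below is only a worked-arithmetic sketch; the function name is hypothetical and the inputs are the rounded table values, so the outputs inherit that rounding.

```sh
# Compare two tools from their table medians.
# Arguments: tokens_a tokens_b minutes_a minutes_b (tool A is the cheaper, faster one)
matrix_ratios() {
    awk -v ta="$1" -v tb="$2" -v ma="$3" -v mb="$4" 'BEGIN {
        printf "token ratio: %.1fx\n", tb / ta
        printf "wall-time delta: %.1f min (%.0f%% of the slower median)\n", mb - ma, 100 * (mb - ma) / mb
    }'
}

# Medians from the matrix: ~24k vs ~96k tokens, 11.4 vs 14.8 minutes.
matrix_ratios 24000 96000 11.4 14.8
# token ratio: 4.0x
# wall-time delta: 3.4 min (23% of the slower median)
```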

Headless SSH operations (no GUI required)

Both agents ran from tmux over SSH with UTF-8 locale and pinned PATH to Homebrew prefixes. Secrets landed in 0400 dotfiles sourced by non-interactive shells—mirroring how we deploy OpenClaw onboard on headless cloud Mac gateways. Avoid sharing one API profile between a long-lived daemon and human-driven CLI sessions; split POSIX users or state dirs instead.

export CODEX_API_KEY=… # or org equivalent

codex exec --cwd /srv/bench/task-017 --max-turns 3
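
The 0400 dotfile pattern described above can be sketched as follows. The helper name, the file path, and the vault variable in the example are hypothetical, and the sketch assumes the secret value contains no single quotes.

```sh
# Create a read-only (0400) secrets file that a non-interactive shell can source.
# Assumes the value contains no single quotes.
write_secret_file() {
    file=$1 name=$2 value=$3
    (
        umask 077            # never group- or world-readable, even briefly
        rm -f "$file"        # a previous 0400 copy cannot be overwritten in place
        printf "export %s='%s'\n" "$name" "$value" > "$file"
    ) && chmod 0400 "$file"
}

# Example (hypothetical path and source of the key):
#   write_secret_file "$HOME/.bench_env" CODEX_API_KEY "$KEY_FROM_VAULT"
#   . "$HOME/.bench_env"     # then start the agent from the same shell
```

Setting the umask inside a subshell keeps the restrictive mask from leaking into the rest of the session.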

Do not run agent benchmarks on the same user account as production Archive lanes without cgroup-style job queues—DerivedData and agent temp trees compete for NVMe bandwidth.
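
macOS has no cgroups, so a simple way to approximate a job queue is to serialize agent runs behind a lock. The sketch below uses an atomic mkdir as the lock; the function name, the lock path, and the job script in the example are hypothetical, and a lock left behind by a killed job has to be cleared by hand.

```sh
# Run a command only while holding an exclusive lock directory.
# mkdir is atomic, so two callers cannot both create the same lock.
# Returns 75 without running the command when the lock is already held.
with_job_lock() {
    lock_dir=$1
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "busy: $lock_dir is held by another job" >&2
        return 75
    fi
    "$@"
    status=$?
    rmdir "$lock_dir"
    return "$status"
}

# Example (hypothetical lock path and job script):
#   with_job_lock /tmp/agent-bench.lock ./run_task.sh task-017
```

A caller that wants queueing rather than fail-fast behavior can loop with a short sleep until the return code is no longer 75.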

Decision guide: which CLI for your fleet

Choose Codex CLI when…

Terminal-Bench outcomes and token budget dominate; you batch fixes via SSH or self-hosted runners; reviewers live in git and CI logs.

Choose Claude Code when…

Exploratory refactors, product managers in the loop, or Anthropic-only procurement drive the choice; accept higher median tokens in exchange for review readability.

Run both when…

You A/B agent quality per repo but isolate homes, API keys, and schedules—ideally on two leased M4 nodes once queues exceed one concurrent agent.

Five-step runbook on a leased M4

Provision — select region (HK/JP/KR/SG/US), confirm M4 tier matches CI siblings.
Pin toolchains — record Node, npm global CLIs, and harness git SHA in CMDB.
Export secrets — non-interactive SSH only; never commit keys beside the harness.
Run matrix — Codex then Claude (or vice versa) on clean worktrees; archive logs to object storage.
Promote winner — wire the preferred CLI into runner labels or nightly cron; keep the other for spot checks.
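
The "Run matrix" step can be sketched as a loop that runs each tool over every task directory and keeps one log per pair. This is an illustration under assumptions: each tool is represented by a hypothetical wrapper script that holds the pinned CLI invocation and takes the task directory as its only argument, and resetting worktrees between runs is left out.

```sh
# Run every tool wrapper against every task directory.
# Writes <log_root>/<tool>/<task>.log plus a results.tsv of tool, task, exit code.
run_matrix() {
    tasks_root=$1 log_root=$2
    shift 2
    for wrapper in "$@"; do
        tool=$(basename "$wrapper")
        mkdir -p "$log_root/$tool"
        for task_dir in "$tasks_root"/*/; do
            [ -d "$task_dir" ] || continue      # skip when the glob matches nothing
            task=$(basename "$task_dir")
            "$wrapper" "$task_dir" > "$log_root/$tool/$task.log" 2>&1
            # $? still holds the wrapper's exit status on the next line.
            printf '%s\t%s\t%s\n' "$tool" "$task" "$?" >> "$log_root/results.tsv"
        done
    done
}

# Example (hypothetical paths and wrappers):
#   run_matrix /srv/bench/tasks ./logs ./wrappers/codex ./wrappers/claude
```

Swapping the wrapper order on alternate runs covers the "or vice versa" case without changing the loop.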

FAQ

Which scored higher on Terminal-Bench in this Mac mini M4 run?

Codex CLI reached 77.3% pass rate on our pinned slice; Claude Code reached 65.4% with identical hardware, harness, and retry caps.

Do I need a GUI on the leased Mac?

Routine agent loops are SSH-only. Plan a one-time GUI or VNC hop only if your auth flow demands browser OAuth—then return to headless exports in launchd or dotfiles.

How large was the token gap?

On median successful tasks, Codex CLI used roughly a quarter of the billed tokens that Claude Code did (~24k vs ~96k), excluding failed runs that later retried to completion.

Can one Mac mini M4 host both CLIs for CI?

Yes—with separate config roots and job serialization. For parallel agent + Archive workloads, add a second leased node instead of overloading unified memory.

Which MacXCode regions match this benchmark?

Hong Kong, Japan, Korea, Singapore, and United States pools use the same bare-metal M4 class documented here—pick the closest region to developers and API egress.

Why lease instead of buying another bench Mac

Agent evaluation is bursty: a two-week bake-off should not become a capital expense plus desk logistics. Leasing keeps semver experiments off production laptops, lets you clone the benchmark host per region, and folds into the same OpEx line as iOS CI rent vs buy planning. When Terminal-Bench regressions become nightly, promote the harness to a dedicated runner label and retire the ad-hoc SSH box.

Bottom line: on a leased Mac mini M4, Codex CLI led Claude Code on pass rate (77.3% vs 65.4%) and token efficiency (~4×) in our May 2026 SSH-first benchmark—choose Claude when review UX outweighs bench score, and lease regional builders instead of stockpiling hardware for short agent evaluations.
