Every other coding agent floods its own context with edits, retries and implementation details until the model loses the thread. Late delegates all of that to ephemeral subagents — isolated contexts that execute one task and are destroyed. The orchestrator sees only plans and outcomes, never the mess. Single static binary, zero dependencies, any model.
Drop into any project and start building. Get to your first prompt in less than 10 seconds.
brew tap mlhher/late && brew install late
cd your-project
lateNot using Homebrew?
- Arch Linux:
yay -S late-cli-bin- Linux / macOS / Windows: Download the latest binary and drop it in your PATH. (macOS manual download: if blocked, run
xattr -d com.apple.quarantine /path/to/late)Connecting to Cloud Models? Local models (llama.cpp on
:8080, the default for llama-server) work out-of-the-box. No configuration required. For cloud providers (DeepSeek, Claude, Gemini, OpenRouter), set yourOPENAI_BASE_URL,OPENAI_API_KEY, andOPENAI_MODELenvironment variables.
Lead Architect forming a plan and spawning an atomic subagent for a surgical edit.
| Late | Claude Code | OpenCode | The Weekly Clone | |
|---|---|---|---|---|
| Workflow | Autonomous Orchestration | Manual toggling | Manual toggling | Blind execution/Manual toggling |
| Implementations | Ephemeral subagents (Context destroyed) | Floods main context window | Floods main context window | Floods main context window |
| KV-Cache | Ruthless KV cache management | Brute-force context dumping | Brute-force context dumping | Brute-force context dumping |
| System Prompt | ~1,000 tokens (Always planning workflow) | 10,000+ tokens | 10,000+ tokens | ~300-1000+ tokens (No-workflow lobotomy) |
| Dependencies | Zero-dependency static binary | Node.js | Node.js | Python / Node.js |
| Setup required | None (OOTB llama-server support) |
Anthropic OAuth / Sign-in | Mandatory JSON tweaking | Flavor of the week JSON/YAML/TOML configs |
| Built For | Builders wanting 10x throughput | Enterprise expense accounts | Tinkering with settings | Chasing GitHub stars |
"The same model feels smarter with Late." — Reddit
"Late-CLI is mindblowing... I'm shocked that the token usage is so minimal, I keep expecting a big bill from DeepSeek's API." — GitHub Discussions
Outperforming Claude Code and Codex for Local LLM Workflows — Agent Native
Built with Late: Late is primarily developed inside Late itself.
Works with Claude, DeepSeek, Qwen, Gemma (including thinking support for Gemma), and any OpenAI-compatible API. See the Quickstart Guide for hybrid model routing, keybindings, MCP setup, Skills and more.
Standard coding agents do all their work, whether it's planning, implementing, retrying failed edits, or self-healing, in one shared context window. Every retry, every failed implementation, every repair loop pollutes the context the model reasons from. It degrades. You blame the model. The model is fine.
Late separates concerns. A lean orchestrator (~1,000 token system prompt) reads your codebase, forms a plan, and delegates individual implementation tasks to ephemeral subagents. Each subagent gets a fresh isolated context containing only its one task and nothing else. When it completes, that context is destroyed. The orchestrator only ever sees outcomes.
Late manages the KV cache and context window carefully, leaving more room for reasoning. The orchestrator's context grows only from what matters: your instructions and the agent's decisions. Everything the subagent did to get there is gone with it. This is why the same model feels sharper in Late. It reasons from signal, not noise.
- Hybrid Model Routing: Architect the plan with a massive reasoning model (e.g., DeepSeek V4), then spawn subagents to execute it using blazing-fast, cheap local models (e.g., Gemma 4).
- Exact-Match Diffs: Strict
search/replaceblocks with autonomous self-healing on mismatch. Edits fail loud. We never silently corrupt your files. - Human-in-the-Loop: Read-only commands are auto-approved for velocity. Mutations hard-stop for
[y/N]. Features Session, Project, and Global trust scopes with TTL decay. - Stateful Resilience: The Orchestrator maintains continuous session history on disk. Close your terminal, reboot your machine, and pick up exactly where you left off.
- MCP Integration: Natively map external Model Context Protocol servers directly into Late via standard I/O.
- Agent Skills: Drop in reusable sets of instructions and scripts. Zero configuration or boilerplate required.
- Git Worktree Support: Run independent, parallel agent instances across multiple branches without context bleeding.
- Gemma 4 Thinking Mode: Standard wrappers just pipe text to an API, which means they can't trigger Gemma's reasoning. Late includes a dedicated flag to inject the exact tokens required to actually make it think.
Built to create engineering leverage, not to supply free infrastructure for AI startups.
Free for builders: Use Late freely to write code for any project, including commercial ones. Your output is yours.
Commercial restrictions: You may not monetize Late itself. Wrapping the orchestration engine into a paid service or deploying it as enterprise infrastructure requires a commercial agreement.
Late converts to GPLv2 on February 21, 2030. Full license in LICENSE.