
mlhher/late-cli


Late: High-Leverage AI Agent Orchestration


Every other coding agent floods its own context with edits, retries and implementation details until the model loses the thread. Late delegates all of that to ephemeral subagents — isolated contexts that execute one task and are destroyed. The orchestrator sees only plans and outcomes, never the mess. Single static binary, zero dependencies, any model.


Drop into any project and start building. Get to your first prompt in less than 10 seconds.

brew tap mlhher/late && brew install late
cd your-project
late

Not using Homebrew?

  • Arch Linux: yay -S late-cli-bin
  • Linux / macOS / Windows: Download the latest release binary and place it on your PATH. (macOS manual download: if Gatekeeper blocks the binary, run xattr -d com.apple.quarantine /path/to/late)

Connecting to Cloud Models? Local models served by llama.cpp's llama-server on its default port (:8080) work out of the box, with no configuration required. For cloud providers (DeepSeek, Claude, Gemini, OpenRouter), set the OPENAI_BASE_URL, OPENAI_API_KEY, and OPENAI_MODEL environment variables.
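A minimal Go sketch of how an agent might resolve these three variables, assuming that an unset OPENAI_BASE_URL falls back to a local llama-server at http://localhost:8080/v1. The ProviderConfig type, the resolveConfig function, and the exact fallback URL (including the /v1 path) are hypothetical illustrations, not Late's actual code.

```go
package main

import (
	"fmt"
	"os"
)

// ProviderConfig is a hypothetical struct holding OpenAI-compatible
// endpoint settings; it is not Late's actual type.
type ProviderConfig struct {
	BaseURL string
	APIKey  string
	Model   string
}

// ASSUMPTION: the README only says llama-server on :8080 works without
// configuration; the exact URL and /v1 path here are illustrative.
const defaultLocalBaseURL = "http://localhost:8080/v1"

// resolveConfig reads the three OPENAI_* variables through lookup
// (os.LookupEnv in production, a map in tests) and falls back to a
// local llama-server when OPENAI_BASE_URL is unset or empty.
func resolveConfig(lookup func(string) (string, bool)) ProviderConfig {
	get := func(key string) string {
		v, _ := lookup(key)
		return v
	}
	cfg := ProviderConfig{
		BaseURL: get("OPENAI_BASE_URL"),
		APIKey:  get("OPENAI_API_KEY"),
		Model:   get("OPENAI_MODEL"),
	}
	if cfg.BaseURL == "" {
		cfg.BaseURL = defaultLocalBaseURL
	}
	return cfg
}

func main() {
	cfg := resolveConfig(os.LookupEnv)
	// Print only whether a key is set, never the key itself.
	fmt.Printf("endpoint=%s model=%q key_set=%t\n", cfg.BaseURL, cfg.Model, cfg.APIKey != "")
}
```

Injecting the lookup function keeps the resolution logic testable without touching the real process environment. With a cloud provider you would export all three variables before running late. With none set, the sketch points at the local endpoint.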

[Screenshots: the Late Orchestrator planning a multi-phase implementation and spawning the first subagent; the Lead Architect forming a plan and spawning an atomic subagent for a surgical edit.]

| | Late | Claude Code | OpenCode | The Weekly Clone |
|---|---|---|---|---|
| Workflow | Autonomous Orchestration | Manual toggling | Manual toggling | Blind execution/Manual toggling |
| Implementations | Ephemeral subagents (Context destroyed) | Floods main context window | Floods main context window | Floods main context window |
| KV-Cache | Ruthless KV cache management | Brute-force context dumping | Brute-force context dumping | Brute-force context dumping |
| System Prompt | ~1,000 tokens (Always planning workflow) | 10,000+ tokens | 10,000+ tokens | ~300-1,000+ tokens (No-workflow lobotomy) |
| Dependencies | Zero-dependency static binary | Node.js | Node.js | Python / Node.js |
| Setup required | None (OOTB llama-server support) | Anthropic OAuth / Sign-in | Mandatory JSON tweaking | Flavor of the week JSON/YAML/TOML configs |
| Built For | Builders wanting 10x throughput | Enterprise expense accounts | Tinkering with settings | Chasing GitHub stars |

"The same model feels smarter with Late." — Reddit

"Late-CLI is mindblowing... I'm shocked that the token usage is so minimal, I keep expecting a big bill from DeepSeek's API." — GitHub Discussions

Outperforming Claude Code and Codex for Local LLM Workflows — Agent Native

Built with Late: Late is primarily developed inside Late itself.

Works with Claude, DeepSeek, Qwen, Gemma (including thinking support for Gemma), and any OpenAI-compatible API. See the Quickstart Guide for hybrid model routing, keybindings, MCP setup, Skills and more.


How It Works

Standard coding agents do all their work, whether it's planning, implementing, retrying failed edits, or self-healing, in one shared context window. Every retry, every failed implementation, every repair loop pollutes the context the model reasons from. It degrades. You blame the model. The model is fine.

Late separates concerns. A lean orchestrator (~1,000 token system prompt) reads your codebase, forms a plan, and delegates individual implementation tasks to ephemeral subagents. Each subagent gets a fresh isolated context containing only its one task and nothing else. When it completes, that context is destroyed. The orchestrator only ever sees outcomes.
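The delegation pattern can be sketched in Go as a minimal in-process illustration. This is not Late's implementation. Message, Outcome, Orchestrator, and runSubagent are hypothetical names, and the subagent's "work" is simulated with placeholder retries instead of model calls. What it shows is structural: each subagent builds its own context slice, and only a short Outcome is appended to the orchestrator's context.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is one entry in a model context window.
type Message struct {
	Role    string
	Content string
}

// Outcome is the only thing a subagent hands back to the orchestrator.
type Outcome struct {
	Task    string
	Summary string
}

// runSubagent builds a fresh, isolated context holding only its one task.
// The local slice (including every simulated retry) becomes unreachable
// once the function returns; only the Outcome escapes.
func runSubagent(task string) Outcome {
	ctx := []Message{
		{Role: "system", Content: "You are a subagent. Complete exactly one task."},
		{Role: "user", Content: task},
	}
	// Simulated implementation noise: edits, failed attempts, retries.
	for attempt := 1; attempt <= 3; attempt++ {
		ctx = append(ctx, Message{Role: "assistant", Content: fmt.Sprintf("attempt %d for %q", attempt, task)})
	}
	return Outcome{Task: task, Summary: fmt.Sprintf("done (%d messages discarded)", len(ctx))}
}

// Orchestrator keeps only instructions, plans, and outcomes.
type Orchestrator struct {
	Context []Message
}

// Execute records the plan, then delegates each task to a new subagent.
func (o *Orchestrator) Execute(plan []string) {
	o.Context = append(o.Context, Message{Role: "assistant", Content: "plan: " + strings.Join(plan, "; ")})
	for _, task := range plan {
		out := runSubagent(task)
		o.Context = append(o.Context, Message{Role: "tool", Content: out.Task + ": " + out.Summary})
	}
}

func main() {
	o := &Orchestrator{Context: []Message{{Role: "user", Content: "Add a --verbose flag"}}}
	o.Execute([]string{"parse flag in main.go", "thread flag into logger"})
	for _, m := range o.Context {
		fmt.Printf("[%s] %s\n", m.Role, m.Content)
	}
}
```

Running it prints four orchestrator messages: the instruction, the plan, and two outcomes. The five messages each subagent accumulated never reach the orchestrator.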

Late manages the KV cache and context window carefully, leaving more room for reasoning. The orchestrator's context grows only from what matters: your instructions and the agent's decisions. Everything the subagent did to get there is gone with it. This is why the same model feels sharper in Late. It reasons from signal, not noise.
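A small worked calculation in Go makes the context-growth argument concrete. The token counts are invented for illustration and are not measurements of Late or any other tool. Both designs receive the same base prompt, so the only difference is how implementation work is handled.

```go
package main

import "fmt"

// TaskCost is a hypothetical token breakdown for one implementation task.
type TaskCost struct {
	WorkTokens    int // edits, tool output, retries, repair loops
	OutcomeTokens int // the short result reported back
}

// sharedContext models an agent that keeps every task's work in one window.
func sharedContext(base int, tasks []TaskCost) int {
	total := base
	for _, t := range tasks {
		total += t.WorkTokens + t.OutcomeTokens
	}
	return total
}

// orchestratorContext models delegation: work stays in discarded subagent
// contexts, so only outcomes accumulate in the orchestrator's window.
func orchestratorContext(base int, tasks []TaskCost) int {
	total := base
	for _, t := range tasks {
		total += t.OutcomeTokens
	}
	return total
}

func main() {
	// Illustrative numbers only; not measurements of any tool.
	tasks := []TaskCost{
		{WorkTokens: 12000, OutcomeTokens: 300},
		{WorkTokens: 8000, OutcomeTokens: 200},
		{WorkTokens: 20000, OutcomeTokens: 500},
	}
	base := 1000 // same base prompt for both, to isolate the effect of delegation
	fmt.Println("shared:", sharedContext(base, tasks))          // shared: 42000
	fmt.Println("delegated:", orchestratorContext(base, tasks)) // delegated: 2000
}
```

With these numbers, the shared context ends at 42,000 tokens and the orchestrator's ends at 2,000. The 40,000 tokens of implementation work exist only in subagent contexts that are discarded when each task finishes.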


Features

  • Hybrid Model Routing: Architect the plan with a massive reasoning model (e.g., DeepSeek V4), then spawn subagents to execute it using blazing-fast, cheap local models (e.g., Gemma 4).
  • Exact-Match Diffs: Strict search/replace blocks with autonomous self-healing on mismatch. Edits fail loud. We never silently corrupt your files.
  • Human-in-the-Loop: Read-only commands are auto-approved for velocity. Mutations hard-stop for a [y/N] confirmation. Trust can be granted at Session, Project, or Global scope, and grants decay after a TTL.
  • Stateful Resilience: The Orchestrator maintains continuous session history on disk. Close your terminal, reboot your machine, and pick up exactly where you left off.
  • MCP Integration: Map external Model Context Protocol servers natively into Late over standard I/O.
  • Agent Skills: Drop in reusable sets of instructions and scripts. Zero configuration or boilerplate required.
  • Git Worktree Support: Run independent, parallel agent instances across multiple branches without context bleeding.
  • Gemma 4 Thinking Mode: Standard wrappers just pipe text to an API, which means they can't trigger Gemma's reasoning. Late includes a dedicated flag to inject the exact tokens required to actually make it think.

License

Built to create engineering leverage, not to supply free infrastructure for AI startups.

Free for builders: Use Late freely to write code for any project, including commercial ones. Your output is yours.

Commercial restrictions: You may not monetize Late itself. Wrapping the orchestration engine into a paid service or deploying it as enterprise infrastructure requires a commercial agreement.

Late converts to GPLv2 on February 21, 2030. The full license text is in LICENSE.

About

Orchestrate an entire AI dev team on 5GB VRAM. Ephemeral subagents, exact-match diffs. Single static binary, any model. Zero config, zero context bloat.
