Have your coding agent work in a different timezone
I stopped watching my coding agent work. I used to sit there like it was a screensaver — leaning forward every time Claude Code asked "Can I run this command?" and spamming yes-yes-yes through a gauntlet of safety prompts. Whack-a-mole for grown-ups. It felt productive because the terminal was moving. It wasn't. The bottleneck was me, not the model. So I built a pipeline that doesn't need me in the loop at all — and now my agent works the night shift while I review pull requests in the morning.
// the problem with babysitting
Claude Code has its own rhythm. It writes a file, pauses, asks for approval. Runs a command, pauses, asks for approval. Each prompt is individually reasonable — you don't want an agent deleting your home directory — but the cumulative cost is brutal. You can't walk away for ten minutes without coming back to a frozen session waiting on a yes/no. So you sit there. You babysit. Which completely defeats the point of having an AI do the work.
The answer isn't to disable the guardrails. It's to make them autonomous too.
// the pi agent harness
I've been building on pi agent for a while now — the terminal-based coding agent by Mario Zechner that's aggressively extensible via TypeScript extensions. It supports 15+ providers, hundreds of models, and a session model that gives you full replay and branch-off. Everything I'm about to describe is built as pi extensions, running in tmux, completely detached from my own session.
The core insight: pi's extension API gives you hooks into tool calls, commands, and the session lifecycle. That means you can replace the default approval flow, wire up multi-step pipelines, and manage git branches — all in type-safe TypeScript that runs as a first-class part of the agent. You're not scripting around the tool; you're extending it from the inside.
// the pipeline: tickets, features, and the queue
The workflow starts with three planning commands, then one execution command, then one that ties it all together:
# Plan a single, focused task
/ticket-plan "add pull-to-refresh to the feed view"
# Decompose a feature into ordered steps
/feature-plan "implement offline caching for the feed endpoint"
# Show the queue of everything in work/
/work-queue list
/ticket-plan takes a natural language description and produces a single markdown file in work/ — a compact spec with intent, files to modify, verification checks, and a complexity estimate. It reads the codebase to scope the change, biases toward "simple", and rejects anything that should be a feature instead.
/feature-plan does the interesting thing. It reads the codebase, then calls write_feature_plan which produces two things: an umbrella file (work/save-routes.md) that describes the whole feature and links each step, and a series of step stubs (work/save-routes-01-auth-client.md, work/save-routes-02-cutover-views.md, …). Each stub includes a per-step complexity estimate and model assignment, but intentionally stays thin — just title, intent, and scope notes. You expand them with /plan when you're ready to execute that step. This prevents the kind of speculative over-documentation that goes stale before you ship.
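To make the file layout concrete, here is a minimal sketch of how the umbrella file and the numbered step stubs could be named. It only mirrors the example paths above; `planFeatureFiles` and `slugify` are hypothetical helpers for illustration, not part of pi or of write_feature_plan.

```typescript
// Sketch only: file layout for a feature plan, following the example paths
// in the text. planFeatureFiles and slugify are hypothetical helpers.

interface PlannedFile {
  path: string;
  kind: "umbrella" | "step";
}

// Turn a free-form step title into a lowercase, dash-separated slug.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// One umbrella file plus one numbered stub per step, in order.
function planFeatureFiles(prefix: string, stepTitles: string[]): PlannedFile[] {
  const files: PlannedFile[] = [{ path: `work/${prefix}.md`, kind: "umbrella" }];
  stepTitles.forEach((title, index) => {
    // Two-digit ordinal so the stubs sort in execution order.
    const ordinal = index + 1 < 10 ? `0${index + 1}` : String(index + 1);
    files.push({ path: `work/${prefix}-${ordinal}-${slugify(title)}.md`, kind: "step" });
  });
  return files;
}

const plannedFiles = planFeatureFiles("save-routes", ["Auth client", "Cutover views"]);
```

The two-digit ordinal is what lets a later command pick up the steps "in order" with a plain filename sort.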
Then the execution layer takes over:
# Run every step of a feature, auto-merging each, then open one PR
/work-feature save-routes
# Drain the entire work/ folder sequentially into PRs
/work-queue main
/work-feature is where the magic happens. Given a prefix like save-routes, it:
- Creates a feature branch from main
- For each step file in order, cuts a step branch from the feature branch
- Runs the full pipeline: plan → review → implement → verify → summarise
- On success, merges the step branch back into the feature branch (fast-forward) and deletes it
- When all steps are merged, pushes the feature branch and opens a PR
Each step ends on a green build. Each merge carries the verified state forward. If any step fails, the feature branch is left in place with the steps that already merged and passed, and the failed step branch stays available for inspection. No half-finished state, no orphan branches, no wondering where it stopped.
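The branch choreography can be sketched as a function that lists the commands one run would issue. This is an illustration under assumptions, not the actual extension: the branch names (`feature/<prefix>`, `step/<slug>`), the use of the GitHub CLI to open the PR, and the `planFeatureRun` helper are all hypothetical.

```typescript
// Sketch only: the git commands a /work-feature run might issue.
// Branch naming and the use of `gh` are assumptions for illustration.

interface StepOutcome {
  slug: string;
  ok: boolean; // did the step's pipeline end on a green build?
}

function planFeatureRun(prefix: string, base: string, steps: StepOutcome[]): string[] {
  const feature = `feature/${prefix}`;
  const commands: string[] = [`git switch -c ${feature} ${base}`];

  for (const step of steps) {
    const stepBranch = `step/${step.slug}`;
    commands.push(`git switch -c ${stepBranch} ${feature}`);
    commands.push(`# pipeline: plan, review, implement, verify, summarise (${step.slug})`);

    if (!step.ok) {
      // Failure: stop here. Both branches stay in place for inspection
      // and nothing is pushed.
      return commands;
    }

    // Success: fast-forward the feature branch and drop the step branch.
    commands.push(`git switch ${feature}`);
    commands.push(`git merge --ff-only ${stepBranch}`);
    commands.push(`git branch -d ${stepBranch}`);
  }

  // All steps merged: publish the feature branch and open one PR.
  commands.push(`git push -u origin ${feature}`);
  commands.push(`gh pr create --base ${base} --head ${feature} --fill`);
  return commands;
}
```

Because each step branch is cut from the current feature branch and nothing else commits to it in between, `--ff-only` should always succeed; if it ever does not, that is a signal something touched the feature branch out of band.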
/work-queue takes this further. It enumerates everything in work/ — features and standalone tickets — and processes them sequentially, returning to the base branch between each one. Every item becomes its own pull request. You kick it off and come back to a stack of PRs, each with a full summary, token usage, and cost breakdown.
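A rough sketch of the enumeration step, assuming a feature is recognised by an umbrella file plus `<prefix>-NN-...` step files and that everything else is a standalone ticket. The detection rule, the alphabetical ordering, and the `buildQueue` name are assumptions for illustration; the real command may decide differently.

```typescript
// Sketch only: group the files in work/ into queue items.
// The detection rule and ordering are assumptions, not the real /work-queue.

type QueueItem =
  | { kind: "feature"; prefix: string; steps: string[] }
  | { kind: "ticket"; file: string };

function buildQueue(fileNames: string[]): QueueItem[] {
  const stepPattern = /^(.+)-\d{2}-.+\.md$/;
  const present: { [file: string]: boolean } = {};
  for (const file of fileNames) present[file] = true;

  // Returns the feature prefix if `file` is a step of a feature whose
  // umbrella file exists; otherwise null (so "fix-ios-17-crash.md" is a ticket).
  const featureOf = (file: string): string | null => {
    const match = stepPattern.exec(file);
    return match && present[`${match[1]}.md`] ? match[1] : null;
  };

  const queue: QueueItem[] = [];
  for (const file of fileNames.slice().sort()) {
    if (featureOf(file) !== null) continue; // steps travel with their feature
    const prefix = file.replace(/\.md$/, "");
    const steps = fileNames.filter((f) => featureOf(f) === prefix).sort();
    if (steps.length > 0) {
      queue.push({ kind: "feature", prefix, steps });
    } else {
      queue.push({ kind: "ticket", file });
    }
  }
  return queue;
}

const exampleQueue = buildQueue([
  "save-routes.md",
  "save-routes-02-cutover-views.md",
  "save-routes-01-auth-client.md",
  "pull-to-refresh.md",
  "fix-ios-17-crash.md",
]);
```

Each item in the resulting list maps to one pull request: a ticket runs the pipeline once, a feature runs it once per step.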
// the judge: autonomous security
The reason you can't leave Claude Code unattended is its approval model. Pi has a built-in budget-model approval system, but it doesn't work with opencode-go's API, so everything falls back to "ask the human." For an autonomous pipeline, that's a dead end.
So I wrote the judge — a pi extension that hooks into tool_call events and replaces the default approval flow entirely. It's a two-tier system:
- Static rules: in-repo file edits, builds, git operations, and any paths the project explicitly allows (via .pi/judge.json) pass through untouched. No model call, no latency.
- Dynamic judgement: anything that looks risky (matching built-in or project-specific patterns like rm -rf, sudo, git push --force, paths that leave the repo) gets sent to the judge model, Deepseek V4 Flash, via opencode-go's API. The model gets the full context: repo root, proposed action, and project-specific instructions. It returns APPROVE or DENY with a one-line reason.
The configuration lives in .pi/judge.json — you can declare extra paths the agent is allowed to touch outside the repo, deny patterns for project-specific secrets, and guidance text for the judge model. With no config file, the judge is a no-op. It's safe to load globally.
The effect is that the agent can autonomously edit files, run builds, commit, and branch — but the moment it tries something genuinely dangerous, a model reasons about whether to allow it. The pipeline doesn't stall. The guardrails stay on. The human stays asleep.
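The two tiers can be sketched roughly as follows. This is not the judge's actual code: the `JudgeConfig` field names, the pattern list, and the helper names are assumptions for illustration, paths are assumed to be absolute and already normalised, and the model call itself is left out. Only the static check and the parsing of the model's reply are shown.

```typescript
// Sketch only: static tier plus verdict parsing for a judge-style gate.
// Field names below are a guessed shape for .pi/judge.json, not the real schema.

interface JudgeConfig {
  allowPaths: string[];   // extra paths outside the repo the agent may touch
  denyPatterns: string[]; // project-specific regex sources that force a review
  guidance: string;       // free text handed to the judge model
}

interface Verdict {
  decision: "APPROVE" | "DENY";
  reason: string;
}

const BUILT_IN_RISKY: RegExp[] = [/\brm\s+-rf\b/, /\bsudo\b/, /\bgit\s+push\s+--force\b/];

function isInside(path: string, root: string): boolean {
  const base = root.endsWith("/") ? root : root + "/";
  return path === root || path.startsWith(base);
}

// Tier 1: returns true when the action should go to the judge model.
function needsJudgement(
  command: string,
  touchedPaths: string[],
  repoRoot: string,
  config: JudgeConfig | null,
): boolean {
  if (config === null) return false; // no config file: the judge is a no-op
  const patterns = BUILT_IN_RISKY.concat(config.denyPatterns.map((source) => new RegExp(source)));
  if (patterns.some((pattern) => pattern.test(command))) return true;
  const allowedRoots = [repoRoot].concat(config.allowPaths);
  return touchedPaths.some((path) => !allowedRoots.some((root) => isInside(path, root)));
}

// Tier 2 (reply handling): anything that is not a clear APPROVE is a DENY.
function parseVerdict(reply: string): Verdict {
  const match = /^\s*(APPROVE|DENY)\b[\s:,-]*(.*)$/.exec(reply.split("\n")[0]);
  if (!match) return { decision: "DENY", reason: "unparseable judge reply" };
  return { decision: match[1] === "APPROVE" ? "APPROVE" : "DENY", reason: match[2].trim() };
}
```

Failing closed on an unparseable reply is a deliberate choice in this sketch: for an unattended run, a wrongly denied command costs a retry, while a wrongly approved one can cost a lot more.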
// complexity-based model routing
Every ticket and every feature step carries a complexity estimate: simple, medium, or complex. This isn't a suggestion — it determines which model runs each phase of the pipeline:
// per-tier model assignment for the /work pipeline
simple → all phases on Deepseek V4 Flash ($0.14/m input)
medium → plan + implement on V4 Pro ($1.74/m), review on Flash
complex → plan + review + implement on V4 Pro, verify + summarise on Flash
// verify and summarise are always on Flash — they're mechanical
Bias is set to LOW: default to "simple" and only escalate when Flash will genuinely struggle. The pipeline runs on a throwaway branch that isn't merged without approval. A cheap run that falls short is trivial to ditch and re-run with stronger models. A wrongly-expensive run just wastes time and money.
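The tier table above boils down to a small lookup. A minimal sketch, where the type names and the `modelFor` and `escalate` helpers are hypothetical and "flash" and "pro" stand in for the real model identifiers:

```typescript
// Sketch only: the per-tier routing table as a lookup.
// Names are illustrative; "flash"/"pro" stand in for real model ids.

type Complexity = "simple" | "medium" | "complex";
type Phase = "plan" | "review" | "implement" | "verify" | "summarise";
type Model = "flash" | "pro";

// Phases that run on the stronger model; every other phase runs on Flash.
const PRO_PHASES: Record<Complexity, Phase[]> = {
  simple: [],
  medium: ["plan", "implement"],
  complex: ["plan", "review", "implement"],
};

function modelFor(complexity: Complexity, phase: Phase): Model {
  return PRO_PHASES[complexity].indexOf(phase) >= 0 ? "pro" : "flash";
}

// After a cheap run falls short, re-run one tier up.
function escalate(complexity: Complexity): Complexity {
  return complexity === "simple" ? "medium" : "complex";
}
```

Listing only the phases that get the stronger model keeps the default cheap: a phase that is not named anywhere, such as verify or summarise, can never end up on Pro by accident.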
In practice, 95% of coding tasks are "simple". Flash handles them. Models like Deepseek V4 Flash (and its open-source weights) are shockingly capable for code — they understand context, follow instructions, and produce correct implementations for the vast majority of what a working developer actually needs day-to-day: tweak a view, add an endpoint, write a test, refactor a method.
// the PRs come with spreadsheets
When the pipeline finishes and opens a pull request, the PR body doesn't just say "done." It contains the summarise output from every stage — what was implemented, key decisions, files changed, verification results — and a full cost breakdown:
## Token Usage
| Phase | Model | Input | Output | Total | Cost |
|-------------|---------------------------|---------:|---------:|---------:|---------:|
| plan | opencode-go/deepseek-v4-flash | 12,847 | 1,234 | 14,081 | $0.0032 |
| review | opencode-go/deepseek-v4-flash | 8,912 | 892 | 9,804 | $0.0021 |
| implement | opencode-go/deepseek-v4-flash | 24,563 | 5,678 | 30,241 | $0.0068 |
| verify | opencode-go/deepseek-v4-flash | 6,701 | 1,203 | 7,904 | $0.0017 |
| summarise | opencode-go/deepseek-v4-flash | 9,445 | 2,101 | 11,546 | $0.0028 |
| Total | | 62,468 | 11,108 | 73,576 | $0.0166 |
Every PR becomes a searchable record of what was done, what it cost, and how it was verified. The cost data comes from pi's session files — the orchestrator parses the JSONL usage entries, maps them to known pricing tables (Deepseek, Kimi, GLM, Qwen, MiniMax — whatever model you used), and produces a per-phase table. You can see exactly which phase burned the most tokens and decide whether to swap models for the next iteration.
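As an illustration of that parsing step, here is a sketch that turns JSONL usage lines into per-phase rows. The record shape (`type`, `phase`, `model`, `input`, `output`) is an assumption, since pi's real session format is not shown here, and the pricing table is passed in by the caller rather than hard-coded.

```typescript
// Sketch only: aggregate JSONL usage records into per-phase cost rows.
// The record fields are a guessed shape, not pi's actual session schema.

interface UsageEntry { phase: string; model: string; input: number; output: number }
interface Price { inputPerMillion: number; outputPerMillion: number }
interface PhaseRow extends UsageEntry { total: number; cost: number }

function parseUsage(jsonl: string): UsageEntry[] {
  const entries: UsageEntry[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const record = JSON.parse(line);
    if (record && record.type === "usage") {
      entries.push({
        phase: String(record.phase),
        model: String(record.model),
        input: Number(record.input),
        output: Number(record.output),
      });
    }
  }
  return entries;
}

function costTable(entries: UsageEntry[], prices: { [model: string]: Price }): PhaseRow[] {
  const rows: PhaseRow[] = [];
  for (const entry of entries) {
    const price = prices[entry.model];
    if (!price) throw new Error(`no pricing for model ${entry.model}`);
    const cost =
      (entry.input * price.inputPerMillion + entry.output * price.outputPerMillion) / 1e6;

    // One row per (phase, model) pair, in first-seen order.
    let row: PhaseRow | undefined = rows.filter(
      (r) => r.phase === entry.phase && r.model === entry.model,
    )[0];
    if (!row) {
      row = { phase: entry.phase, model: entry.model, input: 0, output: 0, total: 0, cost: 0 };
      rows.push(row);
    }
    row.input += entry.input;
    row.output += entry.output;
    row.total += entry.input + entry.output;
    row.cost += cost;
  }
  return rows;
}
```

Throwing on an unknown model, rather than silently pricing it at zero, keeps the cost column honest when a new model shows up in a session before its prices are in the table.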
// the case study: knoop's watchOS feature, $0.44
Knoop is a side project — a Dutch cycling knooppunten app that puts your next junction number on your watch. The watchOS navigation feature needed: a route-following interface with compass-like direction cues, automatic stage advancement on arrival, complication updates, phone-to-watch state sync over WatchConnectivity, and a glanceable next-node display that works without internet. Five distinct engineering challenges, each touching different layers of a shared architecture.
I wrote the spec as a feature umbrella and five step stubs. Kicked off /work-feature before bed. In the morning, there was a PR — five merged steps, a working watchOS feature, and a total cost of $0.44.
Forty-four cents. For a multi-file, multi-platform feature with concurrency, state sync, and a UI that needed to be right because you glance at it while riding a bike.
The steps were all rated "simple" or "medium". The heavy architectural reasoning (how to model the shared state, where to put the WatchConnectivity bridge, how to handle the handoff between phone and watch) happened in the plan phase on V4 Pro for a couple of the steps. The implementation — writing the actual SwiftUI views, the WCSession delegate, the complication timeline provider — ran on Flash. The model that costs $0.14 per million input tokens wrote production watchOS code that compiled first time and handled edge cases like session disconnection and background app refresh correctly.
// session management: the undo button you actually want
The pipeline is about autonomy. But autonomy without recovery is reckless. Pi's session model is the safety net: every phase of every pipeline run produces a full session file — every message, every tool call, every token count — stored in work/<slug>/sessions/.
When something goes wrong — and it does, maybe 1% of the time — you can go back to any phase, replay the session up to the failure point, change the model, edit the prompt, and continue. The per-phase handoff model means the context is clean: you're not replaying 200 messages of a failed implementation; you're restarting from the last verified plan.
I've used this maybe half a dozen times across hundreds of pipeline runs. The pipeline is 99% reliable for the kind of work I do — well-scoped tickets with clear verification criteria and steps that end on a green build. When it does go wrong, it's almost always a prompt that was too vague, and the fix is to expand the spec and re-run. But having the session replay available means I never hesitate to hit go on something ambitious. The downside is bounded.
// you don't need claude fable
There's a narrative in the AI coding space that you need the most expensive model for everything. That you're wasting your time if you're not running Claude Opus or GPT-5 or whatever the latest $10/m-token model is. The subtext is that cheaper models are toys — fine for chat, not for real work.
That's wrong. Or at least, it's not the whole story. The trick isn't the model — it's how you use it.
Deepseek V4 Flash costs $0.14 per million input tokens. Compare that to the $10+ range for premium models. That's a 70x difference. And for the structured, well-scoped work that the pipeline produces — tickets with clear intent, files specified, verification criteria defined — Flash handles 95% of it correctly. The remaining 5% is genuinely tricky reasoning: concurrency bugs, protocol conformance edge cases, complex state machines. For those, you escalate to V4 Pro ($1.74/m) on the phases that need it. You don't pay premium prices to verify that a build compiles or to write a summary.
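The arithmetic behind that comparison, as a small sketch. It uses only the input prices quoted in this post and the 62,468 input tokens from the token usage table earlier; output tokens are ignored because no output prices are stated here, so treat the result as a rough lower bound on the gap rather than a full bill.

```typescript
// Sketch only: input-token cost at the two price points quoted in the text.
// Output-token pricing is not stated in the post, so it is left out.

const FLASH_INPUT_PER_MILLION = 0.14;  // Deepseek V4 Flash, from the text
const PREMIUM_INPUT_PER_MILLION = 10;  // lower end of the "$10+" range

function inputCost(tokens: number, pricePerMillion: number): number {
  return (tokens / 1e6) * pricePerMillion;
}

// 10 / 0.14 is a little over 71, which the text rounds to 70x.
const priceRatio = PREMIUM_INPUT_PER_MILLION / FLASH_INPUT_PER_MILLION;

// Input tokens for the whole run in the token usage table.
const runInputTokens = 62468;
const flashRunCost = inputCost(runInputTokens, FLASH_INPUT_PER_MILLION);     // under one cent
const premiumRunCost = inputCost(runInputTokens, PREMIUM_INPUT_PER_MILLION); // about 62 cents
```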
The pipeline's complexity model bakes this in. Simple tasks run entirely on Flash. Medium tasks use Pro for planning and implementation (the reasoning-heavy phases) but Flash for review — because reviewing a diff produced by a strong model is the safest place to economise. Complex tasks use Pro for three phases. The cost scales with the cognitive load, not with the brand name.
And if you want to go further off-grid: the ds4 project lets you run Deepseek V4 Flash on a 128GB MacBook entirely offline. No API calls, no rate limits, no token costs. The same model that costs pennies in the cloud runs for free on your own hardware. A 128GB MacBook is expensive, but if you already have one, your marginal cost of coding agent usage drops to zero.
// what this changes
The shift from babysitting to autonomous pipelines changes how I think about the workday. My morning routine is now: open the PR stack from overnight, review the diffs, read the cost tables, merge the ones that look right, expand the stubs for anything the pipeline raised questions about, and re-queue. The actual coding, the implementation itself, happened while I was asleep.
This inverts the economics entirely. In the old model, my attention was the scarce resource. I could only babysit one agent at a time, one feature at a time, with a ceiling of maybe two or three meaningful changes per day. Now the bottleneck is my review velocity — how fast I can verify and merge work that was produced autonomously. The agent processes a queue overnight. I process the output in the morning. The pipeline doesn't sleep.
The tools to build this are all available today. Pi agent's extension API gives you the harness. OpenRouter or opencode-go gives you cheap, fast models. Tmux gives you process isolation and live monitoring. The rest is just wiring: hooking into tool calls, managing git branches, parsing session files for cost data. I've published the extensions I use as reusable patterns, and the judge in particular is a few hundred lines of TypeScript that replaces a whole category of "can I leave my agent unattended?" anxiety.
Set up your pipeline. Write your tickets. Go to sleep. Your agent will leave you a stack of PRs and an invoice the size of a coffee.
— AM, Amsterdam, June 2026