
Decomposing the cold-leads agent: a 9× cost reduction

Cold-leads used to run on a 130-line instructions blob and burn 110K tokens per run. Replacing the blob with a graph cut the cost to about $0.04 per run, and the drafts got better. Here's why.

When we shipped the v1 hero score in March, the cold-leads agent worked. It found leads, drafted opener emails, wrote them to the Pipeline for human review. Three design partners had it running on real ICPs. Cost per run: roughly 38 cents.

Six weeks later, the same agent runs on the same ICPs and costs about four cents. Same leads. Better drafts. Faster runs.

The architecture changed underneath. This is what changed and why.

The problem with the blob

The original cold-leads was a single agent driven by one 130-line prose instructions field. Every cron tick, the runtime handed that prose to Claude as a system prompt and let the model decide what to do next, tool call by tool call, until it hit end_turn.
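That loop is roughly the standard Anthropic tool-use pattern. A minimal sketch of its shape, not the actual runtime code (the model id, tool schemas, and the dispatch_tool helper are stand-ins):

import anthropic

client = anthropic.Anthropic()

def dispatch_tool(name: str, args: dict) -> str:
    # Stand-in for the real skill dispatcher (apollo.*, pipeline.*, compose.*).
    raise NotImplementedError

def run_cold_leads(instructions_blob: str, tool_schemas: list[dict]) -> str:
    messages = [{"role": "user", "content": "Run the cold-leads playbook."}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",   # placeholder model id
            max_tokens=4096,
            system=instructions_blob,           # the whole 130-line prose blob
            tools=tool_schemas,                 # apollo.*, pipeline.*, compose.* schemas
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # end_turn: the model decided it is done; its final text is the recap
            return "".join(b.text for b in response.content if b.type == "text")

        # Every dispatch costs a full thinking turn: the model re-reads the whole
        # history, including every earlier tool result, before acting again.
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": dispatch_tool(block.name, block.input),
                })
        messages.append({"role": "user", "content": results})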

A typical run looked like this in the timeline:

[llm]      thinking… (1 tool call)
[apollo]   find_leads · 10 candidates
[llm]      thinking… (3 tool calls)
[pipeline] find_contact · not found
[apollo]   bulk_enrich_leads · 3 matches
[apollo]   enrich_domain · ok
[llm]      thinking… (2 tool calls)
[compose]  draft_personalized_opener · ok
[pipeline] add_contact · ok
[pipeline] log_activity · ok
... (repeats per candidate, with thinking turns between every dispatch)
[llm]      "Here's what I did: ..."

About 110,000 tokens for one run that produced three email drafts.

The model spent most of those tokens not on the work but on the coordination of the work. Every tool call was preceded by a thinking turn — Claude reading its own past tool results, deciding what to do next, planning the next dispatch. Multiply that by ~10 dispatches per run, and the per-step orchestration cost dwarfed the actual reasoning.

The other failure mode was less visible but worse: the agent re-enriched the same Apollo records on every cron tick. The instructions told it to dedupe via a pipeline.find_contact lookup — but that lookup happened after enrichment, when the credit had already been spent. Each duplicate lead cost an Apollo credit it didn’t need to.

Both of these are symptoms of the same thing: the agent’s instructions blob was the architecture. Decisions a for loop should be making (iterate the candidates), conditionals a find_contact short-circuit should be making (skip if already in pipeline), transformations a static map should be making (intent → pipeline stage) — all of them landed on the LLM’s plate, in prose.
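Written as the plain code it should have been, the deterministic part looks something like this (all names here are illustrative, not Maestro's API; the intent keys are invented for the example):

INTENT_TO_STAGE = {                 # the static map: intent -> pipeline stage
    "reply_positive": "qualified",
    "opener_sent": "contacted",
}

def process(candidates, apollo, pipeline, compose):
    for candidate in candidates:                      # the for loop
        if pipeline.find_contact(candidate) is not None:
            continue                                  # the short-circuit: skip before
                                                      # spending an Apollo credit
        lead = apollo.bulk_enrich_leads([candidate])[0]
        draft = compose.draft_personalized_opener(lead)   # a step that genuinely needs a model
        pipeline.add_contact(lead, stage=INTENT_TO_STAGE["opener_sent"])
        pipeline.log_activity(draft)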

What “decomposed” means

The shift was structural: the agent’s behavior moved from prose into a graph of typed nodes connected by typed edges.

Three node kinds: deterministic skill nodes that wrap a single tool call (Apollo search and enrichment, Pipeline reads and writes), LLM nodes that make one scoped model call, and map_over nodes that run their children once per item in a list.

Cold-leads decomposed into ten nodes — two LLM nodes (one shortlist filter, one draft writer) and eight deterministic ones — with the per-candidate work scoped inside a map_over.
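A condensed, declarative sketch of that graph (node ids and most config keys are invented for illustration; map_over, output ports, and the skip_iter_when_field_set knob described just below are the real concepts, and a couple of nodes and most edges are omitted for brevity):

COLD_LEADS_GRAPH = {
    "nodes": [
        {"id": "find_leads",    "kind": "skill", "skill": "apollo.find_leads"},
        {"id": "per_candidate", "kind": "map_over", "items": "find_leads.candidates"},
        # everything below runs inside the map_over scope, once per candidate
        {"id": "dedupe",     "kind": "skill", "skill": "pipeline.find_contact"},
        {"id": "enrich",     "kind": "skill", "skill": "apollo.bulk_enrich_leads",
         "config": {"skip_iter_when_field_set": "contact"}},   # guard before spending credits
        {"id": "enrich_org", "kind": "skill", "skill": "apollo.enrich_domain"},
        {"id": "shortlist",  "kind": "llm",   "prompt": "shortlist_filter"},
        {"id": "draft",      "kind": "llm",   "prompt": "draft_writer"},
        {"id": "save",       "kind": "skill", "skill": "pipeline.add_contact"},
        {"id": "log",        "kind": "skill", "skill": "pipeline.log_activity"},
    ],
    "edges": [
        {"from": "dedupe", "port": "not_found", "to": "enrich", "transform": None},
        {"from": "enrich_org", "to": "draft",
         "transform": {"technologies": "tech_stack"}},
        # remaining edges omitted
    ],
}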

The runtime walks the graph topologically. Each node sees its merged input from upstream edges, executes, and writes its output for downstream nodes. Branches are handled by output ports (exists vs not_found on find_contact); short-circuits are handled by _skip_iter flags or per-node config knobs (skip_iter_when_field_set: contact).
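A minimal sketch of that walk, assuming simplified in-memory structures and leaving map_over scoping aside; the real orchestrator persists per-node state to score_run_nodes rows:

from graphlib import TopologicalSorter

def run_graph(nodes, edges, execute_node):
    by_id = {n["id"]: n for n in nodes}
    preds = {n["id"]: [e["from"] for e in edges if e["to"] == n["id"]] for n in nodes}
    outputs: dict[str, dict] = {}

    for node_id in TopologicalSorter(preds).static_order():
        node = by_id[node_id]

        # Merge upstream outputs along inbound edges. An edge bound to a port
        # only contributes if that port actually fired upstream.
        merged: dict = {}
        for edge in (e for e in edges if e["to"] == node_id):
            upstream = outputs.get(edge["from"], {})
            if edge.get("port") and upstream.get("_port") != edge["port"]:
                continue
            renames = edge.get("transform") or {}           # None means identity
            merged |= {renames.get(k, k): v for k, v in upstream.items()}

        # Per-node short-circuit: if the guard field already has a value,
        # skip this node for the current iteration.
        guard = node.get("config", {}).get("skip_iter_when_field_set")
        if guard and merged.get(guard):
            outputs[node_id] = dict(merged, _skip_iter=True)
            continue

        outputs[node_id] = execute_node(node, merged)

    return outputs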

Two genuine LLM calls per run, instead of N+1 thinking turns. The deterministic nodes do their part of the work without burning tokens on coordination.

What the numbers actually look like

Same workspace, same Apollo plan, same ICP, same network. We ran the LLM-loop version and the graph version against the same candidate pool back-to-back:

                           LLM-loop (v1)    Graph (v1.1)
Apollo credits             3                3
Anthropic input tokens     ~107,000         ~5,400
Anthropic output tokens    ~3,200           ~1,200
USD cost                   $0.38            $0.04
Wall time                  ~1m 07s          ~22s
Drafts produced            3                3

About 9× cheaper end-to-end. Apollo cost is identical (same calls, same per-record price). The savings are entirely from removing the LLM’s per-step orchestration overhead.

Three runs over the same week landed in the $0.03–$0.05 range. The variance is dominated by the draft LLM’s output length per candidate, not by orchestration.

Side effects we didn’t expect

The drafts got better. This was a surprise. The LLM-loop version had to read its own tool results back as context on every turn — bulk_enrich returned a 4KB JSON blob, enrich_domain returned another, and the model carried all of it forward in conversation history while drafting. Result: the drafts often regressed to generic openers that mentioned “your company” without grounding in specifics.

The graph version’s draft node sees only the upstream node outputs that explicitly flowed into it via edges. The system prompt asks for an opener grounded in a specific signal (funding stage, headcount, tech stack). The merged input contains exactly those fields, plus the lead’s name and title — and nothing else. The LLM can’t help but ground in specifics, because there’s nothing else to ground in.
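Concretely, the merged input the draft node sees is something on this order; the field names and values here are illustrative, not the real schema:

draft_input = {
    "first_name": "Sarah",                          # from apollo.find_leads
    "title": "Head of Platform Engineering",        # illustrative
    "company": "ExampleCo",                         # illustrative
    "funding_stage": "Series A",                    # illustrative
    "headcount": 85,                                # illustrative
    "tech_stack": ["Kubernetes on EKS", "ChatGPT", "GitHub Copilot"],
}
# versus the v1 loop, where the equivalent drafting turn carried the full 4KB
# enrich_domain payload plus every earlier tool result in conversation history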

A real opener from the latest run, lightly anonymized:

Subject: AI agent orchestration for ForUsAll’s EKS stack

Hi Bhavani,

Noticed ForUsAll is running Kubernetes on EKS with ChatGPT and Copilot already in the mix — looks like AI tooling is becoming a real part of your stack.

Maestro is a self-hostable platform for scheduling and chaining AI agents into multi-step workflows, which tends to fit well when teams start stringing together multiple AI calls.

Worth a 20-minute call to see if it’s relevant? Happy to work around your schedule.

The “EKS with ChatGPT and Copilot” detail came from apollo.enrich_domain’s technologies array — the LLM saw it on a pre-formatted edge and pulled the concrete signal into the opener. The LLM-loop version had access to the same data, but buried in 4KB of JSON noise; it usually wrote “your modern stack” instead.

Cross-run dedup actually works now. Adding apollo_person_id to the contacts table and letting the orchestrator short-circuit on skip_iter_when_field_set: contact before enrichment means the second run on the same Apollo search consumes zero credits for already-known leads. With the LLM-loop version, the dedup lived in prose (“if email is empty, skip”) and missed the case where the email had been revealed on a previous run — every duplicate cost a credit.
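In sketch form, the cheap dedup check looks roughly like this; the storage layer and column set are simplified, and only apollo_person_id, the ports, and the skip_iter_when_field_set guard are the real pieces:

import sqlite3

def find_contact(db: sqlite3.Connection, candidate: dict) -> dict:
    # Assumes row_factory = sqlite3.Row; the real store will differ.
    row = db.execute(
        "SELECT * FROM contacts WHERE apollo_person_id = ?",
        (candidate["apollo_person_id"],),
    ).fetchone()
    if row is None:
        return {"contact": None, "_port": "not_found"}
    # A hit sets the contact field, and the skip_iter_when_field_set: contact
    # guard on the enrichment step skips the rest of this candidate's iteration
    # before any Apollo credit is spent.
    return {"contact": dict(row), "_port": "exists"}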

Run summaries got cheap. The LLM-loop version ended every run with “tell me what you did” — typically 200–400 output tokens of prose recapping work the dashboard already showed. The graph version generates the run summary deterministically from the score_run_nodes rows: “Drafted 3 openers (Sarah, Alex, Maria). Apollo: 3 credits. Anthropic: 5.4K tokens · $0.04.” Same information, no tokens.
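The summary generator is small enough to sketch in full; the score_run_nodes field names here are guesses, while the output format is the real one:

def summarize_run(run_nodes: list[dict]) -> str:
    drafts = [n["output"]["first_name"] for n in run_nodes if n["node_id"] == "draft"]
    credits = sum(n.get("apollo_credits", 0) for n in run_nodes)
    tokens = sum(n.get("anthropic_tokens", 0) for n in run_nodes)
    usd = sum(n.get("usd_cost", 0.0) for n in run_nodes)
    return (
        f"Drafted {len(drafts)} openers ({', '.join(drafts)}). "
        f"Apollo: {credits} credits. "
        f"Anthropic: {tokens / 1000:.1f}K tokens · ${usd:.2f}."
    )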

What didn’t get easier

Two things to flag honestly. Both were real, both got smaller as the architecture matured, and both are still recurring shapes.

Edge transforms surface unexpected ergonomic questions. When does a rename also identity-merge? When does an empty transform mean “no data, sequencing only”? When the find_contact node returns {contact: null}, do downstream nodes see that field, and does it matter? We got these wrong twice and shipped patches both times. The third time we got the rule right (null = identity, {X: Y} = identity + renames, no special case for {}), and the rule turned out to be obvious in retrospect — but only after writing the bug.
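The rule we landed on, as a worked example; apply_transform and the field names are illustrative, and under this reading a {contact: null} output does flow through to downstream nodes:

def apply_transform(upstream: dict, transform: dict | None) -> dict:
    # null  -> identity: everything passes through unchanged
    # {X:Y} -> identity plus renames: everything passes through, X arrives as Y
    # {}    -> no special case: behaves exactly like null
    renames = transform or {}
    return {renames.get(key, key): value for key, value in upstream.items()}

upstream = {"contact": None, "organization": {"name": "ExampleCo"}}
assert apply_transform(upstream, None) == upstream
assert apply_transform(upstream, {}) == upstream
assert apply_transform(upstream, {"organization": "company"}) == {
    "contact": None,
    "company": {"name": "ExampleCo"},
}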

Existing data state is real. The graph runs perfectly on a fresh install. On a workspace that already had pipeline contacts from the LLM-loop era, the cheap dedup pass missed them (those rows had no apollo_person_id), and the next checkpoint — a post-enrichment dedup by email — had to be added to bridge legacy data. Migrations and seeded fixtures aren’t free. If we’d thought about the data-state question earlier, we’d have shipped the email checkpoint as part of v1.

Why this matters beyond cold-leads

The cold-leads decomposition is one score. The same vocabulary — deterministic nodes, LLM nodes, map_over, ports, transforms — composes the next score, and the next, and the one a customer authors themselves in the Composer when it ships. We didn’t add a new architecture piece; we used what was already there.

That’s the real bet. A small set of node kinds plus a working orchestrator is enough to express most real outbound, inbound, and research-and-outreach workflows. The bottleneck isn’t the architecture — it’s the size of the skill catalog, and the absence (today) of a visual editor.

Both of those are coming. The graph that powers cold-leads exists in the database as score_nodes + score_edges rows. Rendering it as a React Flow canvas — read-only first, then editable — is the next step. After that, customers compose their own.

Summary

We replaced a 130-line LLM-driven prompt with a 10-node graph. Same workload, 9× cheaper, faster, with better drafts. The architecture is genuinely flexible and ports to the next score with no new pieces. The Composer that lets operators build their own scores has the data model it needs.

Next post: how the Composer renders that graph as something an operator can edit, and what trade-offs surface when you let people drag boxes around to compose AI workflows.

