Skills and tools
The word “skill” does double duty in Maestro and the overload is worth getting straight:
- Skill (package): the unit of distribution. Everything under skills/catalog/<name>/ is “the <name> skill” or “the <name> skill package.” gmail, apollo, pipeline, notify, compose: all of these are skill packages, regardless of what their operations look like internally.
- Tool vs. skill (operation kind): the conceptual category each operation falls into. A tool is deterministic (send a Gmail message, write a row to the contacts table). A skill operation in this sense is LLM-backed and non-deterministic (draft an opener, classify a reply’s intent).
So a skill package can contain a mix of tool operations and LLM-backed operations. Most of our packages are pure tool — gmail, apollo, pipeline, notify. The compose package is pure LLM-backed. None of our v1 packages mix the two within one package, but the architecture supports it (a hypothetical gmail-with-summarize could).
Knowing which kind an operation is matters for three reasons: cost forecasting, debugging, and trust. When something goes wrong in a run, the first question is “was the failure in a tool or an LLM-backed op?” — they fail differently and you fix them differently.
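To make the package/operation distinction concrete, here is a minimal sketch of the hypothetical gmail-with-summarize package, with one tool operation and one LLM-backed operation side by side. The `skill` and `operation` decorators below are stand-ins written only so the example runs on its own; the `REGISTRY` dict, the method names, and the return values are assumptions, not the real SDK's behavior.

```python
# Stand-in decorators for illustration only; the real maestro_skills
# decorators may work differently.
REGISTRY: dict[str, dict[str, str]] = {}


def operation(id: str, kind: str = "tool"):
    """Tag a method with an operation id and a kind ("tool" or "llm")."""
    if kind not in ("tool", "llm"):
        raise ValueError(f"unknown kind: {kind}")

    def wrap(fn):
        fn._op = {"id": id, "kind": kind}
        return fn

    return wrap


def skill(name: str, version: str):
    """Record every tagged method of the class under the package name."""

    def wrap(cls):
        REGISTRY[name] = {
            attr._op["id"]: attr._op["kind"]
            for attr in vars(cls).values()
            if hasattr(attr, "_op")
        }
        return cls

    return wrap


@skill(name="gmail-with-summarize", version="0.1.0")  # hypothetical package
class GmailWithSummarize:
    @operation(id="send_message")  # kind defaults to "tool"
    def send_message(self, to: str, body: str) -> dict:
        return {"to": to, "queued": True}

    @operation(id="summarize_thread", kind="llm")
    def summarize_thread(self, thread_id: str) -> str:
        # A real implementation would call a model here.
        return f"summary of {thread_id}"


kinds = REGISTRY["gmail-with-summarize"]
print(kinds)
```

One package, two kinds: the registry entry is what lets a debugger answer "tool or LLM-backed?" per operation rather than per package.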
Anatomy of a skill package
A skill package contains:
- A manifest (manifest.yaml) declaring the package’s name, version, required secrets, and concurrency limits.
- One or more operations: methods on a class decorated with @skill(...). Each operation method is decorated with @operation. Each operation has a kind (tool by default; llm when LLM-backed).
- Optional icon and description for the catalog UI.
The Maestro UI surfaces all operations in the Tools catalog and surfaces the packages they belong to in the Skills catalog. When a run executes, the timeline pill is colored by kind so you can see at a glance which steps were LLM-driven.
A note on the kind annotation. The @operation(kind="tool" | "llm") annotation is what will drive the run timeline’s pill colors. Today the kind is captured in operation descriptions and is informational; the colored-pill UI lands in an upcoming release.
Why one package, two pill colors
Bundling the deterministic and LLM-backed operations of a domain together (e.g. Gmail’s reads + sends could live alongside an LLM-backed Gmail summarizer) keeps the credentials in one place and the related code colocated. Splitting them by kind at the operation level keeps the cost/trust distinction visible in every run.
Where skills live
skills/
├── sdk/                      # Decorators, registry, secret resolution
│   └── src/maestro_skills/
└── catalog/                  # Shipping skill packages
    ├── http/                 # Built-in HTTP escape hatch
    ├── web-research/         # Tavily-backed search and extract
    └── gmail/                # OAuth + read/send/label
The SDK is published to the runtime as a Python package; the catalog directory is scanned at boot for skill.toml manifests. Adding a skill means dropping a new directory under skills/catalog/.
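The boot-time scan can be pictured roughly as follows. This is a sketch under assumptions, not the runtime's actual loader: the function name `discover_packages` is hypothetical, and the manifest filename is a parameter so it can match whatever file the runtime really looks for.

```python
import tempfile
from pathlib import Path


def discover_packages(catalog: Path, manifest_name: str = "skill.toml") -> list[str]:
    """Return the names of direct subdirectories of `catalog` that hold a manifest."""
    return sorted(
        child.name
        for child in catalog.iterdir()
        if child.is_dir() and (child / manifest_name).is_file()
    )


# Build a throwaway catalog to show the behavior.
with tempfile.TemporaryDirectory() as tmp:
    catalog = Path(tmp) / "skills" / "catalog"
    for name in ("http", "gmail"):
        (catalog / name).mkdir(parents=True)
        (catalog / name / "skill.toml").write_text(f'name = "{name}"\n')
    (catalog / "drafts").mkdir()  # no manifest, so the scan skips it
    found = discover_packages(catalog)

print(found)
```

A directory without a manifest is ignored, which is why dropping in a complete package directory is enough to register it.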
Anatomy of an operation
    from maestro_skills import skill, operation
    from pydantic import BaseModel


    class ListInboxIn(BaseModel):
        query: str = ""
        max_results: int = 25


    @skill(name="gmail", version="0.1.0")
    class Gmail:
        @operation(id="list_inbox", kind="tool")
        async def list_inbox(self, input: ListInboxIn) -> list[dict]:
            """Return recent threads matching `query`."""
            ...
Three things to notice:
- JSON Schema is generated from the Pydantic input model. The agent doesn’t see Python types; it sees the JSON Schema and decides what to pass.
- kind="tool" means the timeline renders a deterministic-color pill. Use kind="llm" for operations that call a model.
- Secrets are not parameters. They are resolved from the workspace’s vault by the SDK’s Secrets class, so the LLM never sees them and can’t accidentally leak them in a tool call.
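For the first point, this is roughly what the agent ends up seeing for the ListInboxIn model. The sketch assumes Pydantic v2, where `model_json_schema()` is the schema call; how the SDK invokes it internally is not shown in this document.

```python
import json

from pydantic import BaseModel


class ListInboxIn(BaseModel):
    query: str = ""
    max_results: int = 25


# Pydantic v2 API; on Pydantic v1 the equivalent call is ListInboxIn.schema().
schema = ListInboxIn.model_json_schema()
print(json.dumps(schema, indent=2))
```

Note that the schema has no field for credentials: there is nothing secret-shaped for the model to fill in or echo back.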
The full author guide lives at Skills overview.
Why this matters at runtime
When a run step is recorded, it carries a kind field (llm, skill_op, or decision). The dashboard renders these as differently-colored pills:
- LLM steps are the cost ledger. If a run cost more than expected, the pill colors tell you whether the volume was in cheap tool calls or expensive model calls.
- Tool steps are the determinism ledger. If a run produced a bad result, the pill colors tell you whether the failure was an API quirk (deterministic) or a model hallucination (non-deterministic).
The split also gives the run timeline a “shape.” A healthy cold-leads run looks like: tool (find leads) → tool (enrich) → llm (draft openers) → tool (queue sends). Anything wildly different from that shape is worth investigating.
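A minimal sketch of checking a run against that shape, assuming each step is available as a dict with a `kind` entry. The `label` field, the helper names, and the use of "tool"/"llm" as kind values (following the shape notation above) are assumptions for illustration. Consecutive steps of the same kind are collapsed, so the healthy shape reads tool → llm → tool however many leads are processed.

```python
from collections import Counter
from itertools import groupby

# Healthy cold-leads shape after collapsing runs of the same kind.
HEALTHY_SHAPE = ["tool", "llm", "tool"]


def kind_counts(steps: list[dict]) -> Counter:
    """Count steps per kind: a rough view of where the volume was."""
    return Counter(step["kind"] for step in steps)


def collapsed_shape(steps: list[dict]) -> list[str]:
    """Collapse consecutive steps of the same kind into one stage."""
    return [kind for kind, _ in groupby(step["kind"] for step in steps)]


# Hypothetical step records for one run.
steps = [
    {"label": "find leads", "kind": "tool"},
    {"label": "enrich", "kind": "tool"},
    {"label": "draft openers", "kind": "llm"},
    {"label": "queue sends", "kind": "tool"},
]

counts = kind_counts(steps)
shape = collapsed_shape(steps)
looks_healthy = shape == HEALTHY_SHAPE
print(counts, shape, looks_healthy)
```

A run whose collapsed shape starts with llm, or bounces between llm and tool many times, would fail this comparison and is the kind of outlier worth opening in the timeline.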
What about MCP, function calling, “tools”?
Anthropic’s API has a tools parameter that lets the model call functions. Maestro uses it. The tools Maestro passes to the model are the JSON Schemas of the operations the agent has access to — both deterministic and LLM-backed alike, since from the model’s perspective they’re all just callable functions.
The conceptual split (tool vs. skill) is Maestro’s abstraction for cost and trust. The wire-protocol detail (Anthropic tools) is unrelated. Both happen to be called “tools” in their respective contexts — context disambiguates which one is meant.
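The sketch below shows how operations of both kinds could be flattened into entries for the Anthropic `tools` parameter. The entry fields `name`, `description`, and `input_schema` are the ones the Messages API expects; the `to_anthropic_tool` helper, the shape of the `operations` list, and the `draft_opener` operation are hypothetical and not taken from Maestro's code.

```python
def to_anthropic_tool(op_id: str, description: str, input_schema: dict) -> dict:
    """Shape one operation as an entry for the Messages API `tools` parameter."""
    return {"name": op_id, "description": description, "input_schema": input_schema}


# Hypothetical operation records; `kind` is Maestro-side bookkeeping.
operations = [
    {
        "id": "list_inbox",
        "kind": "tool",
        "description": "Return recent threads matching `query`.",
        "schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer"},
            },
        },
    },
    {
        "id": "draft_opener",
        "kind": "llm",
        "description": "Draft an opening message for a lead.",
        "schema": {
            "type": "object",
            "properties": {"lead_id": {"type": "string"}},
            "required": ["lead_id"],
        },
    },
]

tools = [
    to_anthropic_tool(op["id"], op["description"], op["schema"]) for op in operations
]
print([tool["name"] for tool in tools])
```

The `kind` never reaches the wire: both entries look identical in structure to the model, which is why the tool/skill split stays a Maestro-side concern.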
Related
- Skills overview — how to build a skill.
- Secrets — how skills access credentials.
- Pipelines — where skill outputs end up.