Annie meets pup: turning intent into audited Datadog runbooks

We write about the tools in Anyshift's ecosystem: the CLIs and platforms that Annie integrates with. This one is about pup, Datadog Labs' agent-first CLI.

Annie's job, until now, has been to investigate. You ask a question (why is checkout slow, what depends on this service, what broke at 2am) and she traverses the versioned infrastructure graph, reads the logs, pulls the Sentry history, and hands back a diagnosis. Good. But you still have to act on it: switch to Datadog, find the right monitor, schedule a downtime, switch back to your terminal.

Introducing annie do, with Datadog's pup

annie do handles that last step. It's an internal build of the Annie CLI for engineers who run both Anyshift and Datadog, and it turns plain-English intent into audited, reviewable runbooks executed through pup.

Terminal running annie do for a neo4j upgrade that mutes every downstream service for five minutes. It prints a nine-step plan — compute the window, then write a downtime body and mute each of service:api, annie-intelligence, graph-connector and slackbot — then prompts Proceed and reports Done with 9 steps, saving the runbook under ~/.annie/runbooks/.

One plain-English request becomes a numbered, reviewable plan. Approve once, and the runbook lands on disk.

The split is deliberate. annie ask is for understanding; annie do is for acting. Different jobs, different shapes of output, different safety properties.

The shape of the problem

The obvious way to make an LLM "do things" is to put your Datadog API key in the agent's environment and let it call the API directly. We didn't want that, for three reasons:

1. Audit trail. When an LLM fires a POST /api/v2/downtime call, there's no artifact to review. No diff, no PR, no record of why. For anything that mutes alerts or declares incidents, that's a non-starter.

2. Trust boundary. Every additional system that holds your Datadog keys is a system that can leak them. We wanted Annie to stay credential-free by design.

3. Postmortem-grade evidence. A one-shot API call evaporates. When you're reconstructing what happened during the 02:00 incident, you want a file you can paste into the Slack channel: the exact request, with timestamps, in a format anyone can read.

The thing on disk is the audit trail.

The thing pup alone couldn't do

pup can mute a service by its tag. Datadog can show you the APM service map. Neither answers the question an operator actually has at the start of a maintenance window: what else am I about to break?

annie do "we're upgrading neo4j, mute every service downstream of it for 5 minutes" triggers the blast-radius path. Before it touches Datadog, Annie investigates the same way she would for an annie ask:

- The versioned infrastructure graph for explicit dependency edges
- IaC: grepping NEO4J_URI references across *.tf files
- Sentry: which issue families spike when neo4j is in trouble?
- Deployment configs and recent incident history
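The IaC pass can be pictured as a plain text search. The sketch below is an illustration under assumptions, not Annie's actual implementation: scanContent, scanTree and the Ref type are hypothetical names, and it simply walks a directory and reports each line of a *.tf file that mentions NEO4J_URI.

```go
package main

import (
	"bufio"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// Ref is one line in a Terraform file that mentions the search string.
// (Hypothetical type, for illustration only.)
type Ref struct {
	File string
	Line int
	Text string
}

// scanContent returns every line of content that contains needle.
// It is pure, so it can be exercised without touching the filesystem.
func scanContent(file, content, needle string) []Ref {
	var refs []Ref
	sc := bufio.NewScanner(strings.NewReader(content))
	for n := 1; sc.Scan(); n++ {
		if strings.Contains(sc.Text(), needle) {
			refs = append(refs, Ref{File: file, Line: n, Text: strings.TrimSpace(sc.Text())})
		}
	}
	return refs
}

// scanTree walks root and scans every *.tf file for needle.
func scanTree(root, needle string) ([]Ref, error) {
	var refs []Ref
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || filepath.Ext(path) != ".tf" {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		refs = append(refs, scanContent(path, string(data), needle)...)
		return nil
	})
	return refs, err
}

func main() {
	refs, err := scanTree(".", "NEO4J_URI")
	if err != nil {
		fmt.Fprintln(os.Stderr, "scan failed:", err)
		os.Exit(1)
	}
	for _, r := range refs {
		fmt.Printf("%s:%d: %s\n", r.File, r.Line, r.Text)
	}
}
```

A hit only says that a file references the connection string. Deciding whether that makes the service a real dependent is the part that still needs the graph and the Sentry history.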

The generated runbook opens with the evidence she gathered, embedded as comments at the top of the file:

# Resolved 4 muteable dependent(s) of neo4j-production:
#   - anyshift-backend (service:api): NEO4J_URI in back-backend.tf;
#         Sentry API-BTA / API-BT9 spike on every neo4j outage
#   - annie-intelligence (service:annie-intelligence): NEO4J_URI in
#         back-annie-intelligence.tf; Sentry ANNIE-INTELLIGENCE-1SP
#         'Failed to delete msg' correlates with neo4j TCP abort storms
#   - graph-connector (service:graph-connector): Primary neo4j writer;
#         Sentry GRAPH-CONNECTOR-QW8/QWY/QWV/QWZ/QWW → DLQ on unavailability
#   - anyshift-slackbot (service:slackbot): Indirect via anyshift-backend;
#         zero NEO4J_* vars in back-slackbot.tf — cascade only;
#         Sentry SLACKBOT-14V+14T+14M during neo4j storms
# Skipped 1 dependent without a Datadog service tag (operator handles separately):
#   - deepeval: direct neo4j connection in deepeval.tf, but no DD service tag

Real Sentry issue codes, real file references, the cascade reasoning for slackbot (indirect, via the backend), and the honest skip for deepeval (a dependent, but no Datadog service tag, so the operator handles it separately). No single tool produces all of this. The value is in the synthesis.

The handoff between Annie and pup

When you run annie do, two distinct phases run in sequence before pup executes anything, each doing the part it's actually good at.

Phase 1 — Annie investigates. Starting from your request, Annie uses the versioned infrastructure graph as the canonical entry point: a Cypher traversal from the named resource finds every direct and transitive dependent. Where the graph has gaps, she fills them in by searching IaC for connection strings (NEO4J_URI, REDIS_URL), reading /health endpoints, and correlating the Sentry issue families that recur during that resource's outages. She returns a single structured JSON object: the target resource, the list of affected services, and the evidence chain for each.
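To make the handoff concrete, here is a minimal sketch of how the Go side might decode that object. The field names follow the { resource, services:[{name, tag, evidence}] } shape described in this post. Everything else is an assumption for illustration: the Go type names, evidence being a single string, and a skipped dependent being represented by an empty tag.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// Service is one affected dependent. The JSON keys come from the post;
// the Go types (and evidence as a plain string) are assumptions.
type Service struct {
	Name     string `json:"name"`
	Tag      string `json:"tag"`
	Evidence string `json:"evidence"`
}

// Investigation is the single object Phase 1 hands to Phase 2.
type Investigation struct {
	Resource string    `json:"resource"`
	Services []Service `json:"services"`
}

// parseInvestigation decodes the object and rejects one with no target.
func parseInvestigation(raw []byte) (Investigation, error) {
	var inv Investigation
	if err := json.Unmarshal(raw, &inv); err != nil {
		return Investigation{}, fmt.Errorf("decode investigation: %w", err)
	}
	if inv.Resource == "" {
		return Investigation{}, errors.New("investigation has no target resource")
	}
	return inv, nil
}

// partition separates services that carry a Datadog tag from those that
// do not. An empty tag meaning "skip" is a hypothetical convention.
func partition(services []Service) (muteable, skipped []Service) {
	for _, s := range services {
		if s.Tag == "" {
			skipped = append(skipped, s)
		} else {
			muteable = append(muteable, s)
		}
	}
	return muteable, skipped
}

func main() {
	raw := []byte(`{
	  "resource": "neo4j-production",
	  "services": [
	    {"name": "graph-connector", "tag": "service:graph-connector", "evidence": "Primary neo4j writer"},
	    {"name": "deepeval", "tag": "", "evidence": "direct neo4j connection in deepeval.tf, but no DD service tag"}
	  ]
	}`)
	inv, err := parseInvestigation(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	muteable, skipped := partition(inv.Services)
	fmt.Printf("%s: %d muteable, %d skipped\n", inv.Resource, len(muteable), len(skipped))
}
```

Keeping the skipped dependents in the same object, rather than dropping them, is what would let the runbook header list them for the operator.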

Phase 2 — annie-cli renders the runbook. The Go layer takes that JSON and deterministically templates it into a multi-step YAML runbook, 2N+1 steps for N services. No LLM involved here. The iteration is exact, the output is consistent, and the operator can read every byte before approving.

annie do "..."
   │
   ▼  Phase 1 — Annie (LLM): graph traversal + IaC search + Sentry correlation
   │           → one JSON object { resource, services:[{name, tag, evidence}] }
   │
   ▼  Phase 2 — annie-cli (Go, no LLM): render the mute YAML deterministically
   │           → 2N+1 steps for N services
   │
   ▼  pup: import + validate + execute

Annie is good at the open-ended synthesis: investigate, weigh evidence, map service names to Datadog tags. She's less reliable at the closed-ended part: emitting N parallel YAML blocks with identical structure. So we let each side do what it's good at. Annie returns one structured object; annie-cli templates the YAML from it in Go, where loops are loops.
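The 2N+1 count falls out of the plan's shape: one step to compute the window, then a shell step and a pup step per service. The sketch below shows that loop under assumptions. Step, Service and renderSteps are hypothetical names, and the step descriptions are plain text rather than the real runbook YAML or real pup commands, neither of which is reproduced here.

```go
package main

import "fmt"

// Service is the subset of the Phase 1 output the renderer needs.
type Service struct {
	Name string
	Tag  string
}

// Step is one entry in the rendered plan. Kind is "shell" or "pup".
// (Hypothetical shape; the real runbook schema is not shown in this post.)
type Step struct {
	Kind string
	Desc string
}

// renderSteps builds the plan deterministically: one step to compute the
// window, then two steps per service, for 2N+1 steps in total.
func renderSteps(services []Service, minutes int) []Step {
	steps := make([]Step, 0, 2*len(services)+1)
	steps = append(steps, Step{
		Kind: "shell",
		Desc: fmt.Sprintf("compute the %d-minute downtime window", minutes),
	})
	for _, s := range services {
		steps = append(steps,
			Step{Kind: "shell", Desc: "write downtime body for " + s.Tag},
			Step{Kind: "pup", Desc: "create downtime for " + s.Tag},
		)
	}
	return steps
}

func main() {
	services := []Service{
		{Name: "anyshift-backend", Tag: "service:api"},
		{Name: "annie-intelligence", Tag: "service:annie-intelligence"},
		{Name: "graph-connector", Tag: "service:graph-connector"},
		{Name: "anyshift-slackbot", Tag: "service:slackbot"},
	}
	for i, st := range renderSteps(services, 5) {
		fmt.Printf("%d. [%s] %s\n", i+1, st.Kind, st.Desc)
	}
}
```

With the four muteable services from the neo4j example, this yields the nine-step plan shown in the terminal. The same input always produces the same steps in the same order, which is the property the LLM could not be relied on to provide.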

Annie never touches Datadog. pup never touches Annie. The YAML on disk is the contract between them.

Vim showing the generated runbook YAML: per-service shell steps that write a Datadog downtime body to a temp file, each followed by a pup step that runs downtime create against it.

The runbook annie-cli rendered: one shell step writes each downtime body, one pup step pushes it. Reviewable before a single monitor is muted.
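For readers who haven't seen a downtime body, here is a rough sketch of what one of those temp files might contain. It assumes the body follows the public Datadog v2 downtime schema (a scope, a monitor_identifier and a one-time schedule with start and end). The exact body annie-cli writes is not shown in this post, and buildDowntime and the struct names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// These types mirror the assumed v2 downtime request shape.
type downtimeRequest struct {
	Data downtimeData `json:"data"`
}

type downtimeData struct {
	Type       string             `json:"type"`
	Attributes downtimeAttributes `json:"attributes"`
}

type downtimeAttributes struct {
	Scope             string            `json:"scope"`
	MonitorIdentifier monitorIdentifier `json:"monitor_identifier"`
	Schedule          schedule          `json:"schedule"`
}

type monitorIdentifier struct {
	MonitorTags []string `json:"monitor_tags"`
}

type schedule struct {
	Start string `json:"start"`
	End   string `json:"end"`
}

// buildDowntime returns a one-time downtime for every monitor in scope,
// starting at startUnix (seconds since the epoch) and lasting minutes.
func buildDowntime(scope string, startUnix int64, minutes int) downtimeRequest {
	start := time.Unix(startUnix, 0).UTC()
	end := start.Add(time.Duration(minutes) * time.Minute)
	return downtimeRequest{Data: downtimeData{
		Type: "downtime",
		Attributes: downtimeAttributes{
			Scope:             scope,
			MonitorIdentifier: monitorIdentifier{MonitorTags: []string{"*"}},
			Schedule: schedule{
				Start: start.Format(time.RFC3339),
				End:   end.Format(time.RFC3339),
			},
		},
	}}
}

func main() {
	body := buildDowntime("service:api", time.Now().Unix(), 5)
	out, err := json.MarshalIndent(body, "", "  ")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(string(out))
}
```

Because the window is computed once and reused for every service, all the downtimes in a blast-radius mute would start and end together.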

And it lands in Datadog. The canary monitor is muted for the window, with scope and remaining time visible, exactly as the runbook described:

Datadog monitor list showing the anyshift-mute-test canary in a muted state, with a popover reading Muted Scope: All Monitor Scopes, Elapsed 13 seconds, Remaining 4 minutes.

Where it's going

annie do ships with a curated set of actions today (single-service mute, blast-radius mute, listing), but the pattern is additive. New action categories slot into the same shape: have the LLM investigate and return structured evidence, then render the runbook in code. Each new capability is an entry in an allowlist, a small Go template, and a worked example.
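One way to picture the allowlist is a small registry keyed by action name, where anything not registered is refused before an LLM call or a template ever runs. This is a sketch under assumptions: the keys, the Action fields and the lookup function are hypothetical, not annie-cli's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// Action describes one allowlisted capability. (Hypothetical shape.)
type Action struct {
	Summary string // what the action does
	Example string // a worked example request
}

// allowlist holds the curated actions; the keys are illustrative names.
var allowlist = map[string]Action{
	"mute-service": {
		Summary: "mute a single service",
		Example: "mute graph-connector for 5 minutes",
	},
	"mute-blast-radius": {
		Summary: "mute every service downstream of a resource",
		Example: "we're upgrading neo4j, mute every service downstream of it for 5 minutes",
	},
	"list": {
		Summary: "list existing items without changing anything",
		Example: "list the current downtimes",
	},
}

// lookup returns the action for name, or an error if it is not allowlisted.
func lookup(name string) (Action, error) {
	a, ok := allowlist[name]
	if !ok {
		return Action{}, fmt.Errorf("action %q is not allowlisted", name)
	}
	return a, nil
}

func main() {
	names := make([]string, 0, len(allowlist))
	for name := range allowlist {
		names = append(names, name)
	}
	sort.Strings(names) // map order is random; sort for stable output
	for _, name := range names {
		fmt.Printf("%-18s %s\n", name, allowlist[name].Summary)
	}
	if _, err := lookup("delete-monitor"); err != nil {
		fmt.Println("refused:", err)
	}
}
```

Adding a capability then means adding one map entry plus its template, which keeps the set of things annie do can emit small and reviewable.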

annie ask and annie do are two faces of the same agent: one traverses a graph plus code, errors, and incident history to understand infrastructure, and the other emits a runbook to change it. The graph is the same graph. The trust boundary is the same trust boundary. The artifact you get out of annie do slots cleanly into the same postmortem Annie helps you write with annie ask.