A data pipeline runs on Temporal. Workflows orchestrate ingestion, transformation, and loading. Everything has been humming along for weeks. Then one morning, the dashboard shows stale data. The numbers haven't moved since last night.
The on-call engineer opens the Temporal UI. Workflows are stuck in "Running", none of them completing. The task queue shows zero pollers. No workers are picking up tasks.
The natural instinct: something is wrong with Temporal.
But is it?
Spoiler: this is not a complex outage. The root cause is almost embarrassingly simple. But that's exactly what makes it a good example. Most production incidents aren't exotic. They're straightforward problems hiding behind a layer of indirection. The interesting question isn't what broke; it's how fast you can find it.
The setup
The pipeline is straightforward. A Go worker registers a GreetingWorkflow on a task queue called greeting-queue. The workflow executes a single activity, returns a result, and completes. The worker runs as a Kubernetes Deployment in the same cluster as the Temporal server.
When everything works, the flow looks like this:
1. A workflow is started on the greeting-queue task queue
2. The worker picks it up within milliseconds
3. The activity executes and returns
4. The workflow completes

This is the healthy state. The worker is polling, workflows complete in under a second, and the dashboard updates in real time.
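For context, the worker side of this setup fits in a few dozen lines. Here is a minimal sketch using the Temporal Go SDK (go.temporal.io/sdk); the workflow, activity, and queue names match the ones in this story, but the code itself is an illustration, not the pipeline's actual source:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// GreetingWorkflow executes a single activity and returns its result.
func GreetingWorkflow(ctx workflow.Context, name string) (string, error) {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
	})
	var greeting string
	err := workflow.ExecuteActivity(ctx, GreetingActivity, name).Get(ctx, &greeting)
	return greeting, err
}

// GreetingActivity is the single unit of work the workflow runs.
func GreetingActivity(ctx context.Context, name string) (string, error) {
	return "Hello, " + name + "!", nil
}

func main() {
	// Dial with defaults connects to localhost:7233; in the cluster this
	// would point at the Temporal frontend service instead.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to connect to Temporal:", err)
	}
	defer c.Close()

	w := worker.New(c, "greeting-queue", worker.Options{})
	w.RegisterWorkflow(GreetingWorkflow)
	w.RegisterActivity(GreetingActivity)

	// Run blocks here, polling greeting-queue until the process is interrupted.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker stopped:", err)
	}
}
```

The detail that matters for this incident: the worker only appears as a poller on greeting-queue once `w.Run` is reached. Anything that kills the process before that line leaves the queue empty.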
The silence
Now something changes. A new deployment rolls out. The dashboard goes stale. The engineer opens the Temporal UI and starts a workflow manually. It times out. The task is scheduled but never picked up.

Early in the incident, the task queue page may even still show a poller: Temporal keeps reporting the last known worker for up to five minutes after it stops polling. Either way, everything looks fine on the Temporal side. The server is working correctly: it scheduled the task and is patiently waiting for a worker to claim it.
This is the trap. Temporal is doing exactly what it should. The problem is elsewhere.
Digging into Kubernetes
The engineer switches to kubectl. The worker pod exists, but something is off:
$ kubectl get pods -n temporal -l app=greeting-worker
NAME                               READY   STATUS             RESTARTS      AGE
greeting-worker-5cd897fbcb-w5b58   0/1     CrashLoopBackOff   5 (30s ago)   3m

CrashLoopBackOff. The container is crashing on every start. Kubernetes keeps restarting it, but it fails every time. The backoff delay grows longer with each attempt (10s, 20s, 40s) and the worker never stays up long enough to register on the task queue.
The logs reveal the cause:
$ kubectl logs -l app=greeting-worker -n temporal
2026/04/07 08:15:42 Simulated startup crash (CRASH_ON_START=true)

An environment variable (CRASH_ON_START=true) was set in the latest deployment. The worker crashes immediately on startup, before it ever connects to Temporal. It never registers as a poller. From the Temporal side, the task queue is simply empty.
The fix is trivial once found: remove the bad environment variable, redeploy, and the worker comes back online. Workflows resume. The dashboard updates.
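On the Kubernetes side, the fix is a one-line change to the Deployment manifest. A hypothetical view of the relevant section (the manifest structure and container name are assumptions based on the pod labels above):

```yaml
# greeting-worker Deployment, trimmed to the offending entry
spec:
  template:
    spec:
      containers:
        - name: greeting-worker
          env:
            - name: CRASH_ON_START   # remove this entry (or set it to "false")
              value: "true"
```

Equivalently, `kubectl set env deployment/greeting-worker CRASH_ON_START- -n temporal` unsets the variable directly (the trailing `-` removes an env var), triggering a fresh rollout.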
A crashlooping pod. That's it. Not a networking partition, not a Temporal bug, not a resource limit, just a bad env var. The kind of thing any engineer would fix in seconds if they knew where to look.
But the investigation still took 10 minutes. Opening the Temporal UI, ruling out server issues, switching to kubectl, finding the right namespace, describing the pod, reading the logs. Each step required context-switching between tools and mental models. For a trivial root cause, that's a lot of overhead.
What Annie sees
This is where it gets interesting. Not because the problem is hard, but because it doesn't have to feel hard.
With AnyShift connected to the cluster, the same incident plays out differently. The engineer notices stale data on the dashboard. Instead of jumping between tools, they ask Annie:
"Why is my pipeline broken?"
Annie already has the full picture. She has been watching the Kubernetes cluster continuously, tracking pod states, events, restarts, and resource usage. She correlates what she sees with the infrastructure graph.

Within seconds, Annie responds:
Root Cause: `greeting-worker` pods in CrashLoopBackOff

Your `greeting-worker` pods are crashing on startup. The container exits immediately with a fatal error related to a misconfigured environment variable. Since no healthy workers are running, no one is polling the `greeting-queue` task queue, so workflows are queuing up and timing out.

- `greeting-worker-6f58c576df-tbhjf`: crashing since 08:36 UTC
- `greeting-worker-6f58c576df-2vvz2`: crashing since 08:55 UTC

Recommended fix: check the environment variables on the `greeting-worker` Deployment. The latest rollout introduced a configuration change that is causing the container to exit before connecting to Temporal.
One question, one answer. No tab-switching. No guessing which namespace to check. No scrolling through kubectl describe output looking for the relevant line. Annie traced the symptom (stale data) back through the timed-out workflows, through the empty task queue, all the way to the crashing pods.

The takeaway
This was a simple incident. A crashlooping pod caused by a bad environment variable. Any engineer could have found it, and any engineer would have fixed it in seconds.
The point isn't that the problem was hard. It's that even simple problems take too long when the symptom and the cause live in different systems. Temporal tells you what is stuck. Kubernetes tells you why. Bridging the two means context-switching between UIs, CLIs, and mental models, all for a root cause that turns out to be one bad config line.
Annie eliminates that gap. She watches both sides continuously and connects symptoms to causes in seconds, so you can skip the investigation and go straight to the fix. The simpler the root cause, the more frustrating it is to spend 10 minutes hunting for it, and the more obvious the value of having someone who already knows the answer.
Want Annie investigating your infrastructure? Get started with AnyShift

