Production Debugging
All articles filed under Production Debugging.
5 Articles
How to Trace a Production Incident Back to the Commit
Burned 25 minutes on a Friday-morning page before I realized the responsible commit was in another team's repo. This is the four-command sequence I now run when an alert lands and `git log` on my own service comes up empty, with the outputs at each step and where the search space gets cut.
My Workers Stopped Polling: a K8s + Temporal Whodunit
Temporal workflows are stuck in Running with zero pollers, yet Temporal still reports a healthy task queue. The root cause lives one layer down: a CrashLoopBackOff in the Kubernetes worker pod, caused by a single bad environment variable. A walkthrough of debugging Temporal workers on Kubernetes the manual way (10 minutes), then with an infrastructure context layer that bridges the two systems (seconds).
Common Weak Points in Infrastructure Management: An In-Depth Guide
Managing infrastructure at scale is a complex endeavor that demands meticulous planning, robust tooling, and continuous adaptation.
5 Key Reasons You're Struggling to Debug Your Infrastructure in Under an Hour
Most infrastructure debugging sessions blow past the one-hour mark for the same five structural reasons: scattered visibility across cloud accounts, missing historical state, terraform plan output that hides downstream impact, runbooks that lag the live infrastructure, and post-merger environments that no one has fully mapped. A walkthrough of each, with concrete examples and what reduces the time.
Top 3 Weak Points in Your Infrastructure and How to Mitigate Them
Three structural patterns recur in growing infrastructure orgs: single-repo bottlenecks where dozens of teams share one approval queue, ClickOps and dead IaC that drift outside any state file, and module version fragmentation that quietly bypasses security patches. A walkthrough of each, with the practices that contain the blast radius.