Software engineering observability
Core ideas
- Like a physician inferring brain health from external exams, you rarely get full visibility into model internals. Invest instead in rich telemetry around the AI so the system becomes observable and improvable.
- Two lenses: software engineering observability asks whether the system is operational and how it is behaving; data science observability (Chapter 10) asks how well it is performing.
- The chapter follows the AWS Observability Maturity Model as a staircase: Stage 1 foundational monitoring → Stage 2 core insights → Stage 3 advanced correlation → Stage 4 proactive and self-healing behavior. Maturity is never “done”; systems keep changing.
- Telemetry (raw operational data) feeds monitoring across logs, traces, and metrics; for AI, prioritize capturing model inputs/outputs, configuration, tokens, cost, tool calls, and linkage metadata so logs are not orphaned noise.
- Stage 2 stresses traffic, saturation, and errors, including rate-limit patterns, so failures are loud and debuggable, not silent.
- Stage 3 unifies signals for multi-step AI reliability, AI-native security (misuse, prompt injection, multi-agent escalation), and accountability / economic control (who did what, spend tied to product and customer context).
- Stage 4 adds synthetic canaries, predictive cost/quality signals, intelligent fallbacks (including brownout routing), and AIOps-style assistance, always paired with human judgment.
- Observability is sociotechnical: integrate SRE habits with AI teams, shrink “shadow AI,” and build proactive culture (game days, blameless reviews, celebrating silent saves).
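To make the "logs with linkage metadata" idea concrete, here is a minimal sketch of a structured log record for a single model call. The field names (`trace_id`, `tokens`, `cost_usd`, `tool_calls`) are illustrative assumptions, not a schema from the chapter; the point is that every record carries the inputs, outputs, spend, and a trace link so it is never orphaned noise.

```python
import json
import time
import uuid

def log_model_call(model, prompt, response, prompt_tokens,
                   completion_tokens, cost_usd, trace_id, tool_calls=None):
    """Emit one structured, JSON-encoded log line for an AI model call.

    Field names are illustrative; adapt them to your logging pipeline.
    The trace_id is the linkage metadata that ties this record to the
    surrounding request trace.
    """
    record = {
        "event": "model_call",
        "timestamp": time.time(),
        "trace_id": trace_id,  # links this log to a trace span
        "model": model,
        "input": prompt,
        "output": response,
        "tokens": {"prompt": prompt_tokens, "completion": completion_tokens},
        "cost_usd": cost_usd,
        "tool_calls": tool_calls or [],
    }
    return json.dumps(record)

# Hypothetical usage: one log line per model invocation.
line = log_model_call("example-model", "Hi", "Hello!", 3, 2, 0.00004,
                      trace_id=str(uuid.uuid4()))
```

In practice these lines would be shipped to a log aggregator and joined with traces and metrics by `trace_id`.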
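The Stage 2 point that unwatched rate limits become outages can be sketched as a sliding-window watcher that alerts *before* the limit is reached. The limit and warning ratio below are hypothetical values, not the chapter's numbers.

```python
import time
from collections import deque

class RateLimitWatch:
    """Track request timestamps and warn before a provider limit is hit.

    limit_per_minute and warn_ratio are illustrative; tune them to your
    provider's actual quota.
    """
    def __init__(self, limit_per_minute=60, warn_ratio=0.8):
        self.limit = limit_per_minute
        self.warn_at = limit_per_minute * warn_ratio
        self.events = deque()

    def record(self, now=None):
        """Register one request and prune events outside the 60s window."""
        now = now if now is not None else time.time()
        self.events.append(now)
        while self.events and self.events[0] < now - 60:
            self.events.popleft()
        return len(self.events)

    def status(self):
        """Return 'ok', 'warning' (near the limit), or 'throttled'."""
        n = len(self.events)
        if n >= self.limit:
            return "throttled"
        if n >= self.warn_at:
            return "warning"  # loud signal before the limit becomes an outage
        return "ok"
```

Emitting the `"warning"` state as a metric gives you an alert while there is still headroom, instead of discovering the quota from user-facing 429s.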
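The Stage 4 idea of intelligent fallbacks, including brownout routing, can be illustrated with a small routing function: rather than failing hard, the system degrades to a cheaper tier when latency or budget signals deteriorate. The model names, SLO threshold, and routing logic here are assumptions for illustration only.

```python
def route_request(primary_healthy, latency_ms, budget_remaining_usd,
                  latency_slo_ms=2000):
    """Pick a model tier; brownout degrades gracefully instead of failing.

    All names and thresholds are hypothetical:
      - "fallback-model": hard failover when the primary is down
      - "brownout-cheap-model": reduced quality to stay within SLO/budget
      - "primary-model": the normal path
    """
    if not primary_healthy:
        return "fallback-model"
    if latency_ms > latency_slo_ms or budget_remaining_usd <= 0:
        return "brownout-cheap-model"
    return "primary-model"
```

The design choice is that degradation is explicit and observable: each routing decision can itself be logged as a signal that the system is under stress.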
Principles from the chapter
- If users are the primary alerting signal, the AI system is already in trouble.
- Logs without context are noise.
- Unwatched rate limits will become tomorrow’s outages.
- The faster you can close the loop between failure and fix, the faster you can ship features.
- Many teams will build mediocre AI systems because of the ease of setup, while many fewer will build excellent AI systems because of the challenge of continual improvement.
- Since computers themselves cannot be held accountable, your observability system must be designed to identify which individuals are responsible for specific actions.
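The accountability principle above implies that every AI action carries the identity of a responsible person or team. A minimal sketch of such an audit entry, with field names that are illustrative assumptions rather than the chapter's schema:

```python
import json
import time

def audit_record(actor, action, resource, trace_id, cost_usd=0.0):
    """One accountability entry: who did what, to which resource, at
    what cost, linked to a trace.

    `actor` should be a human or service identity (e.g. an email or
    team name), never "the model" -- computers cannot be held
    accountable, so the record must point at a person.
    """
    return json.dumps({
        "actor": actor,
        "action": action,
        "resource": resource,
        "trace_id": trace_id,
        "cost_usd": cost_usd,
        "timestamp": time.time(),
    })

# Hypothetical usage: attribute a model invocation to its owner.
entry = audit_record("alice@example.com", "invoke_model",
                     "summarizer-v2", "trace-123", cost_usd=0.002)
```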