Functional AI is easy. Governed AI is hard. This three-part series covers everything in between.
Over three consecutive weeks, I built a complete framework for running Claude Sonnet on Amazon Bedrock in production – not just deploying it.
- The foundations: Shields (guardrails) and Eyes (observability).
- Build a feedback loop that detects problems, signals the right team, acts intelligently, and learns over time.
- What does this stack actually look like in AWS?
Contents
- The Shields: Guardrails for Amazon Bedrock
- The Eyes: Gen AI Observability on AWS
- Building the Feedback Loop: Detect → Signal → Act → Learn
- AI Production Monitoring: The Four-Layer AWS Stack
- The CloudWatch Queries That Power the Quality Dashboard
- Production Readiness — Five Criteria
Chapter 1: The Shields
Guardrails as architectural decisions, not afterthoughts
Enterprise AI adoption is accelerating across four major patterns: internal copilots and knowledge assistants, developer tools like Claude Code, public-facing chatbots, and applications embedding AI into workflows. But production AI is exposing a reality many organisations underestimated.
Functional AI is easy. Governed AI is hard.
Guardrails are evolving far beyond simple content moderation filters. On Amazon Bedrock, they now represent a full policy enforcement layer with six distinct control types — each addressing a different failure mode in production AI systems.
The Six Guardrail Types
Control type 1
Content filters
Block hate, violence, sexual content, and profanity at inference time. Configurable thresholds per use case.
Control type 2
Topic denial
Define topics the model should never engage with — competitors, financial advice, legal counsel — regardless of how the user frames the request.
Control type 3
Word filters
Block or flag specific terms — brand names, regulatory language, internal project names — in both input and output.
Control type 4
Sensitive info filters
Detect and mask PII — credit cards, social security numbers, passport numbers — before they appear in model responses.
Control type 5
Contextual grounding
Measure faithfulness between the model’s response and the retrieved knowledge source. The primary defence against RAG hallucination.
Control type 6
Automated reasoning
Validate logical consistency and factual correctness against defined rules and policy documents — beyond simple content matching.
Guardrails Are Policy, Not Configuration
The key shift in thinking: guardrails are not a feature you turn on at the end of a project. They are an architectural decision made at the start, because each control type needs to be configured per use case, tuned against real traffic, and treated as living policy rather than a one-time checkbox.
A customer service chatbot needs different topic denial rules than an internal HR assistant. A RAG-based knowledge system needs grounding checks tuned to its retrieval quality. A developer tool needs word filters calibrated differently than a public-facing product.
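To make "configure per use case" concrete, here is a minimal sketch that keeps one profile per application and assembles the guardrail request from it. The profile names, topics, strengths, and thresholds are my own illustrative assumptions, and the request keys follow my understanding of the boto3 `create_guardrail` parameters, so check them against the current API reference before relying on them.

```python
from typing import Any

# Hypothetical per-use-case profiles; topics and thresholds are illustrative only.
PROFILES: dict[str, dict[str, Any]] = {
    "customer_service_chatbot": {
        "denied_topics": {
            "financial_advice": "Recommendations about investments, loans or tax.",
            "competitors": "Comparisons with or opinions about competing products.",
        },
        "filter_strength": "HIGH",
        "grounding_threshold": None,  # assumption: no retrieval step in this profile
    },
    "rag_knowledge_assistant": {
        "denied_topics": {"legal_counsel": "Advice on legal rights or disputes."},
        "filter_strength": "MEDIUM",
        "grounding_threshold": 0.75,
    },
}


def build_guardrail_request(use_case: str) -> dict[str, Any]:
    """Assemble keyword arguments for a guardrail-creation call from one profile."""
    profile = PROFILES[use_case]
    strength = profile["filter_strength"]
    request: dict[str, Any] = {
        "name": f"{use_case}-guardrail",
        "blockedInputMessaging": "Sorry, I can't help with that request.",
        "blockedOutputsMessaging": "Sorry, I can't share that response.",
        "contentPolicyConfig": {"filtersConfig": [
            {"type": t, "inputStrength": strength, "outputStrength": strength}
            for t in ("HATE", "VIOLENCE", "SEXUAL")
        ]},
        "topicPolicyConfig": {"topicsConfig": [
            {"name": name, "definition": definition, "type": "DENY"}
            for name, definition in profile["denied_topics"].items()
        ]},
        "sensitiveInformationPolicyConfig": {"piiEntitiesConfig": [
            {"type": "CREDIT_DEBIT_CARD_NUMBER", "action": "ANONYMIZE"},
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "ANONYMIZE"},
        ]},
    }
    # Grounding checks only make sense where there is a retrieved source to compare against.
    if profile["grounding_threshold"] is not None:
        request["contextualGroundingPolicyConfig"] = {"filtersConfig": [
            {"type": "GROUNDING", "threshold": profile["grounding_threshold"]}
        ]}
    return request
```

Keeping the profiles in version control next to the application code is one way to treat them as policy: a threshold change then goes through review like any other change.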
Critical distinction
Guardrails apply at the inference layer. In a multi-agent system, the orchestrator’s output becomes the sub-agent’s input prompt. A guardrail on the sub-agent’s Bedrock call doesn’t see what the orchestrator decided upstream. Policy enforcement must be designed for the chain, not just the leaf node.
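One way to picture enforcement "for the chain" is a runner that checks the text before and after every hop, not only at the final call. This is a sketch under my own assumptions: `run_chain`, `PolicyCheck`, and `PolicyViolation` are hypothetical names, and in a real system the `check` callable might wrap a guardrail evaluation such as Bedrock's ApplyGuardrail API.

```python
from typing import Callable

# A policy check takes text and returns True when the text is allowed.
PolicyCheck = Callable[[str], bool]
# An agent step takes a prompt and returns its output.
AgentStep = Callable[[str], str]


class PolicyViolation(Exception):
    """Raised when text fails the policy check at a given hop."""


def run_chain(prompt: str, steps: list[tuple[str, AgentStep]], check: PolicyCheck) -> str:
    """Run agent steps in order, checking the text before and after every hop."""
    text = prompt
    for name, step in steps:
        if not check(text):
            raise PolicyViolation(f"input to {name} rejected")
        text = step(text)
        if not check(text):
            raise PolicyViolation(f"output of {name} rejected")
    return text
```

With this shape, an orchestrator that produces a disallowed instruction is stopped at its own output, before the sub-agent ever receives it as a prompt.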
Chapter 1 — Key takeaways
- Six guardrail types cover content, topics, words, PII, grounding, and reasoning
- Configure per use case — defaults are starting points, not production settings
- In multi-agent systems, enforce policy at every node, not just the edge
- Treat guardrail configuration as living policy that must be reviewed as traffic evolves
Chapter 2: The Eyes
Gen AI Observability and why it’s categorically different from application monitoring
Traditional application monitoring asks: is the service up, is latency acceptable, are errors within threshold? These questions are necessary but not sufficient for AI systems. The failure modes are probabilistic. Quality degradation is often invisible to latency and error-rate monitors. A model that drifts from its system prompt won’t throw a 500 error — it will quietly produce subtly wrong outputs at scale.
A shield with no eyes is blind defence. Eyes with no shield are aware but defenceless. Together they form the complete production contract.
What Gen AI Observability Covers
Signal type 1
Invocation logging
Every prompt, response, token count, latency, model ID, and guardrail action — captured and routed to CloudWatch Logs and S3.
Signal type 2
CloudWatch metrics
Invocation count, latency percentiles, error rates, throttle rates — the operational heartbeat of your AI system.
Signal type 3
Guardrail signals
Intervention rates, blocked topic trends, filter trigger frequency — the policy health layer.
Signal type 4
Trace and span
End-to-end visibility across agent steps — every LLM call, tool invocation, RAG retrieval, and memory access in sequence.
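Invocation logging is the signal the other three build on, so it is worth showing what turning it on involves. The sketch below only builds the configuration dictionary; the key names reflect my understanding of the `loggingConfig` argument to boto3's `put_model_invocation_logging_configuration`, and the function name and default prefix are assumptions of mine.

```python
from typing import Any, Optional


def build_logging_config(log_group: str, role_arn: str,
                         bucket: Optional[str] = None,
                         key_prefix: str = "bedrock-invocations") -> dict[str, Any]:
    """Build a model invocation logging config targeting CloudWatch Logs and, optionally, S3."""
    config: dict[str, Any] = {
        "cloudWatchConfig": {"logGroupName": log_group, "roleArn": role_arn},
        "textDataDeliveryEnabled": True,        # prompts and responses
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
    if bucket:
        config["s3Config"] = {"bucketName": bucket, "keyPrefix": key_prefix}
    return config


# Assumed usage (requires boto3 and AWS credentials, so it is not executed here):
#   boto3.client("bedrock").put_model_invocation_logging_configuration(
#       loggingConfig=build_logging_config("/bedrock/invocations", role_arn))
```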
AI Ops: The Difference Between Deployment and Production
Productionising AI is categorically different from productionising traditional software. Quality degradation is often invisible without deliberate instrumentation. AI Ops is what makes the difference between a deployment — something running — and a production system — something you can trust, maintain, and iterate on safely.
Amazon Bedrock AgentCore Observability, powered by AWS Distro for OpenTelemetry (ADOT), auto-instruments agents built on Strands, LangGraph, or CrewAI with zero code changes. Every LLM call, tool invocation, memory access, and session gets traced end-to-end and lands in CloudWatch.
Chapter 2 — Key takeaways
- Enable model invocation logging from day one — it’s off by default
- Four signal types: invocation logs, CloudWatch metrics, guardrail signals, trace and span
- ADOT auto-instruments Strands, LangGraph, and CrewAI agents without code changes
- CloudWatch speaks OpenTelemetry — route to Datadog, Langfuse, or LangSmith without re-instrumenting
Chapter 3: Building the Feedback Loop
Detect → Signal → Act → Learn
Shields protect. Eyes observe. But production AI cannot stop at observation. The next maturity step is the feedback loop — a system that doesn’t just watch what’s happening, but converts observations into calibrated responses and feeds learning back into policy.
Detect — Building the Evidence Trail
Without a complete evidence trail, AI debugging becomes guesswork. Every interaction should capture:
Identity layer
System prompt version · Model version · Guardrail version · Knowledge source version
Interaction layer
User input · Model response · Input/output guardrail result · Final action taken
Performance layer
Latency · Token usage · Session context · Retry count
Outcome layer
User feedback · Escalation trigger · Downstream action · Tool call result
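The four layers above map naturally onto a single structured record per interaction. Here is a minimal sketch; the class and field names are my own, chosen to mirror the list, and are not a schema that Bedrock emits.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvidenceRecord:
    # Identity layer: which versions of everything produced this answer
    system_prompt_version: str
    model_version: str
    guardrail_version: str
    knowledge_source_version: str
    # Interaction layer
    user_input: str
    model_response: str
    input_guardrail_result: str
    output_guardrail_result: str
    final_action: str
    # Performance layer
    latency_ms: float
    input_tokens: int
    output_tokens: int
    session_id: str
    retry_count: int = 0
    # Outcome layer: usually filled in after the fact
    user_feedback: Optional[str] = None
    escalation_trigger: Optional[str] = None
    downstream_action: Optional[str] = None
    tool_call_result: Optional[str] = None

    def to_log_line(self) -> str:
        """Serialise the record as one JSON line, ready for a structured log."""
        return json.dumps(asdict(self), sort_keys=True)
```

Writing one JSON line per interaction keeps the record queryable later, since every layer becomes a field you can filter or group on.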
Signal — Converting Logs Into Operational Intelligence
Raw logs are not signals. A signal is a derived metric that carries operational meaning — something a human or automated system can act on. Five signals matter most in production:
| Signal | What it measures | When to compute |
|---|---|---|
| Hallucination risk | Is the answer grounded in trusted knowledge, or unsupported? | Pre-release in lower environments and in production |
| Prompt divergence | Did the response follow the system prompt (role, tone, refusal rules, citation rules)? | Pre-release and production monitoring |
| Bias signal | Do synthetic tests show different outcomes by demographic group? | Pre-release testing with synthetic data |
| Guardrail pressure | How often are guardrails triggered by user, session, topic, or time window? | Real-time in production |
| Cost and abuse | Abnormal retries, token spikes, prompt attacks, long-running sessions | Real-time in production |
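As an example of turning raw logs into a signal, guardrail pressure per session can be derived from nothing more than a stream of (session, intervened) pairs. This is a sketch with a hypothetical function name and input shape; how you extract those pairs from your logs is left open.

```python
from collections import defaultdict
from typing import Iterable


def guardrail_pressure(events: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Return, per session, the share of invocations where a guardrail intervened."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for session_id, intervened in events:
        totals[session_id] += 1
        if intervened:
            hits[session_id] += 1
    return {session: hits[session] / totals[session] for session in totals}


events = [("session-a", True), ("session-a", False), ("session-b", False)]
print(guardrail_pressure(events))
```

The same grouping works by user, topic, or time window: only the key in the tuple changes.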
Act – Controlled Responses Per Signal
Each signal should trigger a controlled action based on the detection environment and urgency. Detection without action is just expensive logging.
| Signal | Controlled response options |
|---|---|
| Hallucination detected | Retrieve again → ask clarifying question → constrain the answer → escalate to human review |
| Prompt divergence detected | Regenerate with stricter wrapper → use controlled response template → block the response |
| Bias detected | Block release → review test cases → adjust prompt examples → update guardrails → escalate for human review |
| High guardrail hit rate | Apply stricter guardrails → slow the session → limit tool access → trigger abuse handling |
| Cost/abuse pattern | Dynamic rate limiting → reduce max tokens → shorten context → route to cheaper model |
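Reading each row of the table as an escalation ladder, the controlled response can be picked by how many attempts have already been made for that signal. The sketch below is one possible encoding; the action identifiers and the `next_action` helper are my own naming, and treating the arrows as a strict order is an assumption.

```python
# Ordered response options per signal, taken from the table above.
RESPONSE_LADDER: dict[str, list[str]] = {
    "hallucination": [
        "retrieve_again", "ask_clarifying_question",
        "constrain_answer", "escalate_to_human_review",
    ],
    "prompt_divergence": [
        "regenerate_with_stricter_wrapper", "use_controlled_template", "block_response",
    ],
    "bias": [
        "block_release", "review_test_cases", "adjust_prompt_examples",
        "update_guardrails", "escalate_to_human_review",
    ],
    "guardrail_pressure": [
        "apply_stricter_guardrails", "slow_session",
        "limit_tool_access", "trigger_abuse_handling",
    ],
    "cost_abuse": [
        "dynamic_rate_limit", "reduce_max_tokens",
        "shorten_context", "route_to_cheaper_model",
    ],
}


def next_action(signal: str, attempts_so_far: int) -> str:
    """Return the next response for a signal, staying on the last rung once it is reached."""
    ladder = RESPONSE_LADDER[signal]
    return ladder[min(attempts_so_far, len(ladder) - 1)]
```

For example, `next_action("hallucination", 0)` returns `"retrieve_again"`, and any attempt count past the end of the ladder returns `"escalate_to_human_review"`.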
Learn — Closing the Loop
The feedback loop only earns its name if observation changes behaviour. The Eyes calibrate the Shields. Hit rate tells you if your guardrails are under-tuned or over-tuned. System prompt divergence tells you if they’re being bypassed. Each signal should feed back into policy calibration — otherwise you’re reacting without learning.
This means maintaining version history for system prompts and guardrail configurations, tagging every incident to the policy version active at the time, and reviewing calibration on a cadence — not just when something breaks.
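Tagging an incident to the policy version active at the time is a simple lookup once the version history is kept as an ordered list. A minimal sketch, with made-up dates and version labels:

```python
from bisect import bisect_right
from datetime import datetime

# (activation time, version label), sorted by activation time. Values are illustrative.
GUARDRAIL_HISTORY: list[tuple[datetime, str]] = [
    (datetime(2026, 1, 5), "guardrail-v1"),
    (datetime(2026, 2, 10), "guardrail-v2"),
    (datetime(2026, 3, 1), "guardrail-v3"),
]


def version_active_at(history: list[tuple[datetime, str]], when: datetime) -> str:
    """Return the version whose activation time is the latest one not after `when`."""
    activation_times = [activated for activated, _ in history]
    index = bisect_right(activation_times, when) - 1
    if index < 0:
        raise ValueError("no version was active at that time")
    return history[index][1]
```

The same function works for system prompt history; an incident record then carries both labels, which is what makes a later calibration review possible.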
Chapter 3 — Key takeaways
- Build the evidence trail from day one — version every component that influences model behaviour
- Five signals: hallucination, prompt divergence, bias, guardrail pressure, cost/abuse
- Signals 1–3 can be computed pre-release in lower environments; 4–5 are real-time production signals
- Each signal needs a pre-defined controlled response — automated where possible, human escalation where not
- The loop only closes when observation feeds back into policy calibration
AI Production Monitoring
The four-layer AWS stack and the CloudWatch queries that make it real
There’s a critical distinction most teams miss when they move from pilot to production. AI monitoring asks: is the model running? AI observability asks: is the model behaving the way it was approved to behave? Those are not the same question — and in 2026, the second question is the one that matters to your risk team, compliance function, and board.
The Four-Layer AWS Stack
Layer 1
Amazon Bedrock — inference + guardrails
Every inference call passes through content filters, grounding checks, automated reasoning, and topic denial. These aren’t just safety controls — they’re your primary signal source. The namespace for guardrail metrics in CloudWatch is AWS/Bedrock/Guardrails.
Layer 2
AgentCore observability
AWS Distro for OpenTelemetry (ADOT) auto-instruments agents — Strands, LangGraph, CrewAI — with zero code changes. Every LLM call, tool invocation, memory access, and session gets traced end-to-end and emits telemetry in OTEL-compatible format.
Layer 3
CloudWatch GenAI dashboards
Two pre-built views out of the box: Model Invocations (token usage, latency, error rates) and Bedrock AgentCore (agent fleet, sessions, traces). Add a third — a quality dashboard — for the signals that matter operationally: guardrail hit rate, hallucination trend, system prompt divergence.
Layer 4
OTEL ecosystem integrations
Because CloudWatch speaks OpenTelemetry, you can route telemetry to Datadog, Langfuse, LangSmith, or Arize without re-instrumenting. NTT DATA’s production deployment — pairing AgentCore with Datadog LLM Observability via OTEL — is the clearest enterprise example currently documented.
The CloudWatch Queries That Power the Quality Dashboard
The first two dashboards come pre-built. The quality dashboard you have to build yourself — but the raw material is already in your CloudWatch Logs once model invocation logging is enabled. Here are the four queries that matter most.
1. Guardrail hit rate
The ratio that tells you if your policies are being tested — and whether your guardrail configuration is calibrated to your actual traffic.
CloudWatch Logs Insights — Guardrail hit rate
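# Requires: Model invocation logging enabled → CloudWatch Logs
fields @timestamp,
output.outputBodyJson.`amazon-bedrock-guardrailAction` as action
# Filter before aggregating: action no longer exists after stats
| filter ispresent(action)
| stats
count(*) as total,
# strcontains returns 1 or 0, so the sum counts interventions
sum(strcontains(action, "INTERVENED")) as intervened,
sum(strcontains(action, "INTERVENED")) / count(*) * 100 as hit_rate_pct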
Set a CloudWatch alarm on the InvocationsIntervened metric in the AWS/Bedrock/Guardrails namespace so that it fires when interventions exceed your defined threshold. That alarm is the trigger for your response playbook.
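Because InvocationsIntervened is a count, alarming on a rate needs a second metric to divide by. The sketch below builds the parameters for a metric-math alarm that I would pass to boto3's `put_metric_alarm`. I am assuming the namespace also publishes an `Invocations` metric and that you supply whichever dimensions your guardrail metrics are actually published under; the alarm name, evaluation periods, and function name are illustrative.

```python
from typing import Any, Optional


def build_hit_rate_alarm(threshold_pct: float, sns_topic_arn: str,
                         dimensions: Optional[list[dict[str, str]]] = None,
                         period_s: int = 300) -> dict[str, Any]:
    """Build put_metric_alarm parameters for guardrail hit rate as a percentage."""

    def metric(metric_id: str, metric_name: str) -> dict[str, Any]:
        # Input series for the expression; not returned as alarm data itself.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock/Guardrails",
                    "MetricName": metric_name,
                    "Dimensions": dimensions or [],
                },
                "Period": period_s,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": "bedrock-guardrail-hit-rate",  # hypothetical name
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_pct,
        "EvaluationPeriods": 3,
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
        "Metrics": [
            metric("intervened", "InvocationsIntervened"),
            metric("total", "Invocations"),  # assumption: total guardrail invocations
            {
                "Id": "hit_rate",
                "Expression": "100 * intervened / total",
                "Label": "Guardrail hit rate (%)",
                "ReturnData": True,
            },
        ],
    }
```

Assumed usage, not executed here: `boto3.client("cloudwatch").put_metric_alarm(**build_hit_rate_alarm(5.0, topic_arn, dimensions))`.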
2. Interventions by topic
Which policy is firing most — and whether you have a content problem or a prompt-injection problem. They need different responses.
CloudWatch Logs Insights — Interventions by category
fields @timestamp, @message
| filter category = "financial_advice"
or category = "competitor_mention"
or category = "pii_detected"
| stats count(*) as hits by category
| sort hits desc
3. System prompt divergence
Catching instances where the model produces unexpected stop conditions outside the normal guardrail flow — a signal that something is drifting from intended behaviour.
CloudWatch Logs Insights — Prompt divergence indicator
# Flags responses where stop_reason is neither end_turn nor
# guardrail-triggered: an indicator of unexpected model behaviour
fields @timestamp,
input.inputBodyJson.system as system_prompt,
output.outputBodyJson.`amazon-bedrock-guardrailAction` as guardrail,
output.outputBodyJson.stop_reason as stop_reason
| filter guardrail = "NONE"
and stop_reason != "end_turn"
| stats count(*) as anomalies by bin(1h)
| sort @timestamp desc
Spikes in this query warrant immediate review – something is producing unexpected stop conditions outside normal guardrail flow.
4. Token cost anomaly detection
Multi-agent loops that go wrong don’t just produce bad outputs – they produce expensive ones. Token anomaly detection is your early warning system before the bill arrives.
CloudWatch Logs Insights — Token anomaly detection
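# Token counts sit under input/output in the invocation log record
fields @timestamp, modelId,
input.inputTokenCount as inputTokens, output.outputTokenCount as outputTokens,
(input.inputTokenCount + output.outputTokenCount) as totalTokens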
| stats
avg(totalTokens) as avg_tokens,
max(totalTokens) as max_tokens,
stddev(totalTokens) as stddev_tokens
by bin(1h), modelId
| sort avg_tokens desc
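The query gives you hourly averages and spread; deciding what counts as an anomaly is a separate step. One simple option, sketched here with my own function name and default thresholds, is to compare each hour against the mean and standard deviation of the hours just before it.

```python
from statistics import mean, stdev


def token_anomalies(hourly_totals: list[int], window: int = 5,
                    z_threshold: float = 3.0) -> list[int]:
    """Return indexes of hours whose total exceeds the trailing baseline by z_threshold sigmas."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a standard deviation")
    flagged = []
    for i in range(window, len(hourly_totals)):
        baseline = hourly_totals[i - window:i]  # the `window` hours before hour i
        limit = mean(baseline) + z_threshold * stdev(baseline)
        if hourly_totals[i] > limit:
            flagged.append(i)
    return flagged


print(token_anomalies([100, 110, 90, 105, 95, 1000]))  # prints [5]
```

Using a trailing window rather than the whole series matters: a single runaway hour inflates the overall standard deviation enough to hide itself.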
Honest limits
AgentCore Memory, Gateway, Identity, and Built-in Tools don’t yet surface in the unified GenAI Observability dashboard — you’ll access those metrics directly in CloudWatch. In deeply integrated multi-agent systems, you’re still stitching dashboards together rather than getting a true single pane of glass. Expect this to improve significantly as AWS matures the offering.
Chapter 6 — Production Readiness
Five Criteria
If you can’t tick all five, you’re in an extended pilot — not production
- Invocation logging enabled and routed to CloudWatch. It’s off by default. Every prompt, response, and guardrail action should have a paper trail from day one.
- Alarms set on error rate, latency, and throttle. CloudWatch alarms on InvocationsIntervened, InvocationLatency, and InvocationThrottles, with SNS notifications to the right team.
- Guardrail hit rate tracked as a named metric. Not buried in logs. A dedicated metric, visible on a dashboard, with a threshold that triggers your response playbook.
- At least one quality signal automated. Manual review does not scale. Hallucination detection, prompt divergence monitoring, or bias signal computation should run automatically — not on request.
- A runbook exists for the three response actions. Rate limit, reroute to a different model, escalate to human review. Documented, tested, and owned by a named person before you go live.