  • Governing AI in Production: The Complete AWS Playbook

    Functional AI is easy. Governed AI is hard. This three-part series covers everything in between.

    Over three consecutive weeks, I built a complete framework for running Claude Sonnet on Amazon Bedrock in production – not just deploying it.

    1. The foundations: Shields (guardrails) and Eyes (observability).
    2. The feedback loop: detecting problems, signalling the right team, acting intelligently, and learning over time.
    3. The stack: what all of this actually looks like in AWS.

    Contents

    1. The Shields: Guardrails for Amazon Bedrock
    2. The Eyes: Gen AI Observability on AWS
    3. Building the Feedback Loop: Detect → Signal → Act → Learn
    4. AI Production Monitoring: The Four-Layer AWS Stack
    5. The CloudWatch Queries That Power the Quality Dashboard
    6. Production Readiness — Five Criteria 

    Chapter 1: The Shields

    Guardrails as architectural decisions, not afterthoughts

    Enterprise AI adoption is accelerating across four major patterns: internal copilots and knowledge assistants, developer tools like Claude Code, public-facing chatbots, and applications embedding AI into workflows. But production AI is exposing a reality many organisations underestimated.

    Functional AI is easy. Governed AI is hard.

    Guardrails are evolving far beyond simple content moderation filters. On Amazon Bedrock, they now represent a full policy enforcement layer with six distinct control types — each addressing a different failure mode in production AI systems.

    The Six Guardrail Types

    Control type 1

    Content filters

    Block hate, insults, sexual content, violence, and misconduct, and detect prompt attacks, at inference time. Filter strength is configurable per category and per use case.

    Control type 2

    Topic denial

    Define topics the model should never engage with — competitors, financial advice, legal counsel — regardless of how the user frames the request.

    Control type 3

    Word filters

    Block or flag specific terms (brand names, regulatory language, internal project names, or a managed profanity list) in both input and output.

    Control type 4

    Sensitive info filters

    Detect and mask PII — credit cards, social security numbers, passport numbers — before they appear in model responses.

    Control type 5

    Contextual grounding

    Measure faithfulness between the model’s response and the retrieved knowledge source. The primary defence against RAG hallucination.

    Control type 6

    Automated reasoning

    Validate logical consistency and factual correctness against defined rules and policy documents — beyond simple content matching.

    Guardrails Are Policy, Not Configuration

    The key shift in thinking: guardrails are not a feature you turn on at the end of a project. They are an architectural decision made at the start, because each control type needs to be configured per use case, tuned against real traffic, and treated as living policy rather than a one-time checkbox.

    A customer service chatbot needs different topic denial rules than an internal HR assistant. A RAG-based knowledge system needs grounding checks tuned to its retrieval quality. A developer tool needs word filters calibrated differently than a public-facing product.
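
    A minimal sketch of what "configure per use case" might look like in Python. The dictionary keys follow the request shape of the Bedrock create_guardrail API as I understand it, so check them against the current API reference. The helper build_guardrail_request, the profile names, the topic definitions, and the thresholds are all illustrative assumptions, not recommended settings.

```python
# Illustrative only: names, topics and thresholds below are assumptions.
# Keys are intended to mirror the boto3 bedrock create_guardrail request.

def build_guardrail_request(name, denied_topics, blocked_words=(), grounding_threshold=None):
    """Build keyword arguments for one use-case-specific guardrail."""
    request = {
        "name": name,
        "blockedInputMessaging": "Sorry, I can't help with that request.",
        "blockedOutputsMessaging": "Sorry, I can't share that response.",
        "contentPolicyConfig": {
            "filtersConfig": [
                {"type": category, "inputStrength": "HIGH", "outputStrength": "HIGH"}
                for category in ("HATE", "VIOLENCE", "SEXUAL")
            ]
        },
        "topicPolicyConfig": {
            "topicsConfig": [
                {"name": topic, "definition": definition, "type": "DENY"}
                for topic, definition in denied_topics.items()
            ]
        },
        "sensitiveInformationPolicyConfig": {
            "piiEntitiesConfig": [
                {"type": "CREDIT_DEBIT_CARD_NUMBER", "action": "ANONYMIZE"},
                {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "ANONYMIZE"},
            ]
        },
    }
    if blocked_words:
        request["wordPolicyConfig"] = {
            "wordsConfig": [{"text": word} for word in blocked_words]
        }
    if grounding_threshold is not None:
        # Only RAG-style systems have a retrieved source to ground against.
        request["contextualGroundingPolicyConfig"] = {
            "filtersConfig": [{"type": "GROUNDING", "threshold": grounding_threshold}]
        }
    return request


customer_chatbot = build_guardrail_request(
    "customer-chatbot",
    denied_topics={"Financial advice": "Recommendations about investments or loans."},
    blocked_words=["ProjectAurora"],  # hypothetical internal project name
)
rag_assistant = build_guardrail_request(
    "knowledge-assistant",
    denied_topics={"Legal counsel": "Advice on legal disputes or contracts."},
    grounding_threshold=0.75,  # assumed starting point, tune against real traffic
)
```

    Keeping each profile as data makes the differences between use cases reviewable in a pull request, which fits the idea of guardrails as policy rather than a one-time setting.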

    Critical distinction

    Guardrails apply at the inference layer. In a multi-agent system, the orchestrator’s output becomes the sub-agent’s input prompt. A guardrail on the sub-agent’s Bedrock call doesn’t see what the orchestrator decided upstream. Policy enforcement must be designed for the chain, not just the leaf node.
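
    To make "designed for the chain" concrete, here is a rough sketch in Python. The check callable is a hypothetical stand-in for whatever policy evaluation you use at each hop (for example, a wrapper around a standalone guardrail evaluation call). run_chain, PolicyViolation, and the two agent callables are names invented for this illustration.

```python
from typing import Callable

# check(text, source) returns True when the text passes policy.
# source is "INPUT" or "OUTPUT"; the implementation is up to you.
PolicyCheck = Callable[[str, str], bool]


class PolicyViolation(Exception):
    """Raised when any hop in the chain fails its policy check."""


def run_chain(
    user_input: str,
    orchestrate: Callable[[str], str],
    sub_agent: Callable[[str], str],
    check: PolicyCheck,
) -> str:
    # Hop 1: the edge, where most teams already have a guardrail.
    if not check(user_input, "INPUT"):
        raise PolicyViolation("blocked at user -> orchestrator")

    plan = orchestrate(user_input)

    # Hop 2: the orchestrator's output is the sub-agent's prompt.
    # A guardrail attached only to the leaf call has no view of
    # what was decided here, so check it explicitly.
    if not check(plan, "INPUT"):
        raise PolicyViolation("blocked at orchestrator -> sub-agent")

    answer = sub_agent(plan)

    # Hop 3: the final response on its way back to the user.
    if not check(answer, "OUTPUT"):
        raise PolicyViolation("blocked at sub-agent -> user")
    return answer
```

    The point of the sketch is the middle check: without it, policy is only enforced at the first and last hop.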

    Chapter 1 — Key takeaways

    • Six guardrail types cover content, topics, words, PII, grounding, and reasoning
    • Configure per use case — defaults are starting points, not production settings
    • In multi-agent systems, enforce policy at every node, not just the edge
    • Treat guardrail configuration as living policy that must be reviewed as traffic evolves

    Chapter 2: The Eyes

    Gen AI Observability and why it’s categorically different from application monitoring

    Traditional application monitoring asks: is the service up, is latency acceptable, are errors within threshold? These questions are necessary but not sufficient for AI systems. The failure modes are probabilistic. Quality degradation is often invisible to latency and error-rate monitors. A model that drifts from its system prompt won’t throw a 500 error — it will quietly produce subtly wrong outputs at scale.

    A shield with no eyes is blind defence. Eyes with no shield are aware but defenceless. Together they form the complete production contract.

    What Gen AI Observability Covers

    Signal type 1

    Invocation logging

    Every prompt, response, token count, latency, model ID, and guardrail action — captured and routed to CloudWatch Logs and S3.
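
    Because invocation logging has to be switched on explicitly, it helps to keep the configuration in code. The sketch below only builds the payload; the keys follow the loggingConfig shape of the Bedrock put_model_invocation_logging_configuration API as I understand it, and the log group, role ARN, and bucket values are placeholders you would replace.

```python
def build_logging_config(log_group, role_arn, bucket, key_prefix="bedrock/invocations"):
    """Return a loggingConfig payload that routes invocation logs to
    both CloudWatch Logs and S3. All argument values are placeholders."""
    return {
        "cloudWatchConfig": {
            "logGroupName": log_group,
            "roleArn": role_arn,
        },
        "s3Config": {
            "bucketName": bucket,
            "keyPrefix": key_prefix,
        },
        # Capture prompt and response text; leave binary payloads off
        # until you have decided how to store them.
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }


logging_config = build_logging_config(
    log_group="/example/bedrock/invocations",                   # placeholder
    role_arn="arn:aws:iam::123456789012:role/example-logging",  # placeholder
    bucket="example-bedrock-invocation-logs",                   # placeholder
)

# Assumed usage with boto3 (not executed here):
#   boto3.client("bedrock").put_model_invocation_logging_configuration(
#       loggingConfig=logging_config)
```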

    Signal type 2

    CloudWatch metrics

    Invocation count, latency percentiles, error rates, throttle rates — the operational heartbeat of your AI system.

    Signal type 3

    Guardrail signals

    Intervention rates, blocked topic trends, filter trigger frequency — the policy health layer.

    Signal type 4

    Trace and span

    End-to-end visibility across agent steps — every LLM call, tool invocation, RAG retrieval, and memory access in sequence.

    AI Ops: The Difference Between Deployment and Production

    Productionising AI is categorically different from productionising traditional software. Quality degradation is often invisible without deliberate instrumentation. AI Ops is what makes the difference between a deployment — something running — and a production system — something you can trust, maintain, and iterate on safely.

    Amazon Bedrock AgentCore Observability, powered by AWS Distro for OpenTelemetry (ADOT), auto-instruments agents built on Strands, LangGraph, or CrewAI with zero code changes. Every LLM call, tool invocation, memory access, and session gets traced end-to-end and lands in CloudWatch.

    Chapter 2 — Key takeaways

    • Enable model invocation logging from day one — it’s off by default
    • Four signal types: invocation logs, CloudWatch metrics, guardrail signals, trace and span
    • ADOT auto-instruments Strands, LangGraph, and CrewAI agents without code changes
    • CloudWatch speaks OpenTelemetry — route to Datadog, Langfuse, or LangSmith without re-instrumenting

    Chapter 3: Building the Feedback Loop

    Detect → Signal → Act → Learn

    Shields protect. Eyes observe. But production AI cannot stop at observation. The next maturity step is the feedback loop — a system that doesn’t just watch what’s happening, but converts observations into calibrated responses and feeds learning back into policy.

    Detect — Building the Evidence Trail

    Without a complete evidence trail, AI debugging becomes guesswork. Every interaction should capture:

    Identity layer

    System prompt version · Model version · Guardrail version · Knowledge source version

    Interaction layer

    User input · Model response · Input/output guardrail result · Final action taken

    Performance layer

    Latency · Token usage · Session context · Retry count

    Outcome layer

    User feedback · Escalation trigger · Downstream action · Tool call result
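
    One way to keep the four layers together is a single structured record per interaction. The sketch below is an assumption about how that record could look in Python. The field names and types are mine, chosen to mirror the layers above, and are not a Bedrock log schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceRecord:
    # Identity layer
    system_prompt_version: str
    model_version: str
    guardrail_version: str
    knowledge_source_version: str
    # Interaction layer
    user_input: str
    model_response: str
    input_guardrail_result: str
    output_guardrail_result: str
    final_action: str
    # Performance layer
    latency_ms: int
    input_tokens: int
    output_tokens: int
    session_id: str
    retry_count: int = 0
    # Outcome layer (often filled in later)
    user_feedback: Optional[str] = None
    escalated: bool = False
    downstream_action: Optional[str] = None
    tool_call_result: Optional[str] = None

    def to_log_line(self) -> str:
        """Serialise to one JSON line so log queries can filter on any field."""
        return json.dumps(asdict(self), sort_keys=True)


record = EvidenceRecord(
    system_prompt_version="sp-12",
    model_version="example-model-v1",
    guardrail_version="7",
    knowledge_source_version="kb-2024-05",
    user_input="What is our refund policy?",
    model_response="Refunds are available within 30 days.",
    input_guardrail_result="NONE",
    output_guardrail_result="NONE",
    final_action="answered",
    latency_ms=840,
    input_tokens=512,
    output_tokens=64,
    session_id="session-001",
)
log_line = record.to_log_line()
```

    Writing the version fields on every record is what later lets you tie an incident to the exact prompt and guardrail configuration that were live at the time.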

    Signal — Converting Logs Into Operational Intelligence

    Raw logs are not signals. A signal is a derived metric that carries operational meaning — something a human or automated system can act on. Five signals matter most in production:

    Signal | What it measures | When to compute
    Hallucination risk | Is the answer grounded in trusted knowledge, or unsupported? | Pre-release in lower environments and in production
    Prompt divergence | Did the response follow the system prompt (role, tone, refusal rules, citation rules)? | Pre-release and production monitoring
    Bias signal | Do synthetic tests show different outcomes by demographic group? | Pre-release testing with synthetic data
    Guardrail pressure | How often are guardrails triggered by user, session, topic, or time window? | Real-time in production
    Cost and abuse | Abnormal retries, token spikes, prompt attacks, long-running sessions | Real-time in production
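
    As an example of turning raw events into a signal, here is a small sketch of guardrail pressure per session over a sliding time window. The class name, the five-minute default window, and the event shape are assumptions for illustration. In production the events would come from your invocation logs rather than in-memory calls.

```python
from collections import deque


class GuardrailPressure:
    """Track the share of guardrail interventions per session
    within a sliding window of recent events."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._events = {}  # session_id -> deque of (timestamp, intervened)

    def record(self, session_id, timestamp, intervened):
        events = self._events.setdefault(session_id, deque())
        events.append((timestamp, bool(intervened)))
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while events and events[0][0] <= cutoff:
            events.popleft()

    def hit_rate(self, session_id):
        """Fraction of in-window calls where a guardrail intervened."""
        events = self._events.get(session_id)
        if not events:
            return 0.0
        hits = sum(1 for _, intervened in events if intervened)
        return hits / len(events)


pressure = GuardrailPressure(window_seconds=300)
for ts, intervened in [(0, True), (10, False), (20, True), (30, True)]:
    pressure.record("session-001", ts, intervened)
rate_during_burst = pressure.hit_rate("session-001")
```

    A per-session rate like this is what separates one user probing the system from a policy that is firing too often for everyone.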

    Act – Controlled Responses Per Signal

    Each signal should trigger a controlled action based on the detection environment and urgency. Detection without action is just expensive logging.

    Signal | Controlled response options
    Hallucination detected | Retrieve again → ask clarifying question → constrain the answer → escalate to human review
    Prompt divergence detected | Regenerate with stricter wrapper → use controlled response template → block the response
    Bias detected | Block release → review test cases → adjust prompt examples → update guardrails → escalate for human review
    High guardrail hit rate | Apply stricter guardrails → slow the session → limit tool access → trigger abuse handling
    Cost/abuse pattern | Dynamic rate limiting → reduce max tokens → shorten context → route to cheaper model
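
    One possible way to encode these responses is as an ordered ladder per signal, where each repeated detection moves one step further along. Reading the arrows as an escalation order is my interpretation, and the signal keys, action names, and next_action helper are hypothetical.

```python
# Ordered response ladders, one per signal. Each action name is a label
# your own handlers would implement; nothing here calls AWS.
RESPONSE_LADDER = {
    "hallucination": [
        "retrieve_again",
        "ask_clarifying_question",
        "constrain_answer",
        "escalate_to_human_review",
    ],
    "prompt_divergence": [
        "regenerate_with_stricter_wrapper",
        "use_controlled_response_template",
        "block_response",
    ],
    "bias": [
        "block_release",
        "review_test_cases",
        "adjust_prompt_examples",
        "update_guardrails",
        "escalate_to_human_review",
    ],
    "guardrail_pressure": [
        "apply_stricter_guardrails",
        "slow_session",
        "limit_tool_access",
        "trigger_abuse_handling",
    ],
    "cost_abuse": [
        "dynamic_rate_limit",
        "reduce_max_tokens",
        "shorten_context",
        "route_to_cheaper_model",
    ],
}


def next_action(signal, previous_attempts):
    """Pick the next response for a signal. Once the ladder is
    exhausted, keep returning its final (strongest) step."""
    ladder = RESPONSE_LADDER[signal]
    step = min(previous_attempts, len(ladder) - 1)
    return ladder[step]
```

    For example, next_action("hallucination", 0) selects the retrieval retry, while a third consecutive detection in the same session would already be at "constrain_answer".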

    Learn — Closing the Loop

    The feedback loop only earns its name if observation changes behaviour. The Eyes calibrate the Shields. Hit rate tells you if your guardrails are under-tuned or over-tuned. System prompt divergence tells you if they’re being bypassed. Each signal should feed back into policy calibration — otherwise you’re reacting without learning.

    This means maintaining version history for system prompts and guardrail configurations, tagging every incident to the policy version active at the time, and reviewing calibration on a cadence — not just when something breaks.
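
    A small sketch of why the version tag matters: once every record carries the guardrail version that was active, comparing hit rates between versions is a simple group-by. The record shape (guardrail_version and intervened keys) and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def hit_rate_by_version(records):
    """Return {guardrail_version: fraction of calls with an intervention}."""
    counts = defaultdict(lambda: {"total": 0, "hits": 0})
    for record in records:
        bucket = counts[record["guardrail_version"]]
        bucket["total"] += 1
        if record["intervened"]:
            bucket["hits"] += 1
    return {
        version: bucket["hits"] / bucket["total"]
        for version, bucket in counts.items()
    }


# Made-up sample: version 7 tightened a topic rule after version 6.
sample_records = [
    {"guardrail_version": "6", "intervened": False},
    {"guardrail_version": "6", "intervened": True},
    {"guardrail_version": "6", "intervened": False},
    {"guardrail_version": "6", "intervened": False},
    {"guardrail_version": "7", "intervened": True},
    {"guardrail_version": "7", "intervened": True},
]
rates = hit_rate_by_version(sample_records)
```

    A jump in hit rate straight after a version change points at the policy edit. A jump with no version change points at the traffic.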

    Chapter 3 — Key takeaways

    • Build the evidence trail from day one — version every component that influences model behaviour
    • Five signals: hallucination, prompt divergence, bias, guardrail pressure, cost/abuse
    • Signals 1–3 can be computed pre-release in lower environments; signals 4–5 are real-time production signals
    • Each signal needs a pre-defined controlled response — automated where possible, human escalation where not
    • The loop only closes when observation feeds back into policy calibration

    Chapter 4: AI Production Monitoring

    The four-layer AWS stack and the CloudWatch queries that make it real

    There’s a critical distinction most teams miss when they move from pilot to production. AI monitoring asks: is the model running? AI observability asks: is the model behaving the way it was approved to behave? Those are not the same question — and in 2026, the second question is the one that matters to your risk team, compliance function, and board.

    The Four-Layer AWS Stack

    Layer 1

    Amazon Bedrock — inference + guardrails

    Every inference call passes through content filters, grounding checks, automated reasoning, and topic denial. These aren’t just safety controls — they’re your primary signal source. The namespace for guardrail metrics in CloudWatch is AWS/Bedrock/Guardrails.

    Layer 2

    AgentCore observability

    AWS Distro for OpenTelemetry (ADOT) auto-instruments agents — Strands, LangGraph, CrewAI — with zero code changes. Every LLM call, tool invocation, memory access, and session gets traced end-to-end and emits telemetry in OTEL-compatible format.

    Layer 3

    CloudWatch GenAI dashboards

    Two pre-built views out of the box: Model Invocations (token usage, latency, error rates) and Bedrock AgentCore (agent fleet, sessions, traces). Add a third — a quality dashboard — for the signals that matter operationally: guardrail hit rate, hallucination trend, system prompt divergence.

    Layer 4

    OTEL ecosystem integrations

    Because CloudWatch speaks OpenTelemetry, you can route telemetry to Datadog, Langfuse, LangSmith, or Arize without re-instrumenting. NTT DATA’s production deployment — pairing AgentCore with Datadog LLM Observability via OTEL — is the clearest enterprise example currently documented.

    Chapter 5: The CloudWatch Queries That Power the Quality Dashboard

    The first two dashboards come pre-built. The quality dashboard you have to build yourself — but the raw material is already in your CloudWatch Logs once model invocation logging is enabled. Here are the four queries that matter most.

    1. Guardrail hit rate

    The ratio that tells you if your policies are being tested — and whether your guardrail configuration is calibrated to your actual traffic.

    CloudWatch Logs Insights — Guardrail hit rate

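    # Requires: Model invocation logging enabled → CloudWatch Logs
    fields @timestamp,
      output.outputBodyJson.`amazon-bedrock-guardrailAction` as action
    # Filter before aggregating so only guardrail-evaluated calls are counted
    | filter ispresent(action)
    # strcontains returns 1 or 0, so its sum is the number of interventions
    | fields strcontains(action, "INTERVENED") as is_intervened
    | stats
      count(*) as total,
      sum(is_intervened) as intervened,
      sum(is_intervened) / count(*) * 100 as hit_rate_pct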
    Set a CloudWatch alarm on the InvocationsIntervened metric in the AWS/Bedrock/Guardrails namespace so that it fires when the hit rate exceeds your defined threshold. That alarm is the trigger for your response playbook.
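
    Since the threshold is on a rate rather than a raw count, one way to express it is a metric-math alarm that divides interventions by total guardrail invocations. The sketch below only builds the arguments for CloudWatch's put_metric_alarm. The alarm name, evaluation settings, and especially the GuardrailArn dimension are assumptions to verify against the metrics your account actually emits.

```python
def build_hit_rate_alarm(guardrail_arn, threshold_pct, sns_topic_arn, period_seconds=300):
    """Return keyword arguments for a metric-math alarm on guardrail hit rate."""

    def metric(metric_id, metric_name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock/Guardrails",
                    "MetricName": metric_name,
                    # Assumed dimension; check the dimensions on your metrics.
                    "Dimensions": [{"Name": "GuardrailArn", "Value": guardrail_arn}],
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
            "ReturnData": False,  # inputs to the expression only
        }

    return {
        "AlarmName": "guardrail-hit-rate-high",  # hypothetical name
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": 3,
        "Threshold": threshold_pct,
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
        "Metrics": [
            metric("intervened", "InvocationsIntervened"),
            metric("total", "Invocations"),
            {
                "Id": "hit_rate",
                "Expression": "100 * intervened / total",
                "Label": "Guardrail hit rate (%)",
                "ReturnData": True,  # the alarm evaluates this series
            },
        ],
    }


alarm_kwargs = build_hit_rate_alarm(
    guardrail_arn="arn:aws:bedrock:us-east-1:123456789012:guardrail/example",  # placeholder
    threshold_pct=5.0,  # assumed threshold
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:ai-ops-alerts",  # placeholder
)

# Assumed usage with boto3 (not executed here):
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)
```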

    2. Interventions by topic

    Which policy is firing most — and whether you have a content problem or a prompt-injection problem. They need different responses.

    CloudWatch Logs Insights — Interventions by category

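    # Assumes each log event carries a category field (for example,
    # written by your application's structured logging); adjust the
    # field name and values to match your own schema
    fields @timestamp, @message
    | filter category = "financial_advice"
      or category = "competitor_mention"
      or category = "pii_detected"
    | stats count(*) as hits by category
    | sort hits desc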
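
    To feed a dashboard or a scheduled job, queries like the one above can be run programmatically. The sketch below takes the CloudWatch Logs client as a parameter (in practice a boto3 "logs" client) and uses its start_query and get_query_results calls. The function name, polling defaults, and error handling are my own choices.

```python
import time


def run_insights_query(logs_client, log_group, query, start_epoch, end_epoch,
                       poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query and return rows as plain dicts.

    logs_client is expected to behave like boto3.client("logs").
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,
        endTime=end_epoch,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row arrives as a list of {"field": ..., "value": ...} cells.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)

    raise TimeoutError(f"query {query_id} did not complete in time")
```

    Passing the client in keeps the function easy to test with a stub, and avoids binding the sketch to one region or credential setup.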

    3. System prompt divergence

    Catching instances where the model produces unexpected stop conditions outside the normal guardrail flow — a signal that something is drifting from intended behaviour.

    CloudWatch Logs Insights — Prompt divergence indicator

    # Flags responses where stop_reason is neither 
    # end_turn nor guardrail-triggered — unexpected
    # model behaviour indicator
    fields @timestamp,
    input.inputBodyJson.system as system_prompt,
    output.outputBodyJson.`amazon-bedrock-guardrailAction` as guardrail,
    output.outputBodyJson.stop_reason as stop_reason
    | filter guardrail = "NONE"
    and stop_reason != "end_turn"
    | stats count(*) as anomalies by bin(1h)
    | sort @timestamp desc

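    Spikes in this query warrant immediate review: something is producing unexpected stop conditions outside the normal guardrail flow. Legitimate stop reasons such as max_tokens, stop_sequence, and tool_use also match this filter, so exclude the ones your application expects before alerting on it.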

    4. Token cost anomaly detection

    Multi-agent loops that go wrong don’t just produce bad outputs – they produce expensive ones. Token anomaly detection is your early warning system before the bill arrives.

    CloudWatch Logs Insights — Token anomaly detection

    # Token counts sit under input and output in the invocation log schema
    fields @timestamp, modelId,
      (input.inputTokenCount + output.outputTokenCount) as totalTokens
    | stats
      avg(totalTokens) as avg_tokens,
      max(totalTokens) as max_tokens,
      stddev(totalTokens) as stddev_tokens
      by bin(1h), modelId
    | sort avg_tokens desc
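
    The query above produces hourly statistics, but it does not decide what counts as an anomaly. A simple option, sketched here under my own assumptions, is to compare each hour against the mean and standard deviation of the hours before it and flag large positive deviations. The threshold of 3 and the minimum baseline of three hours are illustrative defaults.

```python
from statistics import mean, stdev


def flag_token_anomalies(hourly_totals, z_threshold=3.0, min_baseline=3):
    """Return indexes of hours whose token total is far above the
    trailing baseline formed by all earlier hours."""
    flagged = []
    for i in range(min_baseline, len(hourly_totals)):
        baseline = hourly_totals[:i]
        spread = stdev(baseline)
        if spread == 0:
            continue  # perfectly flat baseline, no meaningful z-score
        z_score = (hourly_totals[i] - mean(baseline)) / spread
        if z_score > z_threshold:
            flagged.append(i)
    return flagged


# Made-up hourly totals: five ordinary hours, then a runaway agent loop.
hourly_tokens = [1000, 1100, 950, 1050, 1000, 9000]
anomalous_hours = flag_token_anomalies(hourly_tokens)
```

    Using only earlier hours as the baseline keeps a large spike from inflating its own standard deviation, which is what happens if you score every hour against the whole series.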

    Honest limits

    AgentCore Memory, Gateway, Identity, and Built-in Tools don’t yet surface in the unified GenAI Observability dashboard — you’ll access those metrics directly in CloudWatch. In deeply integrated multi-agent systems, you’re still stitching dashboards together rather than getting a true single pane of glass. Expect this to improve significantly as AWS matures the offering.

    Chapter 6 — Production Readiness

    Five Criteria

    If you can’t tick all five, you’re in an extended pilot — not production

    1. Invocation logging enabled and routed to CloudWatch. It’s off by default. Every prompt, response, and guardrail action should have a paper trail from day one.
    2. Alarms set on error rate, latency, and throttle. CloudWatch alarms on InvocationsIntervened, InvocationLatency, and InvocationThrottles, with SNS notifications to the right team.
    3. Guardrail hit rate tracked as a named metric. Not buried in logs. A dedicated metric, visible on a dashboard, with a threshold that triggers your response playbook.
    4. At least one quality signal automated. Manual review does not scale. Hallucination detection, prompt divergence monitoring, or bias signal computation should run automatically — not on request.
    5. A runbook exists for the three response actions. Rate limit, reroute to a different model, escalate to human review. Documented, tested, and owned by a named person before you go live.