Audit and Compliance Playbook

AI Agent Audit Trail Playbook

15 questions · 45–60 min · Rav | @MrDecentralize

Most organizations log everything the agent does. Regulators ask why it decided what it decided. Those are different questions, and most audit trails only answer one of them. The gap between "logged" and "auditable" is where regulatory exposure lives.

Who this is for

Compliance officers, legal teams, risk managers, and security leaders deploying AI agents in regulated contexts — financial services, healthcare, insurance, legal — who need to prove decision rationale to regulators, auditors, or in litigation.

What this playbook maps

Five critical gaps in audit trails for non-deterministic systems:
  • Evidence Completeness — what the agent saw, what it was instructed to do, and which versions applied
  • Replay Capability — reconstructing decisions from retained evidence rather than inference
  • Rationale Capture — human-reviewable explanations of why the agent decided what it decided
  • Exception Detection and Handling — detecting and halting deviations from expected behavior
  • Review and Reconstruction — evidence that scales to regulatory sampling and independent review

Time required
  • Read-through: 30–35 min
  • Self-audit for one agent system: 45–60 min
  • Gap remediation: 4–8 weeks
Expected outcomes
  • Whether you can reconstruct agent decision rationale
  • What evidence regulators will ask for that you don't capture
  • The difference between "logged" and "auditable"
  • Prioritized list of evidence gaps to close
What this is not

This is an audit framework, not an implementation guide. It reveals what evidence you're missing but does not provide retention architectures, replay bundle formats, or tamper-evident logging designs. For implementation methodology with templates and validation tests, see the Audit Trail Implementation Playbook.

What actually breaks in production

I was reviewing an AI agent deployment for credit decisioning at a community bank. The agent automated initial credit assessments for small business loans. The compliance team was proud of their audit infrastructure. "Everything is logged. We're fully auditable."

They showed me their logging dashboard. Timestamps, API calls, database queries, user actions, agent responses. Every interaction captured. Retention policy aligned with regulatory requirements. Good. That's what you show IT auditors.

Then I asked a softball question: "A loan officer approved a $250K line of credit for Joe's Auto Repair last Tuesday based on the agent's recommendation. Regulatory exam is next month. Show me how you'd reconstruct why the agent recommended approval for that specific application."

The gap revealed

They pulled up the log viewer. "Here's the session. Loan officer submitted application ID 847392. Agent returned: Approve, $250K, 8.5% rate." Good. "Now show me why it recommended approval." They listed what the agent accessed and when. I asked for the actual prompt sent to the model. Not retained. The specific excerpts from the financial statements it analyzed. Not captured — only document IDs. The model version running at 2:47 PM on Tuesday. Not logged. The policy version active for that decision. Not tracked.

They'd built a system that proved the agent made decisions. Not a system that explained why. Their audit trail showed input received, data accessed, decision returned. It didn't show what the agent saw, how it interpreted it, or why it concluded this way.

Why it matters

That credit decisioning agent made 10,000+ lending decisions over 6 months. The bank could prove all 10,000 happened. They couldn't explain why any specific one was approved or denied. Regulators don't ask "did your agent make this decision?" They ask "why did your agent make this decision?" Deterministic systems can reproduce results from inputs. Probabilistic systems cannot — they must replay the specific reasoning that occurred at that moment. And none of that was captured.

How to use this playbook

  1. Read through all five sections first without answering. This builds the mental model of what audit evidence actually requires for non-deterministic systems.
  2. Select one agent system to audit. Pick a production or near-production deployment with regulatory implications.
  3. Answer each question honestly. If you are uncertain, that is a Partial or Gap — not a reason to skip. Uncertainty about a control is itself a gap.
  4. Review your gap score. The results panel generates after question 15 with prioritized gaps and next steps.
  5. Prioritize Section 1 gaps first. Without prompt retention, context retention, and version stamping, replay is impossible regardless of what else you have.
Answer questions to generate your gap score
Section 1 of 5 · 3 questions
Evidence Completeness
Do you capture the evidence needed to explain why, not just prove that it happened? Access logs and timestamps prove the agent ran. They don't prove what it saw, what it was instructed to do, or which version of the rules it applied.
Q1.1 Do you retain the exact prompt sent to the model for each decision — including dynamically generated content?

By "exact prompt": system prompt, user instruction, retrieved context included in prompt, tool call specifications, any dynamic content inserted at runtime, and the complete prompt string as sent to the model API. The common gap: logging the template but not the filled-in prompt. Dynamically generated sections vary per decision — the template documents intent, the actual prompt proves what the agent was instructed to do for that specific case. Without the exact prompt, you cannot prove what instructions the agent was operating under or verify the prompt didn't contain errors or unauthorized modifications.

Red flags
  • "We log that the agent ran, not the prompt"
  • "The prompt template is documented" — template is not the prompt
  • "We'd have to reconstruct the prompt from our code"
  • "Prompts are too large to store"
Gap identified. You cannot prove what instructions the agent was following for any specific decision. Regulatory inquiries cannot be answered with actual evidence — only with inference from templates and policy documents.
Q1.2 Do you retain the specific context the agent retrieved and analyzed — not just which systems it accessed?

Access logs show the agent queried Credit Bureau API at 3:42 PM and accessed financial statement documents 9284-FS-2023 and 9284-FS-2024. That proves access. What you need for auditability: the actual credit score returned, the specific paragraphs from the financial statements the agent analyzed, the API response content it reasoned over. Re-querying those systems now returns current data, not what the agent saw at decision time. Credit reports change. Financial statements are updated. The data the agent analyzed at 3:42 PM is not necessarily the data available today.
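
A sketch of context snapshotting under the same assumed `audit_store` sink — the stored content is whatever the agent actually received from the source, frozen at decision time:

```python
from datetime import datetime, timezone

def snapshot_retrieval(audit_store, decision_id: str, source: str,
                       query: str, content: str) -> None:
    """Store the content the agent reasoned over, not just the source ID.

    Re-querying `source` later may return different data; this preserves
    what was visible at the moment of decision.
    """
    audit_store.append({
        "decision_id": decision_id,
        "source": source,              # e.g. "credit_bureau_api" (illustrative)
        "query": query,
        "retrieved_content": content,  # the excerpt or API response the agent saw
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    })
```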

Red flags
  • "We log data source access, not retrieved content"
  • "We can re-query if needed" — re-query returns current data, not historical
  • "The data is still in the source systems"
  • "Retrieved context would be too much to store"
Gap identified. You cannot prove what data the agent actually analyzed. Re-querying source systems produces current data, not the specific content the agent saw at the moment of decision.
Q1.3 Do you log model version and policy version for each individual decision?

Version stamping covers: exact model name and version, model deployment timestamp, policy document version and effective date, prompt template version, and configuration state at decision time. The failure pattern: approval rate drops from 65% to 45% in March. Regulator asks why. You know rates changed but cannot show which version of the model or policy was running in February versus March, whether the change was a model update, a policy tightening, or a configuration change, or which specific decisions were made under which version. Without version stamps per decision, you cannot reconstruct causality for behavioral changes.
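
One way to attach version state to every individual decision record, sketched with illustrative field names:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class VersionStamp:
    """Configuration state recorded per decision, not per deployment."""
    model_name: str              # exact model name and version string
    model_deployed_at: str       # deployment timestamp
    policy_version: str          # e.g. "credit-policy-v3.2" (illustrative)
    policy_effective_date: str
    prompt_template_version: str
    config_hash: str             # hash of runtime configuration at decision time

def stamp_decision(record: dict, stamp: VersionStamp) -> dict:
    """Merge the version stamp into a decision record before it is persisted."""
    return {**record, "versions": asdict(stamp)}
```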

Red flags
  • "We know what model we're using now"
  • "Policy is documented but not versioned per decision"
  • "We'd have to check our deployment history"
  • "Version information isn't in the logs"
Gap identified. You cannot prove what model or policy was active for historical decisions. Cannot explain why decision patterns changed over time or verify consistent application of policy across periods.

Implementation methodology for closing evidence completeness gaps is covered in the Audit Trail Implementation Playbook — prompt retention architecture, context capture design, version stamping schemas, and retention policies that meet regulatory timeframes.

Join the waitlist for implementation access →
Section 2 of 5 · 3 questions
Replay Capability
Can you reconstruct the decision, or can you only prove it occurred? Reconstruction from policy and source data is inference. Replay from retained evidence is proof. Regulators and courts treat these differently.
Q2.1 For any past decision, can you produce a replay bundle — a single exportable package containing everything the agent saw and reasoned with?

A replay bundle contains: decision ID and timestamp, complete prompt, retrieved context, model and policy versions, tool calls and responses, intermediate reasoning steps, and final decision with rationale. The test: a regulator requests decision packets for 20 sampled applications. Without replay capability, each one requires 5–8 hours to reconstruct — pulling logs from multiple systems, inferring reasoning from available data, writing narrative explanations. The result is inference presented as evidence. With replay capability: query the audit system with decision IDs, export bundles, 15 minutes total. The difference is not just efficiency — it is the difference between "this is what happened" and "this is what we think happened."
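
A hedged sketch of bundle assembly, assuming a hypothetical `audit_store.fetch` query keyed by decision ID; the field list mirrors the bundle contents described above:

```python
import json

REPLAY_FIELDS = (
    "decision_id", "timestamp", "prompt_messages", "retrieved_context",
    "versions", "tool_calls", "reasoning_steps", "decision", "rationale",
)

def export_replay_bundle(audit_store, decision_id: str) -> str:
    """Assemble one exportable package for a single decision.

    `audit_store.fetch` is assumed to return the merged record for all
    entries sharing this decision's correlation ID.
    """
    record = audit_store.fetch(decision_id=decision_id)
    bundle = {field: record.get(field) for field in REPLAY_FIELDS}
    missing = [f for f in REPLAY_FIELDS if bundle[f] is None]
    if missing:
        raise ValueError(f"Replay bundle incomplete for {decision_id}: {missing}")
    return json.dumps(bundle, indent=2, ensure_ascii=False)
```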

Red flags
  • "We'd have to piece it together from multiple systems"
  • "We can reconstruct what probably happened"
  • "It takes days to respond to decision sampling requests"
  • "We explain what the agent should have done based on policy"
Gap identified. Regulatory inquiries require manual reconstruction. You're explaining what should have happened, not proving what did happen. This distinction matters in regulatory proceedings and litigation.
Q2.2 Do you use correlation IDs that link all evidence for a single decision across every system that touched it?

A correlation ID is a unique identifier per decision, present in all relevant log entries, linking prompt logs, context retrieval logs, tool call logs, model invocation logs, and final output. Without it: reconstructing a single decision requires finding the timestamp, searching each system for entries within a narrow time window, manually verifying the entries relate to the same decision, and hoping no other decisions occurred simultaneously. With it: one query returns every log entry for that decision across all systems. The difference is 30–60 minutes of manual detective work versus 30 seconds of a database query.
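
A minimal illustration of correlation ID propagation using Python's `contextvars`, so every subsystem logs under the same decision ID without passing it around explicitly:

```python
import uuid
from contextvars import ContextVar

# One correlation ID per decision, visible to every component that logs.
correlation_id: ContextVar[str] = ContextVar("correlation_id")

def start_decision() -> str:
    """Mint a correlation ID at the start of a decision and bind it to context."""
    cid = str(uuid.uuid4())
    correlation_id.set(cid)
    return cid

def log_event(audit_store, event_type: str, payload: dict) -> None:
    """Every log entry carries the same decision_id, regardless of subsystem."""
    audit_store.append({
        "decision_id": correlation_id.get(),
        "event_type": event_type,  # "prompt", "retrieval", "tool_call", "output", ...
        **payload,
    })
```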

Red flags
  • "We correlate by timestamp"
  • "Each system has its own ID scheme"
  • "We manually piece together logs"
  • "Finding all evidence for one decision takes significant time"
Gap identified. Reconstructing decisions is manual and time-consuming. Risk of missing evidence or mis-correlating entries across systems. Doesn't scale to regulatory sampling requests.
Q2.3 Are your audit logs tamper-evident — can any modification be detected independently?

Tamper-evident logging includes: cryptographic hashing of log entries, immutable append-only storage, digital signatures, and independent timestamping. The litigation scenario: a denied applicant claims discrimination. You produce logs showing the decision. Their legal team asks how they can verify the logs weren't modified after the suit was filed — your IT team has database access. "We have policies against modifying logs" is not an answer to that question. Cryptographic integrity is. Each entry hashed and chained means any modification breaks the chain, which can be verified independently without trusting your assurances.
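
A simplified hash-chain sketch — production designs typically add digital signatures and independent timestamping, but the chaining principle it illustrates is this:

```python
import hashlib
import json

def append_chained(log: list[dict], entry: dict) -> dict:
    """Append an entry whose hash covers the previous entry's hash.

    Modifying any earlier entry breaks every hash after it, which a third
    party can verify without trusting the log's operators.
    """
    prev_hash = log[-1]["entry_hash"] if log else "0" * 64
    body = json.dumps(entry, sort_keys=True, ensure_ascii=False)
    entry_hash = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    chained = {**entry, "prev_hash": prev_hash, "entry_hash": entry_hash}
    log.append(chained)
    return chained

def verify_chain(log: list[dict]) -> bool:
    """Independently recompute the chain; False means the log was altered."""
    prev_hash = "0" * 64
    for item in log:
        body = {k: v for k, v in item.items() if k not in ("prev_hash", "entry_hash")}
        expected = hashlib.sha256(
            (prev_hash + json.dumps(body, sort_keys=True, ensure_ascii=False)).encode()
        ).hexdigest()
        if item["prev_hash"] != prev_hash or item["entry_hash"] != expected:
            return False
        prev_hash = item["entry_hash"]
    return True
```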

Red flags
  • "Logs are stored in a database with admin access"
  • "We have policies against modifying logs"
  • "Backups prove logs haven't changed" — backups can be modified too
  • "We trust our administrators"
Gap identified. Audit evidence can be challenged in regulatory inquiries or litigation. You cannot prove logs haven't been altered after the fact — integrity relies on policy, not on cryptographic proof.
Section 3 of 5 · 3 questions
Rationale Capture
Can you produce human-reviewable explanations of why the agent decided what it decided? "The agent approved it" is evidence that a decision occurred. It is not rationale. Rationale shows what factors were considered, how they were weighted, and why this conclusion rather than an alternative.
Q3.1 Do you log the agent's explanation of its reasoning, not just its conclusion?

Decision rationale includes: key factors considered, how factors were weighted, why this decision versus alternatives, confidence level, assumptions made, and risk considerations. The regulatory gap: a regulator asks why your agent denied an application and you can show "Decision: DENY, Timestamp: 15:42:35" — but not the factors that led there. Without documented rationale, the regulator cannot verify the decision was appropriate, check whether protected characteristics influenced it, or determine if policy was applied correctly. "We'd have to infer from the application data" is the answer that triggers deeper investigation.
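
One possible shape for a rationale record stored next to the conclusion rather than instead of it — field names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class DecisionRationale:
    """Structured rationale persisted with every decision record."""
    decision: str                       # e.g. "DENY"
    key_factors: list[str]              # factors the agent actually considered
    factor_weights: dict[str, float]    # how each factor influenced the outcome
    alternatives_considered: list[str]  # why not approve / refer / request more info
    confidence: float                   # model-reported or calibrated score
    assumptions: list[str] = field(default_factory=list)
    risk_notes: list[str] = field(default_factory=list)
```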

Red flags
  • "We log the decision, not the reasoning"
  • "The model output is a classification"
  • "We can explain what the policy says, but not what the agent concluded"
  • "Rationale would be too verbose to log"
Gap identified. Decisions cannot be reviewed for appropriateness, bias, or policy compliance. You cannot explain to applicants why they were denied, and regulators cannot independently verify decisions were appropriate.
Q3.2 Do you capture intermediate reasoning steps — not just the final conclusion?

Intermediate steps include: data validation, initial screening results, factor scoring, risk classification, policy application logic, confidence assessment, and final synthesis. The difference between input-to-output logging and step capture is the difference between "it approved" and "it validated data, screened against minimum criteria, assessed financial strength, scored risk, applied policy limits, and assessed confidence before concluding approval." The regulator's concern with input-to-output only: "I can see it approved. I can't see the reasoning process. Did it consider all relevant factors? Did it weight them appropriately?" You cannot verify soundness of process from conclusion alone.
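
A sketch of step capture: each pipeline stage appends its outcome to a trace stored with the decision. The step names below are illustrative, not a required sequence:

```python
from datetime import datetime, timezone

class ReasoningTrace:
    """Collects intermediate steps as the agent pipeline executes."""

    def __init__(self, decision_id: str):
        self.decision_id = decision_id
        self.steps: list[dict] = []

    def record(self, step: str, outcome: str, detail: dict | None = None) -> None:
        self.steps.append({
            "step": step,        # "data_validation", "risk_scoring", ...
            "outcome": outcome,  # "passed", "score=0.72", ...
            "detail": detail or {},
            "at": datetime.now(timezone.utc).isoformat(),
        })

# Illustrative usage inside the pipeline:
# trace = ReasoningTrace(decision_id)
# trace.record("data_validation", "passed")
# trace.record("initial_screening", "meets minimum criteria")
# trace.record("risk_scoring", "score=0.72", {"dscr": 1.4, "utilization": 0.31})
```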

Red flags
  • "The agent's reasoning is internal to the model"
  • "We capture final decision only"
  • "Intermediate steps would be too much to log"
  • "We trust the model's reasoning process"
Gap identified. You can only verify if the conclusion matches policy, not if the path to get there was appropriate. Soundness of reasoning process cannot be independently reviewed.
Q3.3 Is your audit evidence reviewable by compliance teams, auditors, and regulators without engineering support?

Reviewable means: plain language explanations, structured scannable format, includes context not just data dumps, and can be understood by domain experts who are not engineers. The test: hand your audit evidence to a loan officer, a compliance auditor, and a legal team member. Can they understand what happened, verify the decision was appropriate, and make an independent determination without an engineer in the room? If the answer is "they'd need engineering to interpret the logs" or "we'd create a separate summary for regulators," your audit trail fails at its primary use case — being reviewed by the people who need to review it.

Red flags
  • "Our evidence is in log format"
  • "Compliance team asks engineering to explain decisions"
  • "We'd need to create a separate summary for regulators"
  • "The technical logs are our audit trail"
Gap identified. Every regulatory inquiry requires engineering support to translate logs. Audit evidence cannot be reviewed independently by the compliance, legal, or regulatory stakeholders who need to use it.
Section 4 of 5 · 3 questions
Exception Detection and Handling
Do you detect when agent behavior deviates from expectations? A behavioral shift that goes undetected for six weeks becomes a retroactive review of 2,000 decisions. Detection on day one limits exposure to 50 decisions.
Q4.1 Do you have a documented baseline of expected agent behavior with automated alerting when that baseline shifts?

Behavioral baseline includes: expected approval or denial rates, typical confidence distributions, normal reasoning patterns, expected factor weightings, and standard processing time. The failure mode: model updated in March, approval rate dropped from 65% to 45%, nobody noticed for six weeks. Discovered when the regulator asked why rates changed. By then, 2,000+ decisions had been made under the changed behavior. With baseline monitoring: day one alert — "approval rate dropped to 48%, expected 65% plus or minus 5%." Investigation within hours. Policy interpretation issue identified in new model version. Fifty decisions under anomalous behavior, reviewed immediately. The difference is scale of exposure.
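
A minimal baseline check, using the illustrative 65% ± 5% band from the example above — not a recommended threshold:

```python
def check_approval_rate(decisions: list[dict], expected_rate: float = 0.65,
                        tolerance: float = 0.05) -> str | None:
    """Compare the recent approval rate against a documented baseline.

    Returns an alert message when the observed rate leaves the expected band.
    """
    if not decisions:
        return None
    approvals = sum(1 for d in decisions if d["decision"] == "APPROVE")
    rate = approvals / len(decisions)
    if abs(rate - expected_rate) > tolerance:
        return (f"ALERT: approval rate {rate:.0%} outside expected "
                f"{expected_rate:.0%} ± {tolerance:.0%} over {len(decisions)} decisions")
    return None
```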

Red flags
  • "We don't monitor agent behavior patterns"
  • "We only look at individual decisions"
  • "Behavioral changes discovered during audits"
  • "No alerting on rate shifts"
Gap identified. Agent behavior can shift significantly without detection. Anomalies are discovered months later during audits, by which point hundreds or thousands of decisions have been made under the changed behavior.
Q4.2 When individual decisions deviate from policy or expectations — low confidence, edge cases, unusual reasoning — are they automatically flagged for human review?

Deviation flagging covers: low confidence decisions below threshold, decisions near policy limits, unusual factor combinations, reasoning inconsistent with similar past decisions, and atypical patterns. The production failure: agent approved a $500K line — the largest ever granted. Standard log entry, no flag. Reviewed: never, treated as routine. Discovered eight months later during portfolio review when the borrower defaulted. The decision was at the extreme edge of policy and should have been escalated to a senior underwriter. No flagging mechanism existed for decisions at policy limits. With flagging: automatic trigger, senior review same day, approval confirmed with proper documentation, defensible decision on record.
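
A sketch of rule-based flagging; the thresholds are placeholders each deployment would set from its own policy:

```python
def flag_for_review(decision: dict,
                    confidence_floor: float = 0.70,
                    policy_limit: float = 500_000,
                    limit_margin: float = 0.90) -> list[str]:
    """Return the reasons a decision should be escalated to human review."""
    flags = []
    if decision.get("confidence", 1.0) < confidence_floor:
        flags.append("low_confidence")
    if decision.get("amount", 0) >= policy_limit * limit_margin:
        flags.append("near_policy_limit")
    if decision.get("unusual_factor_combination"):
        flags.append("atypical_profile")
    return flags
```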

Red flags
  • "All agent decisions are treated the same"
  • "We don't have edge case detection"
  • "Low confidence decisions aren't flagged"
  • "No escalation mechanism"
Gap identified. Edge cases and questionable decisions are not identified for human review. Risky decisions are treated as routine, and review only happens after adverse outcomes surface them.
Q4.3 Can you automatically halt agent decisions when significant anomalies are detected — before damage compounds?

Freeze mechanisms include: automatic pause on major behavioral shifts, kill switch for detected errors, human approval required for flagged decisions, and batch review queue before execution. The failure pattern without freeze: anomaly detected at 95% approval rate overnight, alert sent to engineering team, agent approved 400 applications before freeze implemented manually, prompt template error removed credit score check, 400 potentially inappropriate approvals requiring retroactive review, some loans already funded. With freeze: anomaly detected on first 10 decisions, automatic move to review queue, zero inappropriate approvals executed, error identified and fixed within two hours. The mechanism is not monitoring — it is the gate that monitoring triggers.
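
A minimal gate sketch: monitoring calls freeze(), and flagged or frozen decisions land in a review queue instead of executing:

```python
class DecisionGate:
    """Routes decisions to execution or to a review queue; can freeze globally."""

    def __init__(self):
        self.frozen = False
        self.freeze_reason = ""
        self.review_queue: list[dict] = []

    def freeze(self, reason: str) -> None:
        """Called by monitoring when an anomaly crosses the halt threshold."""
        self.frozen = True
        self.freeze_reason = reason

    def submit(self, decision: dict, flags: list[str]) -> str:
        """Flagged or frozen decisions queue for human approval instead of executing."""
        if self.frozen or flags:
            self.review_queue.append({**decision, "flags": flags})
            return "queued_for_review"
        return "execute"
```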

Red flags
  • "We monitor but can't halt automatically"
  • "Agent decisions execute in real-time without a gate"
  • "Anomalies are discovered after decisions are made"
  • "No kill switch mechanism"
Gap identified. When agent behavior becomes anomalous, decisions continue executing until manual intervention. Potential for hundreds of incorrect decisions before halt — and each one requires retroactive review.
Section 5 of 5 · 3 questions
Review and Reconstruction
Can auditors and regulators actually use your evidence to verify decisions? Having the evidence is necessary. Having it in a form that scales to regulatory sampling requests, passes tampering challenges, and surfaces issues before auditors do is what makes audit infrastructure defensible.
Q5.1 Do you have a systematic process for sampling and reviewing agent decisions before regulators ask you to?

A review workflow includes: regular sampling (for example, 2% of decisions monthly), risk-based sampling weighted toward high-value decisions, edge cases, and low confidence outputs, documented review criteria and checklist, sign-off process, issue escalation, and corrective action tracking. The difference: regulator samples 50 decisions and finds 8 with questionable rationale. If your answer is "we weren't aware of these issues, we don't have a regular review process," that triggers a deeper examination of your governance model. If your answer is "here are our monthly quality review reports showing what we sample, what we found, and what corrective actions we took," that closes the inquiry. Systematic review is the difference between discovering problems and having them discovered.
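
A simple risk-weighted sampling sketch, assuming decision records carry flags and amounts; the 2% and 25% rates and the $250K cutoff are illustrative:

```python
import random

def monthly_sample(decisions: list[dict], base_rate: float = 0.02,
                   high_risk_rate: float = 0.25) -> list[dict]:
    """Risk-weighted monthly sample: flagged or high-value decisions are
    sampled at a higher rate than routine ones.
    """
    sample = []
    for d in decisions:
        high_risk = bool(d.get("flags")) or d.get("amount", 0) >= 250_000
        rate = high_risk_rate if high_risk else base_rate
        if random.random() < rate:
            sample.append(d)
    return sample
```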

Red flags
  • "We only review when there's a complaint"
  • "We trust the agent's decisions"
  • "No regular sampling process"
  • "Quality review is ad-hoc"
Gap identified. No systematic quality assurance. Issues are discovered by regulators or customers rather than internally. First notice of a problem often comes with a request to explain it.
Q5.2 How long does it take to reconstruct complete decision evidence for a regulatory sampling request of 25 decisions?

Regulators have response deadlines, often 30 days. Sampling requests typically include 20–50 decisions, but a full audit may require hundreds. Manual reconstruction at 2–4 hours per decision means 50–100 hours of effort and 2–3 weeks of calendar time for 25 decisions — a response that nearly misses the deadline and arrives as inference rather than evidence. Automated export means querying the audit system with decision IDs, exporting replay bundles in 15 minutes, QA review in two hours, responding the same day. The regulator's concern with slow reconstruction isn't just operational — it signals that the evidence doesn't actually exist in a form that proves what happened.

Answer this one based on your current reality
  • Minutes: complete replay bundles exportable, automated process, 25 decisions under 30 minutes
  • Hours to days: manual gathering from multiple systems, some interpretation required
  • Cannot fully reconstruct: missing key evidence, must infer rather than prove
Gap identified. Cannot respond to regulatory inquiries efficiently. Manual reconstruction doesn't scale to full audits and produces inference rather than evidence. Slow response time signals to regulators that the evidence doesn't exist in ready form.
Q5.3 Have you validated your audit trail with compliance, legal, or external auditors — before an actual regulatory inquiry tests it?

Regulatory readiness validation includes: compliance team review of the audit approach, legal review for regulatory adequacy, external auditor testing of evidence, mock regulatory inquiry, and gap identification with remediation before the real examination. The failure pattern: first test is an actual regulatory exam. Discovery: cannot produce decision rationale, logs don't capture reasoning. Result: regulatory finding, remediation required under pressure, months of work. The alternative: mock inquiry six months before the expected audit, compliance team requests evidence for ten sample decisions, reconstruction reveals gaps in six of ten, remediation implemented, second test passes cleanly, actual audit finds no audit trail deficiencies. The gap between "we assume our logs are sufficient" and "we have tested our logs are sufficient" is where most regulatory findings originate.

Red flags
  • "We'll test it when regulators audit us"
  • "Compliance hasn't reviewed our approach"
  • "We assume our logs are sufficient"
  • "Never conducted a mock inquiry"
Gap identified. First test of audit trail adequacy will be an actual regulatory inquiry. Risk of significant findings and costly remediation under examination pressure.
Gaps requiring attention
Prioritization framework
Address first
  • Prompt retention — Q1.1 (without the exact prompt, no other evidence proves what the agent was instructed to do)
  • Retrieved context retention — Q1.2 (re-querying source systems returns current data, not what the agent saw at decision time)
  • Version stamping per decision — Q1.3 (cannot explain behavioral changes over time without knowing which model and policy were active)
  • Decision rationale logging — Q3.1 (conclusions without rationale cannot be reviewed for appropriateness or bias)
  • Freeze on anomaly — Q4.3 (monitoring without a halt mechanism allows damage to compound before manual intervention)
Within 30 days
  • Replay bundle creation — Q2.1 (systematic export capability transforms regulatory response from days to minutes)
  • Correlation IDs across all systems — Q2.2 (end-to-end traceability by decision ID eliminates manual timestamp correlation)
  • Intermediate reasoning steps — Q3.2 (soundness of reasoning process cannot be verified from conclusion alone)
  • Behavioral baseline monitoring — Q4.1 (behavioral shifts that go undetected for weeks become retroactive reviews of thousands of decisions)
  • Non-technical reviewability — Q3.3 (audit evidence that requires engineering support to interpret fails at its primary regulatory use case)
Document and monitor
  • Tamper-evident logging — Q2.3 (cryptographic integrity is a litigation requirement; policy-based controls are not equivalent)
  • Deviation flagging — Q4.2 (edge cases treated as routine decisions produce indefensible outcomes at policy limits)
  • Sampling and review workflow — Q5.1 (systematic internal review catches issues before regulators do)
  • Reconstruction time — Q5.2 (automated export capability is a prerequisite for regulatory response at scale)
  • Regulatory readiness validation — Q5.3 (mock inquiry identifies gaps while there is time to close them)
Next steps

Close these gaps. The Audit Trail Implementation Playbook covers step-by-step closure methodology for each control surface this playbook maps: replay bundle architecture, tamper-evident logging design, correlation ID implementation, rationale capture formats, and evidence structures that satisfy regulatory review without requiring engineering support to interpret.

Join the waitlist for implementation access →