Why AI governance assessments need runtime evidence
An AI governance assessment without runtime evidence is just an opinion delivered with confidence. Most assessments produce a maturity score or a tidy checklist of controls and stop there. They don't prove that any of those controls actually run when the model takes action.
The assessment market is stuck on policy
Walk through ten AI governance assessments today and you’ll see the same shape. A vendor or consultant interviews the team, reviews the policy binder, maps controls to NIST AI RMF or ISO/IEC 42001, and hands back a maturity rating. Sometimes there’s a heat map. Sometimes a remediation plan. Almost never is there evidence that the controls in the binder actually fired the last time the AI did something consequential.
That worked when AI sat behind a chat window and a human approved every action. It doesn’t hold up when the AI is autonomously routing tickets, scheduling appointments, drafting clinical notes, or executing trades. Once the model acts, “we have a policy that says the model shouldn’t do X” isn’t the same kind of statement as “here is the signed receipt showing the X-blocking control ran on this transaction at this timestamp.”
The next question reviewers are asking
We sit in a lot of enterprise security reviews. The question that’s changed in the last twelve months isn’t “do you have an AI policy?” or “are you ISO 42001 certified?” Those are table stakes now. The new question, asked in plain language by reviewers, regulators, auditors, and increasingly by boards, is some version of:
“Show me the evidence the controls fired.”
Not the policy. Not the architecture diagram. The evidence—per transaction, per agent action, per high-risk decision—that the prompt was inspected, the output was checked, the PII was redacted, the human reviewed the recommendation, the model didn’t exfiltrate data. A maturity score doesn’t answer that question. A signed runtime receipt does.
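What does a signed runtime receipt actually contain? Here is a minimal sketch in Python. The schema, field names, and control names are our illustrative assumptions, not a published standard; the point is that every control's result is recorded per action, and the whole record is reduced to canonical bytes that can be signed.

```python
# Illustrative receipt for a single agent action. All field and control
# names are hypothetical; the structure, not the schema, is the point.
import hashlib
import json
from datetime import datetime, timezone

receipt = {
    "action_id": "act_7f3c9a",                  # hypothetical per-action ID
    "workflow": "claims-triage",                # the named AI workflow
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "controls": [
        {"control": "prompt_inspection",   "result": "pass"},
        {"control": "pii_redaction",       "result": "pass", "redactions": 3},
        {"control": "output_policy_check", "result": "pass"},
    ],
    # Digest of the locally stored prompt/output pair, so the receipt can
    # reference the exact action without the sensitive data ever leaving.
    "payload_digest": hashlib.sha256(b"<prompt and output bytes>").hexdigest(),
}

# Canonical bytes: what actually gets signed (signing is sketched further down).
canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":")).encode()
```

Note the payload digest: the receipt can pin down the exact prompt and output without that data ever leaving the customer's environment.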
What an evidence-grade assessment produces
The next generation of AI governance assessment has to produce two artifacts, not one:
- A hardening plan — the controls that need to exist on the AI workflow, mapped to the frameworks the buyer cares about (NIST AI RMF, ISO/IEC 42001, HIPAA, EU AI Act, SR 11-7).
- A path to runtime evidence — signed evidence receipts that prove each of those controls ran when the AI took action, packaged so a security reviewer or auditor can verify them without taking anyone’s word.
The first artifact is what most assessments produce today. The second is what closes deals, satisfies regulators, and survives the first incident. Without it, the hardening plan is a promise. With it, the hardening plan is a contract the system enforces on itself every time it acts.
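To make "a contract the system enforces on itself" concrete, here is a sketch of the enforcement point under our own assumptions. Nothing here is a real library API; it only shows the shape: controls run on every action, their results feed the receipt, and a failing control blocks the action rather than logging a warning.

```python
# A sketch of a runtime enforcement point. Names and types are
# illustrative, not a real API.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ControlResult:
    name: str
    passed: bool
    detail: str = ""

def enforce(action: Callable[[], str],
            controls: list[Callable[[], ControlResult]]) -> tuple[Optional[str], list[ControlResult]]:
    """Run every control, then the action only if all passed.
    The results list feeds the signed receipt either way, so the
    evidence exists whether the action ran or was blocked."""
    results = [check() for check in controls]
    if all(r.passed for r in results):
        return action(), results
    return None, results
```

The property that matters is that there is no code path where the action runs without producing evidence.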
A worked example: the Sprint we run
We built the Glacis Agent Runtime Security & Evidence Sprint to make this concrete on a single workflow rather than across an entire organization. It’s a paid, fixed-scope engagement: one named AI workflow, ten business days, $48k. The output is three artifacts the customer can hand to their next enterprise reviewer:
- A runtime control plan — the specific local runtime controls that need to sit in front of, around, and behind the AI on that workflow, with the failure modes each one is designed to catch.
- An evidence gap map — for each control, what evidence the system can produce today versus what's missing, and what it would take to close the gap with signed evidence receipts (one illustrative row is sketched just after this list).
- A customer-facing security review artifact — a packaged document the customer’s sales or compliance team can give to an enterprise reviewer to short-circuit the next 60 days of questionnaires.
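To give a feel for what the gap map records, here is one illustrative row expressed as data. Every value, including the framework references, is a hypothetical example rather than output from a real engagement.

```python
# One made-up row of an evidence gap map.
gap_map_row = {
    "control": "pii_redaction",
    "framework_refs": ["HIPAA 45 CFR 164.312(b)", "ISO/IEC 42001"],
    "evidence_today": "unsigned application log line, 30-day retention",
    "evidence_target": "signed per-action receipt recording redaction count",
    "gap": "no signature, no per-action linkage, retention too short",
    "closure_effort": "wrap the redaction step in a receipt-emitting control",
}
```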
The Sprint runs inside the customer’s infrastructure. There’s zero sensitive-data egress—the local runtime controls and the evidence packs they emit stay where the data lives. We don’t centralize prompts, outputs, or logs. The artifact the customer hands to a reviewer is one they own, signed by their own infrastructure, not by us.
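"Signed by their own infrastructure" can be as simple as an asymmetric signature over the canonical receipt bytes, made with a key the customer controls. Here is a sketch using Ed25519 via the open-source `cryptography` package; in a real deployment the key would live in an HSM or KMS, not be generated inline.

```python
# Sketch: signing a receipt with a customer-held key. Key generation is
# inlined for illustration only; real deployments would keep the signing
# key in an HSM or cloud KMS inside the customer's environment.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

receipt = {"action_id": "act_7f3c9a",
           "controls": [{"control": "pii_redaction", "result": "pass"}]}
canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":")).encode()

signing_key = Ed25519PrivateKey.generate()   # stays in customer infrastructure
signature = signing_key.sign(canonical)      # 64-byte Ed25519 signature
public_key = signing_key.public_key()        # the only thing a reviewer needs
```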
If your next assessment doesn’t end in a receipt
The test is simple. After the assessment is done, can you answer the "show me the evidence the controls fired" question for one specific AI action that happened last week? If the answer is "we have a policy that says we would have caught that," the assessment hasn't finished its job. It has described the world as you'd like it to be, not the system as it actually behaves.
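The flip side of that test is that a reviewer should be able to run the check themselves. Here is a sketch of verification, assuming the customer has published an Ed25519 public key for the workflow and the receipt schema sketched earlier; both are our assumptions, not a documented interface.

```python
# Sketch: a reviewer verifying one receipt against the customer's
# published public key. No trust in the vendor, and no access to the
# underlying prompts or outputs, is required.
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_receipt(receipt: dict, signature: bytes,
                   public_key: Ed25519PublicKey) -> bool:
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":")).encode()
    try:
        public_key.verify(signature, canonical)  # raises on any tampering
        return True
    except InvalidSignature:
        return False
```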
Maturity scores are useful as a starting point. Hardening plans are useful as a roadmap. Neither is sufficient on its own anymore. The bar that matters is whether the assessment leaves you with a credible path to runtime evidence on the workflows where AI is making decisions that affect people, money, or care.
For healthcare AI vendors: HIPAA’s technical safeguards include audit controls under 45 C.F.R. § 164.312(b). The standard doesn’t describe a maturity rating—it describes records of activity. A governance assessment that maps to HIPAA without producing per-action evidence is mapping to the wrong half of the rule.
Turn your governance plan into runtime evidence
The Agent Runtime Security & Evidence Sprint takes one high-risk AI workflow, hardens the runtime locally, and produces signed evidence receipts for enterprise review—in 10 business days.
Book the Sprint