# AI Evaluations
This folder defines what "correct" means for every Smartflow AI capability. It is the shared contract between Product, QA, and Engineering for AI output quality.
No AI capability is ready for release until it has been evaluated against the cases in this folder and has met the passing thresholds defined in EVAL-STANDARDS.md.
## Folder Structure

```text
06-ai-evals/
├── README.md                            ← This file
├── EVAL-STANDARDS.md                    ← Authoritative quality contract (read this first)
├── ground-truth/
│   ├── _TEMPLATE-eval-case.md           ← Template for new eval cases
│   ├── loan-onboarding/                 ← 4 cases: LO-001 through LO-004
│   ├── covenant-monitoring/             ← 4 cases: CM-001 through CM-004
│   ├── bmt-validation/                  ← 3 cases: BM-001 through BM-003
│   └── document-qa/                     ← 4 cases: QA-001 through QA-004
├── scoring-rubrics/
│   ├── extraction-accuracy-rubric.md    ← Loan onboarding scoring
│   ├── provenance-quality-rubric.md     ← Provenance scoring (all capabilities)
│   ├── covenant-identification-rubric.md
│   └── qa-answer-quality-rubric.md
└── eval-history/
    └── _TEMPLATE-eval-run.md            ← Template for eval run reports
```
## Quick Reference — 15 Eval Cases
| ID | Capability | Document | Difficulty |
|---|---|---|---|
| LO-001 | Loan Onboarding | acme-corp-facility-agreement.pdf | Standard |
| LO-002 | Loan Onboarding | delta-amendment-notice.pdf | Complex |
| LO-003 | Loan Onboarding | gamma-consortium-syndicated.pdf | Complex |
| LO-004 | Loan Onboarding | epsilon-messy-scan.pdf | Edge-case |
| CM-001 | Covenant Monitoring | acme-corp-facility-agreement.pdf | Standard |
| CM-002 | Covenant Monitoring | acme-corp-facility-agreement.pdf | Standard |
| CM-003 | Covenant Monitoring | gamma-consortium-syndicated.pdf | Complex |
| CM-004 | Covenant Monitoring | delta-amendment-notice.pdf | Complex |
| BM-001 | BMT Validation | acme-corp-facility-agreement.pdf | Standard |
| BM-002 | BMT Validation | gamma-consortium-syndicated.pdf | Complex |
| BM-003 | BMT Validation | betabank-revolving-credit.pdf | Complex |
| QA-001 | Document Q&A | acme-corp-facility-agreement.pdf | Standard |
| QA-002 | Document Q&A | acme-corp-facility-agreement.pdf | Complex |
| QA-003 | Document Q&A | acme-corp-facility-agreement.pdf | Complex |
| QA-004 | Document Q&A | acme-corp-facility-agreement.pdf | Standard |
All synthetic documents are located at `07-demos/demo-data/synthetic-documents/`. No real customer documents are used in evaluations.
## Passing Thresholds
| Scope | Threshold |
|---|---|
| Per-capability score | ≥ 0.85 |
| Any single Tier 1 field (Loan Onboarding) | ≥ 0.70 average across all cases |
| Hallucination rate | 0.0 (zero tolerance) |
| Provenance completeness | ≥ 0.90 |
| Grounded refusal accuracy (Q&A) | ≥ 0.95 |
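For automated gating, the thresholds above could be encoded in a small check. This is a sketch only: the metric names and the `passes` helper are illustrative assumptions, not part of the Smartflow codebase, and the authoritative aggregation rules remain those in EVAL-STANDARDS.md and the rubrics.

```python
# Hypothetical encoding of the passing thresholds table above.
# Metric names are illustrative, not taken from the real pipeline.
THRESHOLDS = {
    "capability_score": 0.85,
    "tier1_field_average": 0.70,
    "provenance_completeness": 0.90,
    "grounded_refusal_accuracy": 0.95,
}

def passes(metrics: dict) -> bool:
    """Return True only if every metric meets its threshold and the
    hallucination count is exactly zero (zero tolerance)."""
    if metrics.get("hallucination_count", 1) != 0:
        return False
    return all(
        metrics.get(name, 0.0) >= floor
        for name, floor in THRESHOLDS.items()
    )
```

A single hallucination fails the run regardless of how strong the numeric scores are, mirroring the zero-tolerance row in the table.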
## How to Add a New Eval Case

1. **Copy the template.** Copy `ground-truth/_TEMPLATE-eval-case.md` to the appropriate capability subfolder.
2. **Choose the next sequential ID:**
   - Loan Onboarding: `LO-NNN`
   - Covenant Monitoring: `CM-NNN`
   - BMT Validation: `BM-NNN`
   - Document Q&A: `QA-NNN`
3. **Fill every section.** No eval case may be merged with an empty ground truth citation. Every case must have a verbatim source quote that is findable in the source PDF.
4. **Verify the document reference.** The `document_ref` frontmatter must match an exact filename in `07-demos/demo-data/synthetic-documents/`. Cases referencing documents that do not exist in that folder will fail the quality check.
5. **Classify difficulty:** Standard (clean document, unambiguous fields), Complex (multi-entity, amendment, or multi-clause synthesis required), or Edge-case (degraded input, out-of-scope, novel structure).
6. **If adding a regression case** (triggered by a past failure): document the failure in the `Regression Note` section — the failing model version, the specific error, and the date it was first observed.
7. **Submit for peer review.** A second person must confirm the ground truth citation is accurate before the case is set to `status: active`. Use `status: under-review` until then.
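The document-reference and status checks described above can be automated before review. The sketch below assumes the case frontmatter has already been parsed into a dict; the `validate_case` helper and its return shape are hypothetical, though the folder path and status values come from this README.

```python
from pathlib import Path

# Path and status values per this README; the function itself is a sketch.
DOCS_DIR = Path("07-demos/demo-data/synthetic-documents")
VALID_STATUSES = {"active", "under-review"}

def validate_case(frontmatter: dict, docs_dir: Path = DOCS_DIR) -> list[str]:
    """Return a list of problems with an eval case's frontmatter.
    An empty list means the case passes these basic checks."""
    problems = []
    ref = frontmatter.get("document_ref")
    if not ref or not (docs_dir / ref).is_file():
        problems.append(f"document_ref {ref!r} not found in {docs_dir}")
    status = frontmatter.get("status")
    if status not in VALID_STATUSES:
        problems.append(f"unexpected status {status!r}")
    return problems
```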
## How to Run Evaluations

### Manual review

1. Submit the source document to the Smartflow model under evaluation via the standard ingestion pipeline.
2. Collect the model output (extraction JSON, covenant list, BMT report, or Q&A response).
3. For each eval case, compare the model output to the expected output in the case file field by field.
4. Apply the appropriate rubric to score each field/dimension.
5. Calculate the case score using the aggregation formula in the rubric.
6. Record results in a new eval run report (copied from `eval-history/_TEMPLATE-eval-run.md`).
7. Flag any hallucinations immediately. Do not proceed with capability scoring until hallucinations are root-cause-analysed.
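As a rough illustration of the field-by-field comparison and aggregation described above, a naive equal-weight scorer might look like the following. The real rubrics define per-field weights and partial credit, so treat this only as a sketch.

```python
def score_case(expected: dict, actual: dict) -> float:
    """Naive field-by-field comparison: 1.0 for an exact match on a
    field, 0.0 otherwise, averaged with equal weight over the expected
    fields. The rubric aggregation formulas supersede this sketch."""
    if not expected:
        return 0.0
    hits = sum(
        1 for field, value in expected.items()
        if actual.get(field) == value
    )
    return hits / len(expected)
```

For example, an extraction that matches one of two expected fields would score 0.5 under this simplified scheme.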
### CI integration

When eval cases are integrated into a CI pipeline, the pipeline must:

- Load each eval case's expected output as the acceptance criterion.
- Submit the source document via the standard API.
- Compare model output to expected output using the rubric scoring logic.
- Report case scores and the capability-level aggregate.
- Fail the CI run if any capability is below the passing threshold or any hallucination is detected.
- Output a structured eval run report matching `_TEMPLATE-eval-run.md`.
Eval cases are designed to be deterministic (fixed inputs, exact expected outputs) so that CI results are reproducible.
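A minimal CI gate implementing the fail conditions above might look like this. The function name, input shape, and messages are assumptions for illustration, not an existing Smartflow interface.

```python
import sys

def ci_gate(capability_scores: dict, hallucinations: int,
            threshold: float = 0.85) -> int:
    """Return a process exit code for the CI run: 0 if every capability
    meets the threshold and no hallucination was detected, 1 otherwise."""
    if hallucinations > 0:
        print("FAIL: hallucination detected; root-cause analysis required",
              file=sys.stderr)
        return 1
    failing = {c: s for c, s in capability_scores.items() if s < threshold}
    if failing:
        print(f"FAIL: capabilities below threshold: {failing}",
              file=sys.stderr)
        return 1
    return 0
```

A CI job would call this after scoring and exit with the returned code, so any hallucination or sub-threshold capability fails the build.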
## How to Interpret Scores
| Score range | Interpretation | Action |
|---|---|---|
| 0.95–1.0 | Excellent — exceeds threshold | Ship with confidence |
| 0.85–0.94 | Passing — meets threshold | Ship; log any weak cases for improvement |
| 0.70–0.84 | Below threshold | Do not release; prioritise targeted improvement on failing cases |
| 0.50–0.69 | Significant gap | Investigation required; likely a systematic issue in a specific field or document type |
| < 0.50 | Critical failure | Escalate to engineering; do not release under any circumstances |
| Any hallucination | Override — score irrelevant | Stop evaluation; root-cause analysis required before re-run |
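The bands in the table above map naturally to a small helper. This is an illustrative sketch: the function and its action strings paraphrase the table and are not part of any existing tooling.

```python
def interpret(score: float, hallucination: bool = False) -> str:
    """Map a capability score to the action bands in the table above.
    A hallucination overrides the numeric score entirely."""
    if hallucination:
        return "stop: root-cause analysis required before re-run"
    if score >= 0.95:
        return "ship with confidence"
    if score >= 0.85:
        return "ship; log weak cases for improvement"
    if score >= 0.70:
        return "do not release; targeted improvement on failing cases"
    if score >= 0.50:
        return "investigate; likely a systematic issue"
    return "critical failure: escalate, do not release"
```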
## When to Add Regression Cases
Add a regression case when:
- A model failure in production or a prior eval run reveals a gap not covered by any existing case.
- A new document type, governing law, or instrument structure is added to Smartflow's supported scope.
- A previously-passing field drops below threshold in a new eval run.
- A new edge case is discovered during demo preparation, pilot testing, or scoping sessions.
Regression cases use `status: active` and must have the `Regression Note` section populated with the failure details. They are included in all future capability score calculations.
## Related Documents
- EVAL-STANDARDS.md — Product contract: authoritative definitions of correct output
- Extraction Accuracy Rubric
- Provenance Quality Rubric
- Covenant Identification Rubric
- Q&A Answer Quality Rubric
- Eval Run Template
- Demos README