# AI Evaluations
This folder defines what "correct" means for every Smartflow AI capability. It is the shared contract between Product, QA, and Engineering for AI output quality.
No AI capability is ready for release until it has been evaluated against the cases in this folder and has met the passing thresholds defined in EVAL-STANDARDS.md.
## Folder Structure

```text
06-ai-evals/
├── README.md                            ← This file
├── EVAL-STANDARDS.md                    ← Authoritative quality contract (read this first)
├── ground-truth/
│   ├── _TEMPLATE-eval-case.md           ← Template for new eval cases
│   ├── loan-onboarding/                 ← 4 cases: LO-001 through LO-004
│   ├── covenant-monitoring/             ← 4 cases: CM-001 through CM-004
│   ├── bmt-validation/                  ← 3 cases: BM-001 through BM-003
│   └── document-qa/                     ← 4 cases: QA-001 through QA-004
├── scoring-rubrics/
│   ├── extraction-accuracy-rubric.md    ← Loan onboarding scoring
│   ├── provenance-quality-rubric.md     ← Provenance scoring (all capabilities)
│   ├── covenant-identification-rubric.md
│   └── qa-answer-quality-rubric.md
└── eval-history/
    └── _TEMPLATE-eval-run.md            ← Template for eval run reports
```
## Quick Reference — 15 Eval Cases
| ID | Capability | Document | Difficulty |
|---|---|---|---|
| LO-001 | Loan Onboarding | acme-corp-facility-agreement.pdf | Standard |
| LO-002 | Loan Onboarding | delta-amendment-notice.pdf | Complex |
| LO-003 | Loan Onboarding | gamma-consortium-syndicated.pdf | Complex |
| LO-004 | Loan Onboarding | epsilon-messy-scan.pdf | Edge-case |
| CM-001 | Covenant Monitoring | acme-corp-facility-agreement.pdf | Standard |
| CM-002 | Covenant Monitoring | acme-corp-facility-agreement.pdf | Standard |
| CM-003 | Covenant Monitoring | gamma-consortium-syndicated.pdf | Complex |
| CM-004 | Covenant Monitoring | delta-amendment-notice.pdf | Complex |
| BM-001 | BMT Validation | acme-corp-facility-agreement.pdf | Standard |
| BM-002 | BMT Validation | gamma-consortium-syndicated.pdf | Complex |
| BM-003 | BMT Validation | betabank-revolving-credit.pdf | Complex |
| QA-001 | Document Q&A | acme-corp-facility-agreement.pdf | Standard |
| QA-002 | Document Q&A | acme-corp-facility-agreement.pdf | Complex |
| QA-003 | Document Q&A | acme-corp-facility-agreement.pdf | Complex |
| QA-004 | Document Q&A | acme-corp-facility-agreement.pdf | Standard |
All synthetic documents are located at `07-demos/demo-data/synthetic-documents/`. No real customer documents are used in evaluations.
## Passing Thresholds
| Scope | Threshold |
|---|---|
| Per-capability score | ≥ 0.85 |
| Any single Tier 1 field (Loan Onboarding) | ≥ 0.70 average across all cases |
| Hallucination rate | 0.0 (zero tolerance) |
| Provenance completeness | ≥ 0.90 |
| Grounded refusal accuracy (Q&A) | ≥ 0.95 |
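For automated gating, the thresholds above could be encoded in a small check. This is a sketch only: the metric names and the `passes` helper are illustrative assumptions, not part of the Smartflow codebase, and the authoritative aggregation rules remain those in EVAL-STANDARDS.md and the rubrics.

```python
# Hypothetical encoding of the passing thresholds table above.
# Metric names are illustrative, not taken from the real pipeline.
THRESHOLDS = {
    "capability_score": 0.85,
    "tier1_field_average": 0.70,
    "provenance_completeness": 0.90,
    "grounded_refusal_accuracy": 0.95,
}

def passes(metrics: dict) -> bool:
    """Return True only if every metric meets its threshold and the
    hallucination count is exactly zero (zero tolerance)."""
    if metrics.get("hallucination_count", 1) != 0:
        return False
    return all(
        metrics.get(name, 0.0) >= floor
        for name, floor in THRESHOLDS.items()
    )
```

A single hallucination fails the run regardless of how strong the numeric scores are, mirroring the zero-tolerance row in the table.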
## How to Add a New Eval Case

1. **Copy the template.** Copy `ground-truth/_TEMPLATE-eval-case.md` to the appropriate capability subfolder.
2. **Choose the next sequential ID:**
   - Loan Onboarding: `LO-NNN`
   - Covenant Monitoring: `CM-NNN`
   - BMT Validation: `BM-NNN`
   - Document Q&A: `QA-NNN`
3. **Fill every section.** No eval case may be merged with an empty ground truth citation. Every case must have a verbatim source quote that is findable in the source PDF.
4. **Verify the document reference.** The `document_ref` frontmatter must match an exact filename in `07-demos/demo-data/synthetic-documents/`. Cases referencing documents that do not exist in that folder will fail the quality check.
5. **Classify difficulty:** Standard (clean document, unambiguous fields), Complex (multi-entity, amendment, or multi-clause synthesis required), or Edge-case (degraded input, out-of-scope, novel structure).
6. **If adding a regression case** (triggered by a past failure): document the failure in the `Regression Note` section — the failing model version, the specific error, and the date it was first observed.
7. **Submit for peer review.** A second person must confirm the ground truth citation is accurate before the case is set to `status: active`. Use `status: under-review` until then.
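The document-reference and status checks described above can be automated before review. The sketch below assumes the case frontmatter has already been parsed into a dict; the `validate_case` helper and its return shape are hypothetical, though the folder path and status values come from this README.

```python
from pathlib import Path

# Path and status values per this README; the function itself is a sketch.
DOCS_DIR = Path("07-demos/demo-data/synthetic-documents")
VALID_STATUSES = {"active", "under-review"}

def validate_case(frontmatter: dict, docs_dir: Path = DOCS_DIR) -> list[str]:
    """Return a list of problems with an eval case's frontmatter.
    An empty list means the case passes these basic checks."""
    problems = []
    ref = frontmatter.get("document_ref")
    if not ref or not (docs_dir / ref).is_file():
        problems.append(f"document_ref {ref!r} not found in {docs_dir}")
    status = frontmatter.get("status")
    if status not in VALID_STATUSES:
        problems.append(f"unexpected status {status!r}")
    return problems
```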
## How to Run Evaluations

### Manual review

1. Submit the source document to the Smartflow model under evaluation via the standard ingestion pipeline.
2. Collect the model output (extraction JSON, covenant list, BMT report, or Q&A response).
3. For each eval case, compare the model output to the expected output in the case file field by field.
4. Apply the appropriate rubric to score each field/dimension.
5. Calculate the case score using the aggregation formula in the rubric.
6. Record results in a new eval run report (copied from `eval-history/_TEMPLATE-eval-run.md`).
7. Flag any hallucinations immediately. Do not proceed with capability scoring until hallucinations are root-cause-analysed.
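As a rough illustration of the field-by-field comparison and aggregation described above, a naive equal-weight scorer might look like the following. The real rubrics define per-field weights and partial credit, so treat this only as a sketch.

```python
def score_case(expected: dict, actual: dict) -> float:
    """Naive field-by-field comparison: 1.0 for an exact match on a
    field, 0.0 otherwise, averaged with equal weight over the expected
    fields. The rubric aggregation formulas supersede this sketch."""
    if not expected:
        return 0.0
    hits = sum(
        1 for field, value in expected.items()
        if actual.get(field) == value
    )
    return hits / len(expected)
```

For example, an extraction that matches one of two expected fields would score 0.5 under this simplified scheme.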
### CI integration

When eval cases are integrated into a CI pipeline, the pipeline must:

- Load each eval case's expected output as the acceptance criterion.
- Submit the source document via the standard API.
- Compare model output to expected output using the rubric scoring logic.
- Report case scores and the capability-level aggregate.
- Fail the CI run if any capability is below the passing threshold or any hallucination is detected.
- Output a structured eval run report matching `_TEMPLATE-eval-run.md`.
Eval cases are designed to be deterministic (fixed inputs, exact expected outputs) so that CI results are reproducible.
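A minimal CI gate implementing the fail conditions above might look like this. The function name, input shape, and messages are assumptions for illustration, not an existing Smartflow interface.

```python
import sys

def ci_gate(capability_scores: dict, hallucinations: int,
            threshold: float = 0.85) -> int:
    """Return a process exit code for the CI run: 0 if every capability
    meets the threshold and no hallucination was detected, 1 otherwise."""
    if hallucinations > 0:
        print("FAIL: hallucination detected; root-cause analysis required",
              file=sys.stderr)
        return 1
    failing = {c: s for c, s in capability_scores.items() if s < threshold}
    if failing:
        print(f"FAIL: capabilities below threshold: {failing}",
              file=sys.stderr)
        return 1
    return 0
```

A CI job would call this after scoring and exit with the returned code, so any hallucination or sub-threshold capability fails the build.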
## How to Interpret Scores
| Score range | Interpretation | Action |
|---|---|---|
| 0.95–1.0 | Excellent — exceeds threshold | Ship with confidence |
| 0.85–0.94 | Passing — meets threshold | Ship; log any weak cases for improvement |
| 0.70–0.84 | Below threshold | Do not release; prioritise targeted improvement on failing cases |
| 0.50–0.69 | Significant gap | Investigation required; likely a systematic issue in a specific field or document type |
| < 0.50 | Critical failure | Escalate to engineering; do not release under any circumstances |
| Any hallucination | Override — score irrelevant | Stop evaluation; root-cause analysis required before re-run |
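The bands in the table above map naturally to a small helper. This is an illustrative sketch: the function and its action strings paraphrase the table and are not part of any existing tooling.

```python
def interpret(score: float, hallucination: bool = False) -> str:
    """Map a capability score to the action bands in the table above.
    A hallucination overrides the numeric score entirely."""
    if hallucination:
        return "stop: root-cause analysis required before re-run"
    if score >= 0.95:
        return "ship with confidence"
    if score >= 0.85:
        return "ship; log weak cases for improvement"
    if score >= 0.70:
        return "do not release; targeted improvement on failing cases"
    if score >= 0.50:
        return "investigate; likely a systematic issue"
    return "critical failure: escalate, do not release"
```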
## When to Add Regression Cases
Add a regression case when:
- A model failure in production or a prior eval run reveals a gap not covered by any existing case.
- A new document type, governing law, or instrument structure is added to Smartflow's supported scope.
- A previously-passing field drops below threshold in a new eval run.
- A new edge case is discovered during demo preparation, pilot testing, or scoping sessions.
Regression cases use `status: active` and must have the `Regression Note` section populated with the failure details. They are included in all future capability score calculations.
## Related Documents
- EVAL-STANDARDS.md — Product contract: authoritative definitions of correct output
- Extraction Accuracy Rubric
- Provenance Quality Rubric
- Covenant Identification Rubric
- Q&A Answer Quality Rubric
- Eval Run Template
- Demos README