
# AI Evaluations

This folder defines what "correct" means for every Smartflow AI capability. It is the shared contract between Product, QA, and Engineering for AI output quality.

> **This is the source of truth for acceptance.** No AI capability is ready for release until it has been evaluated against the cases in this folder and has met the passing thresholds defined in `EVAL-STANDARDS.md`.


## Folder Structure

```
06-ai-evals/
├── README.md                           ← This file
├── EVAL-STANDARDS.md                   ← Authoritative quality contract (read this first)
├── ground-truth/
│   ├── _TEMPLATE-eval-case.md          ← Template for new eval cases
│   ├── loan-onboarding/                ← 4 cases: LO-001 through LO-004
│   ├── covenant-monitoring/            ← 4 cases: CM-001 through CM-004
│   ├── bmt-validation/                 ← 3 cases: BM-001 through BM-003
│   └── document-qa/                    ← 4 cases: QA-001 through QA-004
├── scoring-rubrics/
│   ├── extraction-accuracy-rubric.md   ← Loan onboarding scoring
│   ├── provenance-quality-rubric.md    ← Provenance scoring (all capabilities)
│   ├── covenant-identification-rubric.md
│   └── qa-answer-quality-rubric.md
└── eval-history/
    └── _TEMPLATE-eval-run.md           ← Template for eval run reports
```

## Quick Reference — 15 Eval Cases

| ID | Capability | Document | Difficulty |
|---|---|---|---|
| LO-001 | Loan Onboarding | acme-corp-facility-agreement.pdf | Standard |
| LO-002 | Loan Onboarding | delta-amendment-notice.pdf | Complex |
| LO-003 | Loan Onboarding | gamma-consortium-syndicated.pdf | Complex |
| LO-004 | Loan Onboarding | epsilon-messy-scan.pdf | Edge-case |
| CM-001 | Covenant Monitoring | acme-corp-facility-agreement.pdf | Standard |
| CM-002 | Covenant Monitoring | acme-corp-facility-agreement.pdf | Standard |
| CM-003 | Covenant Monitoring | gamma-consortium-syndicated.pdf | Complex |
| CM-004 | Covenant Monitoring | delta-amendment-notice.pdf | Complex |
| BM-001 | BMT Validation | acme-corp-facility-agreement.pdf | Standard |
| BM-002 | BMT Validation | gamma-consortium-syndicated.pdf | Complex |
| BM-003 | BMT Validation | betabank-revolving-credit.pdf | Complex |
| QA-001 | Document Q&A | acme-corp-facility-agreement.pdf | Standard |
| QA-002 | Document Q&A | acme-corp-facility-agreement.pdf | Complex |
| QA-003 | Document Q&A | acme-corp-facility-agreement.pdf | Complex |
| QA-004 | Document Q&A | acme-corp-facility-agreement.pdf | Standard |

All synthetic documents are located at `07-demos/demo-data/synthetic-documents/`. No real customer documents are used in evaluations.


## Passing Thresholds

| Scope | Threshold |
|---|---|
| Per-capability score | ≥ 0.85 |
| Any single Tier 1 field (Loan Onboarding) | ≥ 0.70 average across all cases |
| Hallucination rate | 0.0 (zero tolerance) |
| Provenance completeness | ≥ 0.90 |
| Grounded refusal accuracy (Q&A) | ≥ 0.95 |
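The thresholds combine into a single release gate: every metric must clear its minimum, and any hallucination fails the gate outright. A minimal sketch of that logic, assuming results arrive as a flat dict — the metric names and structure here are illustrative, not a real Smartflow schema:

```python
# Hypothetical release-gate check for the thresholds above.
# Metric names and the results-dict shape are assumptions for this sketch.

THRESHOLDS = {
    "capability_score": 0.85,
    "tier1_field_avg": 0.70,
    "provenance_completeness": 0.90,
    "grounded_refusal_accuracy": 0.95,
}

def passes_release_gate(results: dict) -> bool:
    """True only if every threshold is met and zero hallucinations occurred."""
    if results["hallucination_count"] > 0:  # zero tolerance overrides everything
        return False
    return all(
        results[metric] >= minimum
        for metric, minimum in THRESHOLDS.items()
    )
```

Note that the hallucination check comes first: a capability can score 1.0 on every metric and still fail the gate.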

## How to Add a New Eval Case

1. **Copy the template.** Copy `ground-truth/_TEMPLATE-eval-case.md` to the appropriate capability subfolder.
2. **Choose the next sequential ID:**
   - Loan Onboarding: `LO-NNN`
   - Covenant Monitoring: `CM-NNN`
   - BMT Validation: `BM-NNN`
   - Document Q&A: `QA-NNN`
3. **Fill every section.** No eval case may be merged with an empty ground truth citation. Every case must have a verbatim source quote that is findable in the source PDF.
4. **Verify the document reference.** The `document_ref` frontmatter must match an exact filename in `07-demos/demo-data/synthetic-documents/`. Cases referencing documents that do not exist in that folder will fail the quality check.
5. **Classify difficulty:** Standard (clean document, unambiguous fields), Complex (multi-entity, amendment, or multi-clause synthesis required), or Edge-case (degraded input, out-of-scope, novel structure).
6. **If adding a regression case** (triggered by a past failure): document the failure in the Regression Note section — the failing model version, the specific error, and the date it was first observed.
7. **Submit for peer review.** A second person must confirm the ground truth citation is accurate before the case is set to `status: active`. Use `status: under-review` until then.
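The `document_ref` check in step 4 is mechanical and easy to script before review. A minimal sketch, assuming the repo layout shown in the folder structure — the function name and frontmatter parsing are illustrative, not part of the real quality check:

```python
from pathlib import Path

# Hypothetical pre-merge check: the document_ref named in a case's
# frontmatter must exist under the synthetic-documents folder.
# DOC_ROOT and the function name are assumptions for this sketch.

DOC_ROOT = Path("07-demos/demo-data/synthetic-documents")

def document_ref_exists(document_ref: str, root: Path = DOC_ROOT) -> bool:
    """True if the referenced synthetic document is an existing file."""
    return (root / document_ref).is_file()
```

Running this over every case file before opening a review catches the most common merge-blocking error (a typo in the filename) in seconds.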


## How to Run Evaluations

### Manual review

  1. Submit the source document to the Smartflow model under evaluation via the standard ingestion pipeline.
  2. Collect the model output (extraction JSON, covenant list, BMT report, or Q&A response).
  3. For each eval case, compare the model output to the expected output in the case file field-by-field.
  4. Apply the appropriate rubric to score each field/dimension.
  5. Calculate the case score using the aggregation formula in the rubric.
  6. Record results in a new eval run report (copied from eval-history/_TEMPLATE-eval-run.md).
  7. Flag any hallucinations immediately. Do not proceed with capability scoring until hallucinations are root-cause-analysed.

### CI integration

When eval cases are integrated into a CI pipeline, the pipeline must:

  • Load each eval case's expected output as the acceptance criterion.
  • Submit the source document via the standard API.
  • Compare model output to expected output using the rubric scoring logic.
  • Report case scores and capability-level aggregate.
  • Fail the CI run if any capability is below the passing threshold or any hallucination is detected.
  • Output a structured eval run report matching _TEMPLATE-eval-run.md.

Eval cases are designed to be deterministic (fixed inputs, exact expected outputs) so that CI results are reproducible.
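The aggregation and gating the pipeline must perform can be sketched as follows — the per-case result shape (`id`, `capability`, `score`, `hallucinated`) is an assumption for illustration, not the real pipeline schema:

```python
# Hypothetical CI gating logic for the requirements above: average case
# scores per capability, fail on any below-threshold capability or on
# any hallucination. The case-result dict shape is an assumption.

CAPABILITY_THRESHOLD = 0.85

def gate_ci_run(case_results: list[dict]) -> tuple[bool, dict]:
    """Return (passed, per-capability aggregate scores)."""
    if any(case["hallucinated"] for case in case_results):
        return False, {}  # hard stop: any hallucination fails the run

    by_capability: dict[str, list[float]] = {}
    for case in case_results:
        by_capability.setdefault(case["capability"], []).append(case["score"])

    aggregates = {cap: sum(s) / len(s) for cap, s in by_capability.items()}
    passed = all(avg >= CAPABILITY_THRESHOLD for avg in aggregates.values())
    return passed, aggregates
```

Because the eval cases are deterministic, the same `case_results` input always yields the same gate decision, which keeps CI reproducible.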


## How to Interpret Scores

| Score range | Interpretation | Action |
|---|---|---|
| 0.95–1.0 | Excellent — exceeds threshold | Ship with confidence |
| 0.85–0.94 | Passing — meets threshold | Ship; log any weak cases for improvement |
| 0.70–0.84 | Below threshold | Do not release; prioritise targeted improvement on failing cases |
| 0.50–0.69 | Significant gap | Investigation required; likely a systematic issue in a specific field or document type |
| < 0.50 | Critical failure | Escalate to engineering; do not release under any circumstances |
| Any hallucination | Override — score irrelevant | Stop evaluation; root-cause analysis required before re-run |
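The interpretation bands map directly onto a lookup function, which is handy when generating eval run reports automatically. A sketch, with band labels taken from the table and the function itself an illustrative assumption:

```python
# Hypothetical score-to-band mapping for the interpretation table.
# Band labels come from the table; the function name is an assumption.

def interpret_score(score: float, hallucination: bool = False) -> str:
    """Map a capability score to its interpretation band."""
    if hallucination:  # override: score is irrelevant
        return "stop — root-cause analysis required"
    if score >= 0.95:
        return "excellent"
    if score >= 0.85:
        return "passing"
    if score >= 0.70:
        return "below threshold"
    if score >= 0.50:
        return "significant gap"
    return "critical failure"
```

The `hallucination` flag is checked before the score, mirroring the table's override row: a hallucinated run is stopped regardless of how well it scored.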

## When to Add Regression Cases

Add a regression case when:

  • A model failure in production or a prior eval run reveals a gap not covered by any existing case.
  • A new document type, governing law, or instrument structure is added to Smartflow's supported scope.
  • A previously-passing field drops below threshold in a new eval run.
  • A new edge case is discovered during demo preparation, pilot testing, or scoping sessions.

Regression cases use status: active and must have Regression Note populated with the failure details. They are included in all future capability score calculations.