
Eval Run Report: [run_id]

Date: YYYY-MM-DD
Model / Pipeline version: [version]
Case set: [all / capability name / list of case IDs]
Evaluator: [team or individual name]
Overall status: PASS / FAIL / PARTIAL


Summary

Capability-level pass/fail at a glance. A capability passes if: (a) score ≥ 0.85, (b) no Tier 1 field below 0.70, and (c) hallucination rate = 0.0.
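
The three conditions above can be checked mechanically. The following is a minimal sketch, not the pipeline's actual implementation: the `CapabilityResult` type and its field names are hypothetical, and it assumes that "score" in condition (a) means the capability's average case score.

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 0.85   # condition (a)
TIER1_THRESHOLD = 0.70   # condition (b)


@dataclass
class CapabilityResult:
    """Hypothetical container for one capability's aggregated results."""
    avg_case_score: float                # assumed to be the "score" in (a)
    tier1_field_scores: dict[str, float]  # field name -> avg. score across cases
    hallucination_rate: float


def capability_passes(result: CapabilityResult) -> bool:
    """Return True only if all three pass conditions hold."""
    score_ok = result.avg_case_score >= SCORE_THRESHOLD
    tier1_ok = all(
        score >= TIER1_THRESHOLD for score in result.tier1_field_scores.values()
    )
    no_hallucinations = result.hallucination_rate == 0.0
    return score_ok and tier1_ok and no_hallucinations
```

A capability with no Tier 1 fields recorded satisfies condition (b) trivially in this sketch; whether that is the desired behaviour is a decision for the evaluator.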

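| Capability | Cases run | Avg. case score | Provenance score | Hallucinations | Status |
|---|---|---|---|---|---|
| Loan Onboarding | 4 | | | | PASS / FAIL |
| Covenant Monitoring | 4 | | | | PASS / FAIL |
| BMT Validation | 3 | | | | PASS / FAIL |
| Document Q&A | 4 | | | | PASS / FAIL |
| Overall | 15 | | | | PASS / FAIL |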

Detailed Results

Loan Onboarding

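| Case ID | Case name | Case score | Provenance score | Hallucination | Notes |
|---|---|---|---|---|---|
| LO-001 | Standard APLMA | | | Y/N | |
| LO-002 | Amendment Notice | | | Y/N | |
| LO-003 | Multi-tranche | | | Y/N | |
| LO-004 | Edge — Sparse Doc | | | Y/N | |
| Avg. | | | | | |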

Tier 1 field scores (LO capability):

| Field | Avg. score across cases | Pass (≥ 0.70)? |
|---|---|---|
| Borrower | | |
| Facility Amount | | |
| Currency | | |
| Maturity Date | | |
| Margin / Spread | | |

Covenant Monitoring

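| Case ID | Case name | D1 Coverage | D2 Type | D3 Threshold | D4 Frequency | D5 Edge cases | Case score | Notes |
|---|---|---|---|---|---|---|---|---|
| CM-001 | Financial Covenants | | | | | | | |
| CM-002 | Information Covenants | | | | | | | |
| CM-003 | Negative Covenants | | | | | | | |
| CM-004 | Waiver Scenario | | | | | | | |
| Avg. | | | | | | | | |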

BMT Validation

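| Case ID | Case name | Case score | False positives | False negatives | Notes |
|---|---|---|---|---|---|
| BM-001 | Market-Standard Deal | | | | |
| BM-002 | Deviation — Pricing | | | | |
| BM-003 | Novel Structure | | | | |
| Avg. | | | | | |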

Document Q&A

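| Case ID | Case name | D1 Accuracy | D2 Citation | D3 Scope | D4 Uncertainty | D5 Refusal | Case score | Notes |
|---|---|---|---|---|---|---|---|---|
| QA-001 | Factual Retrieval | | | | | | | |
| QA-002 | Cross-clause Reasoning | | | | | | | |
| QA-003 | Ambiguous Term | | | | | | | |
| QA-004 | Out-of-scope Question | | | | | | | |
| Avg. | | | | | | | | |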

Grounded refusal accuracy: — / 1 (out-of-scope cases) = — (threshold: ≥ 0.95)
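
The ratio on the line above can be computed as in the sketch below. The function names are hypothetical, and the sketch assumes the metric is the number of correctly refused out-of-scope questions divided by the number of out-of-scope cases.

```python
from typing import Optional

REFUSAL_THRESHOLD = 0.95  # threshold stated in the report


def grounded_refusal_accuracy(
    correct_refusals: int, out_of_scope_cases: int
) -> Optional[float]:
    """Fraction of out-of-scope cases that were correctly refused.

    Returns None when the run contains no out-of-scope cases, so the
    metric is reported as not applicable instead of dividing by zero.
    """
    if out_of_scope_cases == 0:
        return None
    return correct_refusals / out_of_scope_cases


def meets_refusal_threshold(accuracy: Optional[float]) -> bool:
    """True if the metric was computed and reaches the threshold."""
    return accuracy is not None and accuracy >= REFUSAL_THRESHOLD
```

With a single out-of-scope case in the set (QA-004), the metric can only take the values 0.0 or 1.0, so the 0.95 threshold is met only when that case is refused correctly.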


Regressions

List every case whose score is lower than in the most recent previous run of the same case. If this is the first run, this section is N/A.

| Case ID | Previous score | Current score | Delta | Description of regression |
|---|---|---|---|---|

Previous run reference: [run_id of last comparable run, or "N/A — first run"]
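
Rows for the regressions table can be derived by comparing per-case scores between the two runs. This is a minimal sketch under stated assumptions: the function name and the `dict` inputs (case ID mapped to case score) are hypothetical, and cases with no score in the previous run are skipped because they have nothing to regress from.

```python
def find_regressions(
    previous: dict[str, float], current: dict[str, float]
) -> list[tuple[str, float, float, float]]:
    """Return (case_id, previous_score, current_score, delta) per regression.

    A regression is any case whose current score is strictly lower than
    its score in the previous run. Delta is current minus previous, so it
    is always negative for the rows returned.
    """
    rows = []
    for case_id, current_score in sorted(current.items()):
        previous_score = previous.get(case_id)
        if previous_score is not None and current_score < previous_score:
            delta = round(current_score - previous_score, 4)
            rows.append((case_id, previous_score, current_score, delta))
    return rows
```

Passing an empty `previous` mapping yields no rows, which matches the first-run case where this section is N/A.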


Hallucination Log

List every hallucination detected in this run. Each must have a root-cause note before the next run is permitted.

| Case ID | Field / context | Hallucinated value | Root cause (preliminary) | Resolved? |
|---|---|---|---|---|

If no hallucinations: None detected in this run.
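
The rule that every logged hallucination needs a root-cause note before the next run can be enforced with a simple gate. The sketch below is illustrative only: `HallucinationEntry` and both function names are hypothetical, and it treats an empty or whitespace-only note as missing.

```python
from dataclasses import dataclass


@dataclass
class HallucinationEntry:
    """Hypothetical record mirroring one row of the hallucination log."""
    case_id: str
    context: str
    hallucinated_value: str
    root_cause: str = ""
    resolved: bool = False


def entries_missing_root_cause(log: list[HallucinationEntry]) -> list[str]:
    """Return the case IDs of entries that have no root-cause note yet."""
    return [entry.case_id for entry in log if not entry.root_cause.strip()]


def next_run_permitted(log: list[HallucinationEntry]) -> bool:
    """The next run is allowed only when every entry has a root-cause note."""
    return not entries_missing_root_cause(log)
```

The gate checks only for a root-cause note, as the rule above states; it does not require the entry to be marked resolved.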


Action Items

Actions required before the next eval run or before release, depending on severity.

| # | Severity | Description | Owner | Due date |
|---|---|---|---|---|
| 1 | | [Action description] | | |

Severity levels:

  • Release-blocking: Must be resolved before any production release.
  • High: Must be resolved before the next eval run.
  • Medium: Address within current sprint.
  • Low: Log and address in next planning cycle.
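
The severity levels can be encoded so that open action items gate a release or the next run automatically. This is a minimal sketch that follows the definitions above literally; the `Severity` enum and function names are hypothetical, and whether a High item should also block a release is left to the team.

```python
from enum import Enum


class Severity(Enum):
    RELEASE_BLOCKING = "Release-blocking"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


def blocks_release(open_items: list[Severity]) -> bool:
    """Any open release-blocking item prevents a production release."""
    return any(item is Severity.RELEASE_BLOCKING for item in open_items)


def blocks_next_run(open_items: list[Severity]) -> bool:
    """Any open High item prevents the next eval run."""
    return any(item is Severity.HIGH for item in open_items)
```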

Evaluator Notes

[Free-form observations from the evaluator: document quality issues, unexpected model behaviour, edge cases not covered by current eval cases, recommendations for new regression cases.]


Sign-off

| Role | Name | Date | Sign-off |
|---|---|---|---|
| Product | | | |
| QA | | | |
| Engineering | | | |