Eval Run Report: [run_id]
Date: YYYY-MM-DD
Model / Pipeline version: [version]
Case set: [all / capability name / list of case IDs]
Evaluator: [team or individual name]
Overall status: PASS / FAIL / PARTIAL
Summary
Capability-level pass/fail at a glance. A capability passes only if all three conditions hold: (a) average case score ≥ 0.85, (b) no Tier 1 field scores below 0.70, and (c) hallucination rate = 0.0.
| Capability | Cases run | Avg. case score | Provenance score | Hallucinations | Status |
|---|---|---|---|---|---|
| Loan Onboarding | 4 | — | — | — | PASS / FAIL |
| Covenant Monitoring | 4 | — | — | — | PASS / FAIL |
| BMT Validation | 3 | — | — | — | PASS / FAIL |
| Document Q&A | 4 | — | — | — | PASS / FAIL |
| Overall | 15 | — | — | — | PASS / FAIL |
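The pass rule stated in the Summary can be checked mechanically. The following is a minimal sketch, assuming the per-capability aggregates are already computed; the `CapabilityResult` type, its field names, and the `capability_passes` helper are hypothetical and not part of any existing pipeline. How the hallucination rate is derived is not defined in this template, so it is taken here as a precomputed input.

```python
from dataclasses import dataclass


@dataclass
class CapabilityResult:
    """Hypothetical container for one row of the Summary table."""
    avg_case_score: float                 # "Avg. case score" column
    tier1_field_scores: dict[str, float]  # field name -> avg. score across cases
    hallucination_rate: float             # assumed to be precomputed upstream


def capability_passes(result: CapabilityResult,
                      score_threshold: float = 0.85,
                      tier1_floor: float = 0.70) -> bool:
    """Apply conditions (a), (b), and (c) from the Summary; all must hold."""
    score_ok = result.avg_case_score >= score_threshold
    # A capability with no Tier 1 fields trivially satisfies condition (b).
    tier1_ok = all(s >= tier1_floor for s in result.tier1_field_scores.values())
    no_hallucinations = result.hallucination_rate == 0.0
    return score_ok and tier1_ok and no_hallucinations


# Example with made-up numbers: passes (a) and (c) but fails (b).
example = CapabilityResult(
    avg_case_score=0.91,
    tier1_field_scores={"Borrower": 0.95, "Maturity Date": 0.65},
    hallucination_rate=0.0,
)
status = "PASS" if capability_passes(example) else "FAIL"
```

Because the three conditions are combined with a logical AND, a high average score cannot compensate for a single weak Tier 1 field or for any hallucination.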
Detailed Results
Loan Onboarding
| Case ID | Case name | Case score | Provenance score | Hallucination | Notes |
|---|---|---|---|---|---|
| LO-001 | Standard APLMA | — | — | Y/N | |
| LO-002 | Amendment Notice | — | — | Y/N | |
| LO-003 | Multi-tranche | — | — | Y/N | |
| LO-004 | Edge — Sparse Doc | — | — | Y/N | |
| Avg. | | — | — | | |
Tier 1 field scores (LO capability):
| Field | Avg. score across cases | Pass (≥ 0.70)? |
|---|---|---|
| Borrower | — | — |
| Facility Amount | — | — |
| Currency | — | — |
| Maturity Date | — | — |
| Margin / Spread | — | — |
Covenant Monitoring
| Case ID | Case name | D1 Coverage | D2 Type | D3 Threshold | D4 Frequency | D5 Edge cases | Case score | Notes |
|---|---|---|---|---|---|---|---|---|
| CM-001 | Financial Covenants | — | — | — | — | — | — | |
| CM-002 | Information Covenants | — | — | — | — | — | — | |
| CM-003 | Negative Covenants | — | — | — | — | — | — | |
| CM-004 | Waiver Scenario | — | — | — | — | — | — | |
| Avg. | | — | — | — | — | — | — | |
BMT Validation
| Case ID | Case name | Case score | False positives | False negatives | Notes |
|---|---|---|---|---|---|
| BM-001 | Market-Standard Deal | — | — | — | |
| BM-002 | Deviation — Pricing | — | — | — | |
| BM-003 | Novel Structure | — | — | — | |
| Avg. | | — | — | — | |
Document Q&A
| Case ID | Case name | D1 Accuracy | D2 Citation | D3 Scope | D4 Uncertainty | D5 Refusal | Case score | Notes |
|---|---|---|---|---|---|---|---|---|
| QA-001 | Factual Retrieval | — | — | — | — | — | — | |
| QA-002 | Cross-clause Reasoning | — | — | — | — | — | — | |
| QA-003 | Ambiguous Term | — | — | — | — | — | — | |
| QA-004 | Out-of-scope Question | — | — | — | — | — | — | |
| Avg. | | — | — | — | — | — | — | |
Grounded refusal accuracy: — / 1 (out-of-scope cases) = — (threshold: ≥ 0.95)
Regressions
List every case whose score is lower than in the most recent previous run of the same case. If this is the first run, mark this section N/A.
| Case ID | Previous score | Current score | Delta | Description of regression |
|---|---|---|---|---|
| — | — | — | — | — |
Previous run reference: [run_id of last comparable run, or "N/A — first run"]
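The rows of the Regressions table can be derived by comparing per-case scores between two runs. This is a sketch only, assuming each run is available as a mapping from case ID to case score; the `find_regressions` helper and the dictionary layout are hypothetical.

```python
def find_regressions(previous: dict[str, float],
                     current: dict[str, float]) -> list[tuple[str, float, float, float]]:
    """Return (case_id, previous_score, current_score, delta) for each regressed case.

    A case regresses when its current score is strictly lower than its score
    in the previous run. Cases with no previous score are skipped, since
    there is nothing to compare against.
    """
    rows = []
    for case_id in sorted(current):
        prev = previous.get(case_id)
        cur = current[case_id]
        if prev is not None and cur < prev:
            # Round the delta so floating-point noise does not leak into the report.
            rows.append((case_id, prev, cur, round(cur - prev, 4)))
    return rows


# Example with made-up scores: only LO-001 dropped; BM-003 is new in this run.
previous_run = {"LO-001": 0.90, "QA-004": 0.80}
current_run = {"LO-001": 0.85, "QA-004": 0.80, "BM-003": 0.50}
regressions = find_regressions(previous_run, current_run)
```

On a first run, passing an empty mapping for `previous` yields an empty list, which corresponds to marking the section N/A.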
Hallucination Log
List every hallucination detected in this run. Each must have a root-cause note before the next run is permitted.
| Case ID | Field / context | Hallucinated value | Root cause (preliminary) | Resolved? |
|---|---|---|---|---|
| — | — | — | — | — |
If no hallucinations: None detected in this run.
Action Items
Actions required before the next eval run or before release, depending on severity.
| # | Severity | Description | Owner | Due date |
|---|---|---|---|---|
| 1 | — | [Action description] | — | — |
Severity levels:
- Release-blocking: Must be resolved before any production release.
- High: Must be resolved before the next eval run.
- Medium: Address within current sprint.
- Low: Log and address in next planning cycle.
Evaluator Notes
[Free-form observations from the evaluator: document quality issues, unexpected model behaviour, edge cases not covered by current eval cases, recommendations for new regression cases.]
Sign-off
| Role | Name | Date | Sign-off |
|---|---|---|---|
| Product | — | — | — |
| QA | — | — | — |
| Engineering | — | — | — |