Covenant Identification Rubric
This rubric applies to all Covenant Monitoring evaluations. It scores covenant identification across five independent dimensions. All dimensions must be scored for a complete evaluation; the composite is a weighted average.
Covenant type taxonomy is defined in EVAL-STANDARDS.md Section 2.
1. Scoring Dimensions
| Dimension | Weight | Description |
|---|---|---|
| D1 — Coverage | 30% | Proportion of covenants in the ground truth that were found |
| D2 — Type classification accuracy | 20% | Each found covenant correctly classified by type (financial / information / negative / positive) |
| D3 — Threshold extraction accuracy | 25% | Threshold or obligation value correctly extracted verbatim |
| D4 — Testing frequency accuracy | 15% | Correct cadence (quarterly / semi-annual / annual / per the trigger) |
| D5 — Edge case handling | 10% | Waivers, carve-outs, grace periods — correctly flagged or correctly noted as absent |
2. Dimension Scoring Rules
D1 — Coverage
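Coverage measures recall: the proportion of ground-truth covenants that were successfully identified. The raw coverage ratio is computed first and then mapped to a banded D1 score using the table in this subsection.
$$\text{Coverage} = \frac{\text{Covenants correctly identified}}{\text{Total covenants in ground truth}}$$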
A covenant is "correctly identified" if its name and subject matter are recognisable in the output (even if threshold or type is wrong — those deductions apply in D3 and D2 respectively).
| Coverage | D1 Score |
|---|---|
| 100% (all covenants found) | 1.0 |
| ≥ 90% and < 100% | 0.9 |
| ≥ 75% and < 90% | 0.75 |
| ≥ 50% and < 75% | 0.50 |
| > 0% and < 50% | 0.25 |
| 0% (no covenants found) | 0.0 |
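The banding above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `d1_score` is hypothetical, and it assumes each band's lower bound is inclusive, so a fractional coverage such as 89.5% falls into the 0.75 band. Integer arithmetic is used for the comparisons to avoid floating-point edge effects at the band boundaries.

```python
def d1_score(found: int, total: int) -> float:
    """Map raw coverage (found / total) to the banded D1 score."""
    if total <= 0:
        raise ValueError("ground truth must contain at least one covenant")
    if not 0 <= found <= total:
        raise ValueError("found must be between 0 and total")
    if found == total:
        return 1.0   # all covenants found
    if found == 0:
        return 0.0   # no covenants found
    # Compare found/total against each band using integers only.
    if found * 100 >= 90 * total:
        return 0.9
    if found * 100 >= 75 * total:
        return 0.75
    if found * 100 >= 50 * total:
        return 0.50
    return 0.25      # some covenants found, but fewer than half
```

For example, 9 of 10 covenants found gives `d1_score(9, 10) == 0.9`, while 1 of 10 gives `0.25`.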
D2 — Type Classification Accuracy
Score each found covenant's type classification independently:
| Classification result | Score |
|---|---|
| Correct type (matches EVAL-STANDARDS taxonomy exactly) | 1.0 |
| Adjacent type (e.g., "maintenance" as a non-taxonomy label that maps to "financial") | 0.75 |
| Wrong type within taxonomy (e.g., financial classified as positive) | 0.0 |
| No type provided | 0.25 |
D2 = arithmetic mean of individual classification scores across all found covenants.
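As a rough sketch of the D2 rules, the snippet below scores each found covenant and averages the results. The names `ADJACENT_LABELS`, `classification_score`, and `d2_score` are hypothetical, and the adjacent-label mapping contains only the one example given in the table. Scoring an unrecognised non-taxonomy label as 0.0 is an assumption; the rubric does not address that case.

```python
from statistics import mean
from typing import Optional

# Hypothetical mapping of non-taxonomy labels to taxonomy types.
# Only the example from the table is included here.
ADJACENT_LABELS = {"maintenance": "financial"}

def classification_score(predicted: Optional[str], expected: str) -> float:
    """Score one covenant's type label against the ground-truth type."""
    if predicted is None or not predicted.strip():
        return 0.25                      # no type provided
    label = predicted.strip().lower()
    if label == expected:
        return 1.0                       # exact taxonomy match
    if ADJACENT_LABELS.get(label) == expected:
        return 0.75                      # adjacent, non-taxonomy label
    return 0.0                           # wrong type (assumed for unknown labels too)

def d2_score(pairs: list[tuple[Optional[str], str]]) -> float:
    """Arithmetic mean over all found covenants (pairs must be non-empty)."""
    return mean(classification_score(p, e) for p, e in pairs)
```

With one correct label, one adjacent label, one missing label, and one wrong label, the mean is (1.0 + 0.75 + 0.25 + 0.0) / 4 = 0.5.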
D3 — Threshold Extraction Accuracy
For financial covenants: score the extracted ratio or threshold value.
For information covenants: score the delivery deadline (in days).
For negative covenants: score the principal prohibition AND any quantitative carve-out caps.
For positive covenants: score the obligation description accuracy.
| Extraction result | Score |
|---|---|
| Verbatim value correct (e.g., "≤ 4.0×", "within 120 days") | 1.0 |
| Value correct but paraphrased (e.g., "4 times" instead of "4.0×") | 0.75 |
| Value partially correct (e.g., threshold correct but testing period wrong) | 0.50 |
| Carve-out cap extracted but with wrong value (e.g., "USD 20M" when source says USD 25M) | 0.0 for that carve-out |
| Value absent (covenant identified but threshold not extracted) | 0.25 |
| Value fabricated | 0.0 + hallucination flag |
D3 = arithmetic mean of individual threshold scores across all found covenants.
D4 — Testing Frequency Accuracy
| Result | Score |
|---|---|
| Exact cadence match (e.g., "quarterly — last day of each financial quarter") | 1.0 |
| Cadence correct but period description imprecise (e.g., "every 3 months" vs. "quarterly") | 0.75 |
| Cadence off by one level (e.g., quarterly reported as semi-annual) | 0.25 |
| Cadence off by more than one level (e.g., quarterly reported as annual) | 0.0 |
| Testing frequency not extracted | 0.25 |
| Testing frequency not applicable (positive/information covenant with event-based trigger) and correctly noted as "upon occurrence" | 1.0 |
D4 = arithmetic mean of individual frequency scores across applicable covenants.
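One way to read "off by one level" is as a distance on an ordered scale of cadences. The sketch below assumes the ordering quarterly, semi-annual, annual; the names `CADENCE_LEVEL` and `frequency_score` are hypothetical, and the `imprecise` flag stands in for a reviewer's judgement that the period description was loose. Combinations the table does not cover (for example, a periodic cadence reported for an event-based covenant) raise an error rather than guess a score.

```python
from typing import Optional

# Assumed ordering of periodic cadences, shortest period first.
CADENCE_LEVEL = {"quarterly": 0, "semi-annual": 1, "annual": 2}
EVENT_BASED = "upon occurrence"

def frequency_score(predicted: Optional[str], expected: str,
                    imprecise: bool = False) -> float:
    """Score one covenant's testing frequency per the D4 table."""
    if predicted is None:
        return 0.25                       # testing frequency not extracted
    if expected == EVENT_BASED and predicted == EVENT_BASED:
        return 1.0                        # event-based trigger correctly noted
    if predicted not in CADENCE_LEVEL or expected not in CADENCE_LEVEL:
        raise ValueError("combination not covered by the table; needs reviewer judgement")
    gap = abs(CADENCE_LEVEL[predicted] - CADENCE_LEVEL[expected])
    if gap == 0:
        return 0.75 if imprecise else 1.0
    return 0.25 if gap == 1 else 0.0      # one level off vs. more than one
```

For instance, `frequency_score("semi-annual", "quarterly")` returns 0.25, and `frequency_score("annual", "quarterly")` returns 0.0.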
D5 — Edge Case Handling
This dimension scores whether edge cases (waivers, carve-outs, grace periods) are correctly identified and handled. If no edge cases are present in the document, this dimension scores 1.0 (correct absence), provided the output does not fabricate one; a fabricated waiver scores 0.0.
| Edge case | Correct handling | Score |
|---|---|---|
| Waiver present — correctly identified as temporary with original + waived threshold + scope | Full correct handling | 1.0 |
| Waiver present — identified but characterised as permanent | Critical failure | 0.0 |
| Waiver present — identified but scope or thresholds incomplete | Partial | 0.5 |
| Waiver present — not identified | Miss | 0.0 |
| Carve-outs present — all carve-outs listed with caps | Full correct handling | 1.0 |
| Carve-outs present — listed without quantitative caps | Partial | 0.5 |
| Carve-outs present — not listed at all | Miss | 0.0 |
| Grace period present — correctly extracted with trigger and duration | Full correct handling | 1.0 |
| Grace period present — not extracted | Miss | 0.0 |
| No edge cases present AND system correctly states no waivers/carve-outs found | Correct absence | 1.0 |
| No edge cases present AND system fabricates a waiver | Hallucination | 0.0 |
D5 = arithmetic mean of individual edge case scores across all applicable edge cases.
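The interaction between the default score and the fabrication row can be sketched as below. The function name `d5_score` and the `fabricated` flag are hypothetical; the per-edge-case scores are assumed to have been assigned by a reviewer from the table above.

```python
from statistics import mean

def d5_score(edge_case_scores: list[float], fabricated: bool = False) -> float:
    """Average the per-edge-case scores, handling the no-edge-case document."""
    if not edge_case_scores:
        # No edge cases in the source: correct absence scores 1.0,
        # a fabricated waiver scores 0.0.
        return 0.0 if fabricated else 1.0
    return mean(edge_case_scores)
```

For example, a document with a fully handled waiver (1.0) and carve-outs listed without caps (0.5) gives `d5_score([1.0, 0.5]) == 0.75`.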
3. Composite Case Score
$$\text{Case Score} = 0.30 \times D1 + 0.20 \times D2 + 0.25 \times D3 + 0.15 \times D4 + 0.10 \times D5$$
Worked example (CM-004 waiver scenario)
| Dimension | Score | Weight | Weighted |
|---|---|---|---|
| D1 — Coverage | 1.0 (waiver covenant identified) | 0.30 | 0.30 |
| D2 — Type classification | 1.0 (correctly financial) | 0.20 | 0.20 |
| D3 — Threshold extraction | 0.75 (both thresholds captured; waiver period stated but one date off) | 0.25 | 0.1875 |
| D4 — Testing frequency | 1.0 (Q1 2026 testing date correct) | 0.15 | 0.15 |
| D5 — Edge case handling | 0.75 (waiver identified as temporary; reversion date not stated) | 0.10 | 0.075 |
Case Score = 0.30 + 0.20 + 0.1875 + 0.15 + 0.075 = 0.9125
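The composite can be checked with a few lines of Python. This is an illustrative sketch: the names `WEIGHTS` and `case_score` are hypothetical, while the weights and the CM-004 dimension scores are taken directly from the tables above.

```python
# Dimension weights from the Scoring Dimensions table.
WEIGHTS = {"D1": 0.30, "D2": 0.20, "D3": 0.25, "D4": 0.15, "D5": 0.10}

def case_score(dims: dict[str, float]) -> float:
    """Weighted sum of the five dimension scores; all five are required."""
    missing = WEIGHTS.keys() - dims.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

# Dimension scores from the CM-004 worked example.
cm004 = {"D1": 1.0, "D2": 1.0, "D3": 0.75, "D4": 1.0, "D5": 0.75}
print(round(case_score(cm004), 4))  # 0.9125
```

Rounding before display avoids showing floating-point noise in the last digits.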
4. False Positive Rate
False positives (covenants identified that are not in the ground truth) are penalised separately.
$$\text{False Positive Penalty} = \frac{\text{False positives identified}}{N_{\text{ground truth covenants}}} \times 0.25$$
This penalty is subtracted from the Case Score. The penalty is capped at 0.25 however many false positives are reported, and the adjusted Case Score cannot fall below 0.0.
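A minimal sketch of the penalty calculation follows. The function names are hypothetical, and flooring the adjusted score at 0.0 is one reading of the rule above rather than a confirmed implementation detail.

```python
def false_positive_penalty(false_positives: int, ground_truth_count: int) -> float:
    """Penalty = (false positives / ground-truth covenants) x 0.25, capped at 0.25."""
    if ground_truth_count <= 0:
        raise ValueError("ground truth must contain at least one covenant")
    if false_positives < 0:
        raise ValueError("false_positives cannot be negative")
    return min(false_positives / ground_truth_count * 0.25, 0.25)

def adjusted_case_score(case_score: float, false_positives: int,
                        ground_truth_count: int) -> float:
    """Subtract the penalty from the Case Score, never going below 0.0 (assumed floor)."""
    penalty = false_positive_penalty(false_positives, ground_truth_count)
    return max(case_score - penalty, 0.0)
```

For example, 2 false positives against 8 ground-truth covenants gives a penalty of 0.0625, and 20 false positives against the same 8 covenants hits the 0.25 cap.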
5. Per-Capability Score
$$\text{Covenant Monitoring Score} = \frac{\sum_c \text{Case Score}_c}{N_{\text{cases}}}$$
Minimum passing threshold: 0.85 (consistent with EVAL-STANDARDS.md Section 6).
6. Hallucination Override
Any fabricated threshold value or fabricated waiver/carve-out that has no basis in the source document sets the entire case score to 0.0. This overrides all other dimension scores.
A waiver incorrectly characterised as permanent (not fabricated, but a critical interpretation error) does not trigger the hallucination override but sets D5 = 0.0 and should be flagged as a Critical Failure in the evaluation log. Critical Failures block release regardless of overall score.
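The override and the Critical Failure rule can be combined into a simple gate, sketched below. `CaseResult`, `final_case_score`, and `capability_passes` are hypothetical names, and treating a Critical Failure as "does not pass" is an assumption about how "block release" is applied in practice.

```python
from dataclasses import dataclass
from statistics import mean

PASS_THRESHOLD = 0.85  # minimum passing threshold from Section 5

@dataclass
class CaseResult:
    """Hypothetical per-case record produced by an evaluator."""
    case_score: float               # composite score, after any false positive penalty
    hallucination: bool = False     # fabricated threshold, waiver, or carve-out
    critical_failure: bool = False  # e.g. waiver characterised as permanent

def final_case_score(result: CaseResult) -> float:
    """Apply the hallucination override: any fabrication zeroes the case."""
    return 0.0 if result.hallucination else result.case_score

def capability_passes(results: list[CaseResult]) -> bool:
    """Critical Failures block release; otherwise compare the mean to the threshold."""
    if any(r.critical_failure for r in results):
        return False
    scores = [final_case_score(r) for r in results]
    return mean(scores) >= PASS_THRESHOLD
```

Note that a single hallucinated case drags the mean down sharply, while a Critical Failure blocks release even when the mean stays above 0.85.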