Covenant Identification Rubric

This rubric applies to all Covenant Monitoring evaluations. It scores covenant identification across five independent dimensions. All dimensions must be scored for a complete evaluation; the composite is a weighted average.

Covenant type taxonomy is defined in EVAL-STANDARDS.md Section 2.

1. Scoring Dimensions

Dimension	Weight	Description
D1 — Coverage	30%	Proportion of covenants in the ground truth that were found
D2 — Type classification accuracy	20%	Each found covenant correctly classified by type (financial / information / negative / positive)
D3 — Threshold extraction accuracy	25%	Threshold or obligation value correctly extracted verbatim
D4 — Testing frequency accuracy	15%	Correct cadence (quarterly / semi-annual / annual / per the trigger)
D5 — Edge case handling	10%	Waivers, carve-outs, grace periods — correctly flagged or correctly noted as absent

2. Dimension Scoring Rules

D1 — Coverage

Coverage measures recall: how many of the covenants in the ground truth were successfully identified?

$$D1 = \frac{\text{Covenants correctly identified}}{\text{Total covenants in ground truth}}$$

A covenant is "correctly identified" if its name and subject matter are recognisable in the output (even if threshold or type is wrong — those deductions apply in D3 and D2 respectively).

Coverage	D1 Score
100% (all covenants found)	1.0
90–99%	0.9
75–89%	0.75
50–74%	0.50
Above 0% but below 50%	0.25
0% (no covenants found)	0.0
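
The banding above can be sketched as a small lookup. This is a minimal illustration rather than a reference implementation: the function name `d1_score` is hypothetical, and two readings are assumed here: the lowest non-zero band excludes 0% (because 0% has its own row), and fractional coverage between 99% and 100% falls in the 0.9 band.

```python
def d1_score(found: int, total: int) -> float:
    """Map coverage (recall against the ground truth) onto the banded D1 score."""
    if total <= 0:
        raise ValueError("ground truth must contain at least one covenant")
    if not 0 <= found <= total:
        raise ValueError("found must be between 0 and total")
    # Integer comparisons (found * 100 vs. band * total) avoid float rounding
    # at the band edges.
    scaled = found * 100
    if found == total:
        return 1.0
    if scaled >= 90 * total:
        return 0.9
    if scaled >= 75 * total:
        return 0.75
    if scaled >= 50 * total:
        return 0.50
    if found > 0:
        return 0.25  # assumed: this band excludes exactly 0%
    return 0.0
```

For instance, 8 of 10 ground-truth covenants found is 80% coverage, which lands in the 75–89% band.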

D2 — Type Classification Accuracy

Score each found covenant's type classification independently:

Classification result	Score
Correct type (matches EVAL-STANDARDS taxonomy exactly)	1.0
Adjacent type (e.g., "maintenance" as a non-taxonomy label that maps to "financial")	0.75
Wrong type within taxonomy (e.g., financial classified as positive)	0.0
No type provided	0.25

D2 = arithmetic mean of individual classification scores across all found covenants.
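
As a rough sketch of the averaging step, assuming each found covenant has already been assigned one of four outcome labels by a reviewer. The label strings and the name `d2_score` are hypothetical, and raising an error when no covenants were found is an assumption, since the rubric does not define D2 for that case.

```python
from statistics import mean

# Per-covenant scores from the D2 table; the keys are illustrative labels.
D2_SCORES = {
    "correct": 1.0,    # matches the taxonomy exactly
    "adjacent": 0.75,  # non-taxonomy label that maps to the right type
    "wrong": 0.0,      # wrong type within the taxonomy
    "missing": 0.25,   # no type provided
}

def d2_score(results: list[str]) -> float:
    """Average the classification scores over all found covenants."""
    if not results:
        # Assumption: D2 is treated as undefined when nothing was found.
        raise ValueError("D2 is undefined when no covenants were found")
    return mean(D2_SCORES[r] for r in results)
```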

D3 — Threshold Extraction Accuracy

For financial covenants: score the extracted ratio or threshold value.
For information covenants: score the delivery deadline (in days).
For negative covenants: score the principal prohibition AND any quantitative carve-out caps.
For positive covenants: score the obligation description accuracy.

Extraction result	Score
Verbatim value correct (e.g., "≤ 4.0×", "within 120 days")	1.0
Value correct but paraphrased (e.g., "4 times" instead of "4.0×")	0.75
Value partially correct (e.g., threshold correct but testing period wrong)	0.50
Carve-out cap extracted but with wrong value (e.g., "USD 20M" when source says USD 25M)	0.0 for that carve-out
Value absent (covenant identified but threshold not extracted)	0.25
Value fabricated	0.0 + hallucination flag

D3 = arithmetic mean of individual threshold scores across all found covenants.
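
The sketch below shows one way to combine the per-covenant extraction scores with the hallucination flag. It simplifies the rubric by treating each covenant (or carve-out cap) as a single scored item; the outcome labels and the name `d3_score` are hypothetical.

```python
from statistics import mean

# Per-item scores from the D3 table; the keys are illustrative labels.
D3_SCORES = {
    "verbatim": 1.0,
    "paraphrased": 0.75,
    "partial": 0.50,
    "absent": 0.25,
    "wrong_carve_out_cap": 0.0,
    "fabricated": 0.0,
}

def d3_score(results: list[str]) -> tuple[float, bool]:
    """Return (D3, hallucination_flag) for the scored items of one case."""
    if not results:
        # Assumption: D3 is treated as undefined when nothing was found.
        raise ValueError("D3 is undefined when no covenants were found")
    hallucination_flag = "fabricated" in results
    return mean(D3_SCORES[r] for r in results), hallucination_flag
```

Returning the flag alongside the score lets a caller apply any case-level consequence of a fabricated value separately from the dimension average.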

D4 — Testing Frequency Accuracy

Result	Score
Exact cadence match (e.g., "quarterly — last day of each financial quarter")	1.0
Cadence correct but period description imprecise (e.g., "every 3 months" vs. "quarterly")	0.75
Cadence off by one level (e.g., quarterly reported as semi-annual)	0.25
Cadence off by more than one level (e.g., quarterly reported as annual)	0.0
Testing frequency not extracted	0.25
Testing frequency not applicable (positive/information covenant with event-based trigger) and correctly noted as "upon occurrence"	1.0

D4 = arithmetic mean of individual frequency scores across applicable covenants.
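
One possible encoding of the "off by one level" rule is to rank the cadences and compare ranks. The ordering quarterly < semi-annual < annual, the `imprecise` flag (set by a reviewer), and the 0.0 score for a mismatch involving an event-based trigger are all assumptions made for this sketch; the rubric only defines the correctly noted "upon occurrence" case.

```python
from typing import Optional

# Assumed ranking of periodic cadences, used to measure "levels" apart.
CADENCE_LEVEL = {"quarterly": 0, "semi-annual": 1, "annual": 2}
EVENT_BASED = "upon occurrence"

def d4_item_score(expected: str, reported: Optional[str],
                  imprecise: bool = False) -> float:
    """Score one covenant's testing frequency against the ground truth."""
    if reported is None:
        return 0.25  # testing frequency not extracted
    if expected == EVENT_BASED or reported == EVENT_BASED:
        # Assumption: any mismatch involving an event-based trigger scores 0.0.
        return 1.0 if expected == reported else 0.0
    gap = abs(CADENCE_LEVEL[expected] - CADENCE_LEVEL[reported])
    if gap == 0:
        return 0.75 if imprecise else 1.0
    return 0.25 if gap == 1 else 0.0
```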

D5 — Edge Case Handling

This dimension scores whether edge cases (waivers, carve-outs, grace periods) are correctly identified and handled. If no edge cases are present in the document and the output reports none, this dimension scores 1.0 (correct absence); fabricating an edge case in that situation scores 0.0.

Edge case scenario	Outcome	Score
Waiver present — correctly identified as temporary with original + waived threshold + scope	Full correct handling	1.0
Waiver present — identified but characterised as permanent	Critical failure	0.0
Waiver present — identified but scope or thresholds incomplete	Partial	0.5
Waiver present — not identified	Miss	0.0
Carve-outs present — all carve-outs listed with caps	Full correct handling	1.0
Carve-outs present — listed without quantitative caps	Partial	0.5
Carve-outs present — not listed at all	Miss	0.0
Grace period present — correctly extracted with trigger and duration	Full correct handling	1.0
Grace period present — not extracted	Miss	0.0
No edge cases present AND system correctly states no waivers/carve-outs found	Correct absence	1.0
No edge cases present AND system fabricates a waiver	Hallucination	0.0

D5 = arithmetic mean of individual edge case scores across all applicable edge cases.
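
A minimal sketch of the D5 aggregation, including the no-edge-case default. The name `d5_score` and the `fabricated` parameter are hypothetical; the per-item scores are assumed to have been read off the table above by a reviewer.

```python
from statistics import mean

def d5_score(item_scores: list[float], fabricated: bool = False) -> float:
    """Average edge case scores, or apply the no-edge-case rule.

    item_scores holds one score per edge case present in the document.
    fabricated is True when the output invents an edge case that is not there.
    """
    if not item_scores:
        # No edge cases in the document: correct absence scores 1.0,
        # a fabricated waiver scores 0.0.
        return 0.0 if fabricated else 1.0
    return mean(item_scores)
```

For example, a fully handled carve-out (1.0) alongside a waiver with incomplete scope (0.5) averages to 0.75.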

3. Composite Case Score

$$\text{Case Score} = 0.30 \times D1 + 0.20 \times D2 + 0.25 \times D3 + 0.15 \times D4 + 0.10 \times D5$$

Worked example (CM-004 waiver scenario)

Dimension	Score	Weight	Weighted
D1 — Coverage	1.0 (waiver covenant identified)	0.30	0.30
D2 — Type classification	1.0 (correctly financial)	0.20	0.20
D3 — Threshold extraction	0.75 (both thresholds captured; waiver period stated but one date off)	0.25	0.1875
D4 — Testing frequency	1.0 (Q1 2026 testing date correct)	0.15	0.15
D5 — Edge case handling	0.75 (waiver identified as temporary; reversion date not stated)	0.10	0.075

Case Score = 0.30 + 0.20 + 0.1875 + 0.15 + 0.075 = 0.9125
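
The same arithmetic can be reproduced in a few lines. This is an illustrative sketch only; `WEIGHTS` and `case_score` are hypothetical names, and the inputs are the CM-004 dimension scores from the table above.

```python
# Dimension weights from Section 1.
WEIGHTS = {"D1": 0.30, "D2": 0.20, "D3": 0.25, "D4": 0.15, "D5": 0.10}

def case_score(dims: dict[str, float]) -> float:
    """Weighted sum of the five dimension scores."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

cm004 = {"D1": 1.0, "D2": 1.0, "D3": 0.75, "D4": 1.0, "D5": 0.75}
print(round(case_score(cm004), 4))  # 0.9125
```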

4. False Positive Rate

False positives (covenants identified that are not in the ground truth) are penalised separately.

$$\text{False Positive Penalty} = \frac{\text{False positives identified}}{N_{\text{ground truth covenants}}} \times 0.25$$

This penalty is subtracted from the Case Score. The penalty is capped at 0.25, and the penalised Case Score is floored at 0.0, so a case with many false positives and little or no true positive coverage cannot fall below zero.
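
A short sketch of the penalty, under one reading of the text: the penalty itself is capped at 0.25, and the penalised score is floored at 0.0. The floor and both function names are assumptions made for illustration.

```python
def false_positive_penalty(false_positives: int, n_ground_truth: int) -> float:
    """Penalty = (false positives / ground truth covenants) x 0.25, capped at 0.25."""
    if n_ground_truth <= 0:
        raise ValueError("ground truth must contain at least one covenant")
    return min(0.25, false_positives / n_ground_truth * 0.25)

def penalised_case_score(case_score: float, false_positives: int,
                         n_ground_truth: int) -> float:
    """Subtract the penalty from the Case Score (assumed floor at 0.0)."""
    penalty = false_positive_penalty(false_positives, n_ground_truth)
    return max(0.0, case_score - penalty)
```

With 2 false positives against 8 ground-truth covenants, the penalty is 2/8 × 0.25 = 0.0625.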

5. Per-Capability Score

$$\text{Covenant Monitoring Score} = \frac{\sum_c \text{Case Score}_c}{N_{\text{cases}}}$$

Minimum passing threshold: 0.85 (consistent with EVAL-STANDARDS.md Section 6).
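
As an illustration, the capability score is a plain mean over case scores compared against the threshold. Treating the threshold as inclusive (a score of exactly 0.85 passes) is an assumption, and the function names are hypothetical.

```python
from statistics import mean

PASS_THRESHOLD = 0.85  # minimum passing threshold stated above

def covenant_monitoring_score(case_scores: list[float]) -> float:
    """Unweighted mean of the case scores for this capability."""
    if not case_scores:
        raise ValueError("at least one case score is required")
    return mean(case_scores)

def passes(case_scores: list[float]) -> bool:
    """Assumed inclusive comparison against the passing threshold."""
    return covenant_monitoring_score(case_scores) >= PASS_THRESHOLD
```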

6. Hallucination Override

Any fabricated threshold value or fabricated waiver/carve-out that has no basis in the source document sets the entire case score to 0.0. This overrides all other dimension scores.

A waiver incorrectly characterised as permanent (not fabricated, but a critical interpretation error) does not trigger the hallucination override but sets D5 = 0.0 and should be flagged as a Critical Failure in the evaluation log. Critical Failures block release regardless of overall score.
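
The two rules above interact with the release decision roughly as sketched below. This is a hedged illustration: `CaseResult`, `apply_overrides`, and `release_blocked` are hypothetical names, and the D5 = 0.0 adjustment for a waiver marked permanent is assumed to have been applied already when the Case Score was computed.

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    score: float
    critical_failure: bool

def apply_overrides(case_score: float, hallucination: bool,
                    waiver_marked_permanent: bool) -> CaseResult:
    """Apply the hallucination override and record any Critical Failure."""
    # A fabricated threshold, waiver, or carve-out zeroes the whole case.
    final = 0.0 if hallucination else case_score
    # A waiver characterised as permanent does not zero the case,
    # but it is logged as a Critical Failure.
    return CaseResult(score=final, critical_failure=waiver_marked_permanent)

def release_blocked(results: list[CaseResult], capability_score: float,
                    threshold: float = 0.85) -> bool:
    """Release is blocked by a low score or by any Critical Failure."""
    return capability_score < threshold or any(r.critical_failure for r in results)
```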
