Covenant Identification Rubric

This rubric applies to all Covenant Monitoring evaluations. It scores covenant identification across five independent dimensions. All dimensions must be scored for a complete evaluation; the composite is a weighted average.

Covenant type taxonomy is defined in EVAL-STANDARDS.md Section 2.


1. Scoring Dimensions

| Dimension | Weight | Description |
|---|---|---|
| D1: Coverage | 30% | Proportion of covenants in the ground truth that were found |
| D2: Type classification accuracy | 20% | Each found covenant correctly classified by type (financial / information / negative / positive) |
| D3: Threshold extraction accuracy | 25% | Threshold or obligation value correctly extracted verbatim |
| D4: Testing frequency accuracy | 15% | Correct cadence (quarterly / semi-annual / annual / per the trigger) |
| D5: Edge case handling | 10% | Waivers, carve-outs, grace periods: correctly flagged or correctly noted as absent |

2. Dimension Scoring Rules

D1 — Coverage

Coverage measures recall: how many of the covenants in the ground truth were successfully identified?

$$\text{Coverage} = \frac{\text{Covenants correctly identified}}{\text{Total covenants in ground truth}}$$

A covenant is "correctly identified" if its name and subject matter are recognisable in the output (even if threshold or type is wrong — those deductions apply in D3 and D2 respectively).

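The coverage ratio maps to a banded D1 score:

| Coverage | D1 Score |
|---|---|
| 100% (all covenants found) | 1.0 |
| 90–99% | 0.9 |
| 75–89% | 0.75 |
| 50–74% | 0.50 |
| < 50% (at least one covenant found) | 0.25 |
| 0% (no covenants found) | 0.0 |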
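
The banding above can be sketched in Python as follows. The function name `d1_score` is hypothetical, and the sketch assumes each band's lower bound is inclusive (so a coverage of 89.5% falls in the 75–89% band), which the table does not state explicitly.

```python
def d1_score(found: int, ground_truth_total: int) -> float:
    """Map coverage (recall against the ground truth) to a banded D1 score."""
    if ground_truth_total <= 0:
        raise ValueError("ground truth must contain at least one covenant")
    if not 0 <= found <= ground_truth_total:
        raise ValueError("found must be between 0 and ground_truth_total")

    if found == 0:
        return 0.0  # no covenants found
    if found == ground_truth_total:
        return 1.0  # all covenants found

    coverage = found / ground_truth_total
    # Assumption: lower bounds are inclusive for fractional coverage values.
    if coverage >= 0.90:
        return 0.9
    if coverage >= 0.75:
        return 0.75
    if coverage >= 0.50:
        return 0.50
    return 0.25
```

For instance, finding 7 of 8 ground truth covenants gives a coverage of 87.5%, which this sketch scores as 0.75.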

D2 — Type Classification Accuracy

Score each found covenant's type classification independently:

| Classification result | Score |
|---|---|
| Correct type (matches EVAL-STANDARDS taxonomy exactly) | 1.0 |
| Adjacent type (e.g., "maintenance" as a non-taxonomy label that maps to "financial") | 0.75 |
| Wrong type within taxonomy (e.g., financial classified as positive) | 0.0 |
| No type provided | 0.25 |

D2 = arithmetic mean of individual classification scores across all found covenants.
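
As an illustration, the D2 mean could be computed as in the sketch below. The result labels (`"correct"`, `"adjacent"`, `"wrong"`, `"missing"`) and the function name are hypothetical shorthand for the four table rows. The rubric does not say what D2 should be when no covenants were found, so the sketch raises an error in that case rather than guessing.

```python
from statistics import mean

# Per-covenant classification scores, keyed by hypothetical result labels.
D2_SCORES = {
    "correct": 1.0,    # matches the taxonomy exactly
    "adjacent": 0.75,  # non-taxonomy label that maps to the right type
    "wrong": 0.0,      # wrong type within the taxonomy
    "missing": 0.25,   # no type provided
}


def d2_score(results: list[str]) -> float:
    """Arithmetic mean of classification scores across all found covenants."""
    if not results:
        # Not specified by the rubric; treated here as a caller error.
        raise ValueError("D2 is undefined when no covenants were found")
    return mean([D2_SCORES[r] for r in results])
```

With one covenant in each category, `d2_score(["correct", "adjacent", "wrong", "missing"])` averages 1.0, 0.75, 0.0 and 0.25 to give 0.5.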

D3 — Threshold Extraction Accuracy

For financial covenants: score the extracted ratio or threshold value.
For information covenants: score the delivery deadline (in days).
For negative covenants: score the principal prohibition AND any quantitative carve-out caps.
For positive covenants: score the obligation description accuracy.

| Extraction result | Score |
|---|---|
| Verbatim value correct (e.g., "≤ 4.0×", "within 120 days") | 1.0 |
| Value correct but paraphrased (e.g., "4 times" instead of "4.0×") | 0.75 |
| Value partially correct (e.g., threshold correct but testing period wrong) | 0.50 |
| Carve-out cap extracted but with wrong value (e.g., "USD 20M" when source says USD 25M) | 0.0 for that carve-out |
| Value absent (covenant identified but threshold not extracted) | 0.25 |
| Value fabricated | 0.0 + hallucination flag |

D3 = arithmetic mean of individual threshold scores across all found covenants.
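
A minimal sketch of D3 scoring that also surfaces the hallucination flag is shown below. The labels and the function name are hypothetical. As a simplification, the sketch treats each list item as one scored value (a covenant threshold or a single carve-out cap); how carve-out scores combine with the parent covenant's score is not spelled out in the table, so that part is an assumption.

```python
from statistics import mean

# Per-value extraction scores, keyed by hypothetical result labels.
D3_SCORES = {
    "verbatim": 1.0,
    "paraphrased": 0.75,
    "partial": 0.50,
    "wrong_carve_out_cap": 0.0,
    "absent": 0.25,
    "fabricated": 0.0,
}


def d3_score(results: list[str]) -> tuple[float, bool]:
    """Return (mean extraction score, hallucination flag)."""
    if not results:
        raise ValueError("D3 is undefined when no covenants were found")
    score = mean([D3_SCORES[r] for r in results])
    # Any fabricated value raises the hallucination flag for the case.
    hallucination = "fabricated" in results
    return score, hallucination
```

Returning the flag alongside the score lets a caller apply any case-level consequence of a fabricated value separately from the D3 mean.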

D4 — Testing Frequency Accuracy

| Result | Score |
|---|---|
| Exact cadence match (e.g., "quarterly, last day of each financial quarter") | 1.0 |
| Cadence correct but period description imprecise (e.g., "every 3 months" vs. "quarterly") | 0.75 |
| Cadence off by one level (e.g., quarterly reported as semi-annual) | 0.25 |
| Cadence off by more than one level (e.g., quarterly reported as annual) | 0.0 |
| Testing frequency not extracted | 0.25 |
| Testing frequency not applicable (positive/information covenant with event-based trigger) and correctly noted as "upon occurrence" | 1.0 |

D4 = arithmetic mean of individual frequency scores across applicable covenants.
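
The "off by one level" rule can be sketched by placing the periodic cadences on an ordered scale. Everything named here (`CADENCE_LEVELS`, `frequency_score`, `d4_score`, the `imprecise` flag) is hypothetical. The table does not cover an event-based covenant reported with a periodic cadence, or the reverse; the sketch scores that as 0.0, which is an assumption.

```python
from statistics import mean
from typing import Optional

# Ordered scale of periodic cadences; neighbours are one level apart.
CADENCE_LEVELS = {"quarterly": 0, "semi-annual": 1, "annual": 2}


def frequency_score(expected: str, reported: Optional[str],
                    imprecise: bool = False) -> float:
    """Score one covenant's testing frequency against the ground truth."""
    if reported is None:
        return 0.25  # testing frequency not extracted
    if expected == reported:
        # Covers both periodic matches and "upon occurrence" matches.
        return 0.75 if imprecise else 1.0
    if expected in CADENCE_LEVELS and reported in CADENCE_LEVELS:
        gap = abs(CADENCE_LEVELS[expected] - CADENCE_LEVELS[reported])
        return 0.25 if gap == 1 else 0.0
    # Assumption: event-based vs. periodic mismatch is not in the table.
    return 0.0


def d4_score(pairs: list[tuple[str, Optional[str]]]) -> float:
    """Arithmetic mean of frequency scores across applicable covenants."""
    if not pairs:
        raise ValueError("D4 is undefined with no applicable covenants")
    return mean([frequency_score(exp, rep) for exp, rep in pairs])
```

For example, one exact quarterly match and one annual covenant reported as semi-annual average to (1.0 + 0.25) / 2 = 0.625.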

D5 — Edge Case Handling

This dimension scores whether edge cases (waivers, carve-outs, grace periods) are correctly identified and handled. If no edge cases are present in the document, this dimension scores 1.0 by default (correct absence).

| Edge case | Handling | Score |
|---|---|---|
| Waiver present: correctly identified as temporary with original + waived threshold + scope | Full correct handling | 1.0 |
| Waiver present: identified but characterised as permanent | Critical failure | 0.0 |
| Waiver present: identified but scope or thresholds incomplete | Partial | 0.5 |
| Waiver present: not identified | Miss | 0.0 |
| Carve-outs present: all carve-outs listed with caps | Full correct handling | 1.0 |
| Carve-outs present: listed without quantitative caps | Partial | 0.5 |
| Carve-outs present: not listed at all | Miss | 0.0 |
| Grace period present: correctly extracted with trigger and duration | Full correct handling | 1.0 |
| Grace period present: not extracted | Miss | 0.0 |
| No edge cases present AND system correctly states no waivers/carve-outs found | Correct absence | 1.0 |
| No edge cases present AND system fabricates a waiver | Hallucination | 0.0 |

D5 = arithmetic mean of individual edge case scores across all applicable edge cases.
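
The default-to-1.0 rule and the per-edge-case mean might be combined as in this sketch. The `(kind, outcome)` keys, the function name, and the `fabricated_edge_case` flag are all hypothetical. The table only addresses a fabricated waiver when the document has no edge cases, so the flag is consulted only in that situation.

```python
from statistics import mean

# Scores per (edge case kind, outcome), using hypothetical labels.
D5_SCORES = {
    ("waiver", "full"): 1.0,
    ("waiver", "permanent"): 0.0,  # critical failure
    ("waiver", "partial"): 0.5,
    ("waiver", "miss"): 0.0,
    ("carve_out", "full"): 1.0,
    ("carve_out", "partial"): 0.5,
    ("carve_out", "miss"): 0.0,
    ("grace_period", "full"): 1.0,
    ("grace_period", "miss"): 0.0,
}


def d5_score(outcomes: list[tuple[str, str]],
             fabricated_edge_case: bool = False) -> float:
    """Mean edge case score, or the absence rules when none are present."""
    if not outcomes:
        # No edge cases in the document: 1.0 unless one was fabricated.
        return 0.0 if fabricated_edge_case else 1.0
    # Fabrication alongside real edge cases is not covered by the table
    # and is ignored in this sketch.
    return mean([D5_SCORES[o] for o in outcomes])
```

A document with a partially handled waiver and fully listed carve-outs would score (0.5 + 1.0) / 2 = 0.75 under this sketch.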


3. Composite Case Score

$$\text{Case Score} = 0.30 \times D1 + 0.20 \times D2 + 0.25 \times D3 + 0.15 \times D4 + 0.10 \times D5$$

Worked example (CM-004 waiver scenario)

| Dimension | Score | Weight | Weighted |
|---|---|---|---|
| D1: Coverage | 1.0 (waiver covenant identified) | 0.30 | 0.30 |
| D2: Type classification | 1.0 (correctly financial) | 0.20 | 0.20 |
| D3: Threshold extraction | 0.75 (both thresholds captured; waiver period stated but one date off) | 0.25 | 0.1875 |
| D4: Testing frequency | 1.0 (Q1 2026 testing date correct) | 0.15 | 0.15 |
| D5: Edge case handling | 0.75 (waiver identified as temporary; reversion date not stated) | 0.10 | 0.075 |

Case Score = 0.30 + 0.20 + 0.1875 + 0.15 + 0.075 = 0.9125
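
The worked example can be reproduced with a short sketch of the composite formula. The names `WEIGHTS` and `case_score` are hypothetical; the weights and dimension scores are taken from this rubric.

```python
# Dimension weights from Section 1; they sum to 1.0.
WEIGHTS = {"D1": 0.30, "D2": 0.20, "D3": 0.25, "D4": 0.15, "D5": 0.10}


def case_score(dims: dict[str, float]) -> float:
    """Weighted average of the five dimension scores."""
    if set(dims) != set(WEIGHTS):
        raise ValueError("all five dimensions must be scored")
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)


# Dimension scores from the CM-004 worked example.
cm004 = {"D1": 1.0, "D2": 1.0, "D3": 0.75, "D4": 1.0, "D5": 0.75}
print(round(case_score(cm004), 4))  # 0.9125
```

Rejecting incomplete input mirrors the requirement that all dimensions be scored for a complete evaluation.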


4. False Positive Rate

False positives (covenants identified that are not in the ground truth) are penalised separately.

$$\text{False Positive Penalty} = \min\left(\frac{\text{False positives identified}}{N_{\text{ground truth covenants}}} \times 0.25,\ 0.25\right)$$

This penalty is subtracted from the Case Score. The maximum penalty is 0.25, reached when the number of false positives equals or exceeds the number of ground truth covenants. The penalised Case Score cannot go below 0.0, even for a case with many false positives and zero true positive coverage.
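
A minimal sketch of the penalty follows, under two readings of the text above that should be treated as assumptions: the penalty is capped at 0.25, and the penalised score is floored at 0.0. Function names are hypothetical.

```python
def false_positive_penalty(false_positives: int, ground_truth_total: int) -> float:
    """Penalty proportional to false positives, capped at 0.25."""
    if ground_truth_total <= 0:
        raise ValueError("ground truth must contain at least one covenant")
    if false_positives < 0:
        raise ValueError("false_positives cannot be negative")
    return min(false_positives / ground_truth_total * 0.25, 0.25)


def penalised_case_score(case_score: float, false_positives: int,
                         ground_truth_total: int) -> float:
    """Subtract the penalty from the Case Score, never going below 0.0."""
    penalty = false_positive_penalty(false_positives, ground_truth_total)
    return max(case_score - penalty, 0.0)
```

For example, 2 false positives against 8 ground truth covenants give a penalty of 2/8 × 0.25 = 0.0625.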


5. Per-Capability Score

$$\text{Covenant Monitoring Score} = \frac{\sum_c \text{Case Score}_c}{N_{\text{cases}}}$$

Minimum passing threshold: 0.85 (consistent with EVAL-STANDARDS.md Section 6).
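
The capability-level check might look like the sketch below. The names are hypothetical, and the sketch assumes a score exactly equal to the threshold passes, since 0.85 is described as a minimum.

```python
from statistics import mean

PASS_THRESHOLD = 0.85  # minimum passing threshold from this rubric


def covenant_monitoring_score(case_scores: list[float]) -> float:
    """Unweighted mean of Case Scores across all evaluated cases."""
    if not case_scores:
        raise ValueError("at least one case is required")
    return mean(case_scores)


def passes(case_scores: list[float]) -> bool:
    """True when the capability score meets the minimum threshold."""
    return covenant_monitoring_score(case_scores) >= PASS_THRESHOLD
```

Because the mean is unweighted, a single low-scoring case can be offset by stronger cases.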


6. Hallucination Override

Any fabricated threshold value or fabricated waiver/carve-out that has no basis in the source document sets the entire case score to 0.0. This overrides all other dimension scores.

A waiver incorrectly characterised as permanent (not fabricated, but a critical interpretation error) does not trigger the hallucination override but sets D5 = 0.0 and should be flagged as a Critical Failure in the evaluation log. Critical Failures block release regardless of overall score.
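
The two rules in this section act at different levels, which the following sketch tries to make explicit: fabrication zeroes the case score, while a waiver marked as permanent leaves the weighted score in place but blocks release. The `CaseResult` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    weighted_score: float                  # composite score, with D5 already set
    hallucination: bool = False            # fabricated threshold, waiver or carve-out
    waiver_marked_permanent: bool = False  # Critical Failure; D5 should be 0.0


def final_case_score(result: CaseResult) -> float:
    """Apply the hallucination override on top of the weighted score."""
    return 0.0 if result.hallucination else result.weighted_score


def blocks_release(result: CaseResult) -> bool:
    """Critical Failures block release regardless of the overall score."""
    return result.waiver_marked_permanent
```

Keeping the two checks separate means a case can score above the passing threshold and still block release.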