LO-004
Scenario
This case tests graceful degradation on a low-quality scanned document. epsilon-messy-scan.pdf is a third-generation photocopy scan of a loan agreement with significant OCR artifacts: skewed pages, ink bleed, partially illegible text in the margin/spread and maturity date fields, and a missing page (page 12, which contains the repayment schedule). The document is still partially readable: Borrower, Facility Amount, Currency, Facility Type, and Governing Law are legible, and the remaining fields are degraded to varying degrees. The case verifies that the system (1) does NOT fabricate values for illegible fields, (2) assigns LOW confidence where appropriate, (3) triggers HITL for all LOW-confidence fields, and (4) reports the missing page rather than silently omitting the repayment schedule.
Input
Document: epsilon-messy-scan.pdf
Task: Extract all loan onboarding fields from this document. For each field, assign a confidence level (HIGH / MEDIUM / LOW). Flag any fields that are illegible or cannot be extracted from the available text. Do not infer or guess values for illegible fields.
Expected Output
| Field | Expected Value | Expected Confidence | HITL Required |
|---|---|---|---|
| Borrower | Epsilon Capital Partners LLC | HIGH | No |
| Guarantors | LOW — partially obscured; reviewer required | LOW | Yes |
| Facility Agent | [Partially illegible — bank name begins "Citig..." — do not complete] | LOW | Yes |
| Facility Amount | USD 75,000,000 | HIGH | No |
| Currency | USD | HIGH | No |
| Facility Type | Term Loan Facility | MEDIUM | No |
| Tenor | LOW — tenor clause partially obscured; reviewer required | LOW | Yes |
| Maturity Date | LOW — date field illegible due to ink bleed; reviewer required | LOW | Yes |
| Margin / Spread | LOW — margin clause partially obscured; reviewer required | LOW | Yes |
| Reference Rate | MEDIUM — appears to reference SOFR but rate tenor not legible | MEDIUM | No |
| Commitment Fee | LOW — relevant clause on page obscured; reviewer required | LOW | Yes |
| Repayment Schedule | MISSING — page 12 absent from scan; cannot extract | N/A — missing page | Yes |
| Governing Law | New York law | HIGH | No |
| Conditions Precedent | MEDIUM — partial list visible; may be incomplete | MEDIUM | No |
| MAC clause | Y — MAC definition clause visible | MEDIUM | No |
| Negative Pledge | LOW — relevant covenant section obscured | LOW | Yes |
System behavior required:
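- 7 fields at LOW confidence (Guarantors, Facility Agent, Tenor, Maturity Date, Margin / Spread, Commitment Fee, Negative Pledge): all must be explicitly flagged for HITL.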
- Repayment Schedule: must report "Page 12 absent from scan — repayment schedule cannot be extracted."
- No export to downstream systems until all HITL flags are resolved.
- Zero fabricated values — every LOW-confidence field must output the explicit uncertainty message, not a guessed value.
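The export rule above can be sketched as a simple gate. This is a minimal illustration, not the system's actual implementation: the `FieldResult` type, the `hitl_resolved` flag, and the use of `"N/A"` as shorthand for the missing-page confidence are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldResult:
    """Hypothetical per-field extraction record (not part of the case spec)."""
    name: str
    value: Optional[str]         # None when no value could be extracted
    confidence: str              # "HIGH", "MEDIUM", "LOW", or "N/A" (missing page)
    hitl_resolved: bool = False  # set to True once a reviewer signs off


def needs_hitl(field: FieldResult) -> bool:
    # LOW-confidence fields and missing-page fields both require a reviewer.
    return field.confidence in {"LOW", "N/A"}


def open_hitl_flags(fields: List[FieldResult]) -> List[str]:
    # Names of fields still waiting for human review.
    return [f.name for f in fields if needs_hitl(f) and not f.hitl_resolved]


def can_export(fields: List[FieldResult]) -> bool:
    # Export is blocked while any HITL flag remains open.
    return not open_hitl_flags(fields)


fields = [
    FieldResult("Borrower", "Epsilon Capital Partners LLC", "HIGH"),
    FieldResult("Maturity Date", None, "LOW"),
    FieldResult("Repayment Schedule", None, "N/A"),
]
print(open_hitl_flags(fields))  # fields awaiting review
print(can_export(fields))       # False until every flag is resolved
```

The gate is evaluated per field, so a single unresolved LOW field is enough to block export regardless of how the rest of the document scored.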
Ground Truth Citation
Legible fields
"EPSILON CAPITAL PARTNERS LLC (the 'Borrower'), a limited liability company organized and existing under the laws of the State of Delaware, United States of America."
Source: epsilon-messy-scan.pdf, Page 1, Recitals (legible)
"The total principal amount of this Term Loan Facility is USD 75,000,000 (United States Dollars seventy-five million)."
Source: epsilon-messy-scan.pdf, Page 3, Clause 2.1 (legible)
"This Agreement shall be governed by and construed in accordance with the laws of the State of New York."
Source: epsilon-messy-scan.pdf, Page 18, Clause 31.1 (legible — final page, partially degraded but heading and key clause text readable)
Degraded / illegible fields
"[Page 12 — MISSING FROM SCAN. Page numbers 11 and 13 are present; page 12 containing the Repayment clause is absent.]"
Source: epsilon-messy-scan.pdf, physical gap between pages 11 and 13 — confirmed missing page.
"[Margin / Spread clause — Clause 9.1 — heavily degraded. OCR output: 'The rate of interest shall be... [unreadable]...% per annum plus [unreadable]... SOFR.' Confidence: insufficient for extraction.]"
Source: epsilon-messy-scan.pdf, Page 9, Clause 9.1 — OCR artifact, partial extraction only
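The page 12 gap cited above could be detected by comparing the page numbers found in the scan against the expected range. The sketch below assumes the printed page numbers have already been read from the scan (the `present` list is hypothetical input) and that page 18 is the last page, as the Governing Law citation indicates.

```python
from typing import List


def missing_pages(present: List[int], last_page: int) -> List[int]:
    """Return page numbers in 1..last_page that do not appear in the scan."""
    seen = set(present)
    return [p for p in range(1, last_page + 1) if p not in seen]


# Hypothetical page numbers recovered from epsilon-messy-scan.pdf:
# pages 1-11 and 13-18 are present, page 12 is not.
present = [p for p in range(1, 19) if p != 12]
gaps = missing_pages(present, last_page=18)
print(gaps)
```

Any non-empty result should be reported explicitly and routed to HITL rather than treated as an empty field.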
Scoring Criteria
| Condition | Score |
|---|---|
| All legible fields extracted correctly with HIGH/MEDIUM confidence; all illegible fields explicitly flagged as LOW/unextractable; missing page reported; HITL triggered for all LOW fields; zero fabricated values | 1.0 |
| Legible fields correct; 1–2 illegible fields reported with MEDIUM confidence instead of LOW (but no value fabricated) | 0.75 |
| Legible fields correct; any illegible field given MEDIUM or HIGH confidence with a fabricated value | 0.0 (hallucination) |
| Missing page not reported (repayment schedule silently absent from output) | 0.50 deduction from the score otherwise awarded |
| Any field value fabricated for a field confirmed illegible in the source scan | 0.0 (case-level, hallucination) |
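One way to read the rubric above is as a small scoring function. This is a sketch under stated assumptions: the input counts are taken to be precomputed by a separate comparison step, the function name and parameters are hypothetical, and outcomes the table does not cover (incorrect legible fields, or more than two under-flagged fields) raise an error instead of guessing a score.

```python
def score_case(fabricated: int,
               underflagged: int,
               missing_page_reported: bool,
               legible_correct: bool = True) -> float:
    """Score one run of LO-004 against the rubric table.

    fabricated:   illegible fields for which a value was invented
    underflagged: illegible fields reported MEDIUM instead of LOW, no value given
    """
    if fabricated > 0:
        return 0.0  # hallucination zeroes the whole case
    if not legible_correct or underflagged > 2:
        # Assumption: the rubric defines no score here, so refuse to invent one.
        raise ValueError("rubric does not define a score for this outcome")
    base = 1.0 if underflagged == 0 else 0.75
    if not missing_page_reported:
        base -= 0.50  # deduction for a silently omitted repayment schedule
    return base


print(score_case(fabricated=0, underflagged=0, missing_page_reported=True))
print(score_case(fabricated=0, underflagged=2, missing_page_reported=False))
```

Checking fabrication first mirrors the table's case-level rule: no other condition can rescue a run that invented a value.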
Known Failure Modes
- Completing the partial bank name ("Citig...") as "Citigroup" or "Citibank" without any source support — this is hallucination.
- Assigning MEDIUM or HIGH confidence to the Margin/Spread field and extracting a plausible-looking value (e.g., "175 bps") that has no basis in the degraded text.
- Silently omitting the Repayment Schedule without flagging the missing page.
- Extracting Maturity Date from a different clause (e.g., an Events of Default clause that mentions dates) because the actual Tenor/Maturity clause is degraded.
- Not triggering HITL because the overall document scored "MEDIUM" on aggregation, masking the individual LOW-confidence fields.
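The aggregation failure mode can be illustrated with the expected confidences from this case. The sketch assumes a hypothetical document-level aggregate computed as the median of per-field confidences; the source does not say how the aggregate is actually calculated. Repayment Schedule is left out because it has no confidence rating.

```python
from statistics import median

RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}
LABEL = {rank: name for name, rank in RANK.items()}

# Expected per-field confidences, copied from the Expected Output table.
expected = {
    "Borrower": "HIGH", "Guarantors": "LOW", "Facility Agent": "LOW",
    "Facility Amount": "HIGH", "Currency": "HIGH", "Facility Type": "MEDIUM",
    "Tenor": "LOW", "Maturity Date": "LOW", "Margin / Spread": "LOW",
    "Reference Rate": "MEDIUM", "Commitment Fee": "LOW",
    "Governing Law": "HIGH", "Conditions Precedent": "MEDIUM",
    "MAC clause": "MEDIUM", "Negative Pledge": "LOW",
}

# Hypothetical aggregate: median rank across all rated fields.
document_level = LABEL[int(median(RANK[c] for c in expected.values()))]

# Per-field check: this is what HITL must key on, not the aggregate.
low_fields = [name for name, c in expected.items() if c == "LOW"]

print(document_level)
print(len(low_fields), low_fields)
```

Under this assumed aggregation the document as a whole rates MEDIUM even though seven fields are LOW, which is why HITL has to be triggered from the per-field values.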
Regression Note
N/A — initial case