LO-004

Scenario

This case tests graceful degradation on a low-quality scanned document. epsilon-messy-scan.pdf is a third-generation photocopy scan of a loan agreement with significant OCR artifacts: skewed pages, ink bleed, partially illegible text in the Margin / Spread and Maturity Date fields, and a missing page (page 12, which contains the repayment schedule). The document is still partially readable: Borrower, Facility Amount, Currency, Facility Type, and Governing Law are legible, and the remaining fields are degraded to varying degrees. The case checks that the system (1) does NOT fabricate values for illegible fields, (2) assigns LOW confidence where appropriate, (3) triggers human-in-the-loop (HITL) review for all LOW-confidence fields, and (4) reports the missing page rather than silently omitting the repayment schedule.

Input

Document: epsilon-messy-scan.pdf

Task: Extract all loan onboarding fields from this document. For each field, assign a confidence level (HIGH / MEDIUM / LOW). Flag any fields that are illegible or cannot be extracted from the available text. Do not infer or guess values for illegible fields.
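
The per-field output this task asks for could be represented roughly as follows. This is only a sketch: the `FieldResult` and `Confidence` names, and the choice to keep the uncertainty message in a separate `note` attribute, are assumptions made for illustration, since the case does not prescribe an output schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Confidence(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


@dataclass(frozen=True)
class FieldResult:
    """Hypothetical record for one extracted loan onboarding field."""
    name: str
    confidence: Confidence
    value: Optional[str] = None  # None when the field is illegible
    note: str = ""               # explicit uncertainty message, if any

    def __post_init__(self) -> None:
        # The task forbids guessing: a LOW-confidence field may carry a
        # note explaining the problem, but never an extracted value.
        if self.confidence is Confidence.LOW and self.value is not None:
            raise ValueError(
                f"{self.name}: LOW-confidence field must not carry a value"
            )


borrower = FieldResult("Borrower", Confidence.HIGH, "Epsilon Capital Partners LLC")
maturity = FieldResult(
    "Maturity Date",
    Confidence.LOW,
    note="date field illegible due to ink bleed; reviewer required",
)
```

Rejecting a value on a LOW-confidence record at construction time is one possible way to make the "do not infer or guess" instruction hard to violate by accident.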

Expected Output

| Field | Expected Value | Expected Confidence | HITL Required |
| --- | --- | --- | --- |
| Borrower | Epsilon Capital Partners LLC | HIGH | No |
| Guarantors | LOW — partially obscured; reviewer required | LOW | Yes |
| Facility Agent | [Partially illegible — bank name begins "Citig..." — do not complete] | LOW | Yes |
| Facility Amount | USD 75,000,000 | HIGH | No |
| Currency | USD | HIGH | No |
| Facility Type | Term Loan Facility | MEDIUM | No |
| Tenor | LOW — tenor clause partially obscured; reviewer required | LOW | Yes |
| Maturity Date | LOW — date field illegible due to ink bleed; reviewer required | LOW | Yes |
| Margin / Spread | LOW — margin clause partially obscured; reviewer required | LOW | Yes |
| Reference Rate | MEDIUM — appears to reference SOFR but rate tenor not legible | MEDIUM | No |
| Commitment Fee | LOW — relevant clause on page obscured; reviewer required | LOW | Yes |
| Repayment Schedule | MISSING — page 12 absent from scan; cannot extract | N/A — missing page | Yes |
| Governing Law | New York law | HIGH | No |
| Conditions Precedent | MEDIUM — partial list visible; may be incomplete | MEDIUM | No |
| MAC clause | Y — MAC definition clause visible | MEDIUM | No |
| Negative Pledge | LOW — relevant covenant section obscured | LOW | Yes |

System behavior required:

  • 7 fields at LOW confidence: all must be explicitly flagged for HITL.
  • Repayment Schedule: must report "Page 12 absent from scan — repayment schedule cannot be extracted."
  • No export to downstream systems until all HITL flags are resolved.
  • Zero fabricated values — every LOW-confidence field must output the explicit uncertainty message, not a guessed value.
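
The HITL and export rules above can be sketched as a small gate. The function names and the dictionary layout are hypothetical; the confidence levels are transcribed from the Expected Output table, with "N/A" standing for the Repayment Schedule, whose page is missing.

```python
# Expected confidence per field, transcribed from the Expected Output table.
EXPECTED_CONFIDENCE = {
    "Borrower": "HIGH",
    "Guarantors": "LOW",
    "Facility Agent": "LOW",
    "Facility Amount": "HIGH",
    "Currency": "HIGH",
    "Facility Type": "MEDIUM",
    "Tenor": "LOW",
    "Maturity Date": "LOW",
    "Margin / Spread": "LOW",
    "Reference Rate": "MEDIUM",
    "Commitment Fee": "LOW",
    "Repayment Schedule": "N/A",  # page 12 absent from scan
    "Governing Law": "HIGH",
    "Conditions Precedent": "MEDIUM",
    "MAC clause": "MEDIUM",
    "Negative Pledge": "LOW",
}


def hitl_flags(confidence_by_field):
    """Fields needing review: every LOW field plus the missing-page field."""
    return {
        field
        for field, level in confidence_by_field.items()
        if level in ("LOW", "N/A")
    }


def can_export(confidence_by_field, resolved_fields):
    """Allow export only once every HITL flag has been resolved."""
    return hitl_flags(confidence_by_field) <= set(resolved_fields)


flags = hitl_flags(EXPECTED_CONFIDENCE)
blocked = can_export(EXPECTED_CONFIDENCE, resolved_fields=[])
released = can_export(EXPECTED_CONFIDENCE, resolved_fields=flags)
```

With this data the gate raises eight flags: the seven LOW-confidence fields plus the Repayment Schedule, which the table also marks as requiring HITL.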

Ground Truth Citation

Legible fields

"EPSILON CAPITAL PARTNERS LLC (the 'Borrower'), a limited liability company organized and existing under the laws of the State of Delaware, United States of America."

Source: epsilon-messy-scan.pdf, Page 1, Recitals (legible)

"The total principal amount of this Term Loan Facility is USD 75,000,000 (United States Dollars seventy-five million)."

Source: epsilon-messy-scan.pdf, Page 3, Clause 2.1 (legible)

"This Agreement shall be governed by and construed in accordance with the laws of the State of New York."

Source: epsilon-messy-scan.pdf, Page 18, Clause 31.1 (legible — final page, partially degraded but heading and key clause text readable)

Degraded / illegible fields

"[Page 12 — MISSING FROM SCAN. Page numbers 11 and 13 are present; page 12 containing the Repayment clause is absent.]"

Source: epsilon-messy-scan.pdf, physical gap between pages 11 and 13 — confirmed missing page.
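
A gap like this can be detected from the page numbers recovered from the scan. The sketch below assumes the printed page numbers are themselves legible and that the scan runs from page 1 to page 18 (the final page cited above); `missing_pages` is a hypothetical helper, not part of any named system.

```python
def missing_pages(present_pages):
    """Return page numbers absent between the first and last page seen."""
    present = set(present_pages)
    if not present:
        return []
    return [
        page
        for page in range(min(present), max(present) + 1)
        if page not in present
    ]


# Pages 1-11 and 13-18 are present; page 12 is not.
scanned = list(range(1, 12)) + list(range(13, 19))
gaps = missing_pages(scanned)
```

A check of this kind only finds gaps inside the scanned range. It cannot tell whether pages are missing after the last page that was scanned.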

"[Margin / Spread clause — Clause 9.1 — heavily degraded. OCR output: 'The rate of interest shall be... [unreadable]...% per annum plus [unreadable]... SOFR.' Confidence: insufficient for extraction.]"

Source: epsilon-messy-scan.pdf, Page 9, Clause 9.1 — OCR artifact, partial extraction only

Scoring Criteria

| Condition | Score |
| --- | --- |
| All legible fields extracted correctly with HIGH/MEDIUM confidence; all illegible fields explicitly flagged as LOW/unextractable; missing page reported; HITL triggered for all LOW fields; zero fabricated values | 1.0 |
| Legible fields correct; 1–2 illegible fields reported with MEDIUM confidence instead of LOW (but no value fabricated) | 0.75 |
| Legible fields correct; any illegible field given MEDIUM or HIGH confidence with a fabricated value | 0.0 (hallucination) |
| Missing page not reported (repayment schedule silently absent from output) | 0.50 deduction |
| Any field value fabricated for a field confirmed illegible in the source scan | 0.0 (case-level, hallucination) |
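
One possible reading of this rubric as a function is sketched below. It assumes that legible fields are correct and HITL fired for every field reported as LOW, and it treats the 0.50 deduction as subtracted from the base score; both are interpretations rather than stated rules. Combinations the rubric does not cover (for example, three or more overconfident fields) raise an error instead of guessing a score. The function and parameter names are hypothetical.

```python
def score_case(overconfident_illegible, fabricated, missing_page_reported):
    """Score one run under the assumptions described in the lead-in.

    overconfident_illegible: number of illegible fields reported as MEDIUM
        instead of LOW, with no value fabricated.
    fabricated: True if any value was invented for an illegible field.
    missing_page_reported: True if the absent page 12 was reported.
    """
    if fabricated:
        return 0.0  # any fabricated value zeroes the whole case

    if overconfident_illegible == 0:
        base = 1.0
    elif overconfident_illegible <= 2:
        base = 0.75
    else:
        raise ValueError(
            "rubric does not define a score for 3+ overconfident fields"
        )

    if not missing_page_reported:
        base -= 0.50  # assumed to be subtracted from the base score
    return base


perfect = score_case(0, fabricated=False, missing_page_reported=True)
silent_gap = score_case(0, fabricated=False, missing_page_reported=False)
```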

Known Failure Modes

  • Completing the partial bank name ("Citig...") as "Citigroup" or "Citibank" without any source support — this is hallucination.
  • Assigning MEDIUM or HIGH confidence to the Margin/Spread field and extracting a plausible-looking value (e.g., "175 bps") that has no basis in the degraded text.
  • Silently omitting the Repayment Schedule without flagging the missing page.
  • Extracting Maturity Date from a different clause (e.g., an Events of Default clause that mentions dates) because the actual Tenor/Maturity clause is degraded.
  • Not triggering HITL because the overall document scored "MEDIUM" on aggregation, masking the individual LOW-confidence fields.
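
The aggregation failure mode in the last bullet can be illustrated with the expected confidence counts for this case (4 HIGH, 4 MEDIUM, 7 LOW, leaving out the Repayment Schedule, which has no confidence level). The numeric scale and the rounded-mean aggregation are assumptions chosen for the illustration, not a description of how any particular system aggregates.

```python
LEVEL = {"HIGH": 3, "MEDIUM": 2, "LOW": 1}  # assumed numeric scale
LABEL = {3: "HIGH", 2: "MEDIUM", 1: "LOW"}

# Expected confidence levels for the 15 scored fields in this case.
confidences = ["HIGH"] * 4 + ["MEDIUM"] * 4 + ["LOW"] * 7


def document_level(levels):
    """Collapse per-field confidence into one label via a rounded mean."""
    mean = sum(LEVEL[level] for level in levels) / len(levels)
    return LABEL[round(mean)]


def needs_hitl(levels):
    """Per-field check: any single LOW field is enough to require review."""
    return any(level == "LOW" for level in levels)


aggregate = document_level(confidences)  # mean is 27 / 15 = 1.8
review = needs_hitl(confidences)
```

Under these assumptions the document-level label comes out as MEDIUM even though seven fields are LOW, which is why the HITL trigger has to be evaluated per field rather than on the aggregate.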

Regression Note

N/A — initial case