Eval Standards — AI Output Quality Contract
This document is the product contract for AI output quality. It is the authoritative definition of what "correct" means for every Smartflow AI capability. All evaluation cases, scoring rubrics, and CI acceptance gates derive from this document.
Shared audience: Product, QA, Engineering.
Do not modify scoring thresholds, tier weights, or hallucination policy without a product review and a version bump of this file.
1. Loan Onboarding Extraction
1.1 Outcome Definitions
| Label | Definition |
|---|---|
| Correct | Field value matches the source document exactly (character-for-character for dates, ISIN-style identifiers, and numeric amounts; case-insensitive for party names). Citation points to the correct page and section. |
| Semantic match | Value is informationally equivalent but formatted differently (e.g., USD 500,000,000 vs. $500M vs. USD 500m). The underlying meaning and magnitude are identical. Citation is correct and precise. |
| Partial | Value is partially correct (e.g., extracting a facility amount without currency, or capturing the wrong tranche amount from a multi-tranche structure), OR citation is imprecise (correct section but wrong page number). Either condition independently triggers Partial. |
| Fail | Value is factually wrong, a required field is not extracted when clearly present in the document, or the citation is to a location that does not contain the claimed value. |
| Hallucination — zero tolerance | The extracted value is not present anywhere in the source document (fabricated value), OR the citation references a location that does not exist in the document (fabricated citation). Any hallucination in a case sets the entire case score to 0.0 regardless of other field scores. |
1.2 Required Fields
All 16 fields below are required for every extraction case. A field that is genuinely absent from the source document and correctly reported as absent scores 1.0 for that field. A field silently omitted from the extraction output when the source document contains it scores 0.0.
| Field | Notes |
|---|---|
| Borrower | Legal entity name as stated on the signature block or recitals |
| Guarantors | List of all guarantors; empty list acceptable if none stated |
| Facility Agent | Administrative/facility agent legal name |
| Facility Amount | Numeric value and currency |
| Currency | ISO 4217 code (USD, SGD, HKD, EUR, etc.) |
| Facility Type | Term loan / revolving credit facility / delayed-draw / other (verbatim if non-standard) |
| Tenor | Period expressed (e.g., 36 months, 5 years) |
| Maturity Date | Exact date in ISO 8601 format (YYYY-MM-DD) |
| Margin / Spread | Numeric value in basis points or percentage per annum |
| Reference Rate | SOFR / SONIA / EURIBOR / HIBOR / Term SOFR + tenor; state if fixed |
| Commitment Fee | Percentage per annum on undrawn amounts; N/A if absence confirmed in document |
| Repayment Schedule | Bullet / amortising / other — with frequency and dates if stated |
| Governing Law | Jurisdiction (e.g., English law, Singapore law) |
| Conditions Precedent | Summary list of CP items; not required to list every sub-clause |
| Material Adverse Change (MAC) clause | Y / N — binary; no interpretation of MAC trigger required |
| Negative Pledge | Y / N — binary; no paraphrase required |
1.3 Field Tiers and Weights
Tier weights apply in aggregated case scoring (see Extraction Accuracy Rubric).
| Tier | Fields | Weight multiplier |
|---|---|---|
| Tier 1 — Deal-critical | Borrower, Facility Amount, Currency, Maturity Date, Margin/Spread | 3.0× |
| Tier 2 — Important | Governing Law, Repayment Schedule, Reference Rate | 1.5× |
| Tier 3 — Supporting | All remaining required fields (Guarantors, Facility Agent, Facility Type, Tenor, Commitment Fee, Conditions Precedent, MAC clause, Negative Pledge) | 1.0× |
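The tier weighting and the zero-tolerance hallucination rule combine into a single case score. The following is a minimal sketch of that aggregation; the `FieldResult` shape and field names are illustrative, not part of the contract, but the weights and the hallucination override mirror sections 1.1 and 1.3:

```python
from dataclasses import dataclass

# Weight multipliers from the tier table in 1.3 (assumed keyed by tier number).
TIER_WEIGHTS = {1: 3.0, 2: 1.5, 3: 1.0}

@dataclass
class FieldResult:
    name: str
    tier: int           # 1, 2, or 3 per section 1.3
    score: float        # field-level score in [0.0, 1.0]
    hallucinated: bool = False

def case_score(fields: list[FieldResult]) -> float:
    # Zero-tolerance rule (1.1): any hallucination zeroes the entire case,
    # regardless of how the other fields scored.
    if any(f.hallucinated for f in fields):
        return 0.0
    weighted = sum(TIER_WEIGHTS[f.tier] * f.score for f in fields)
    total = sum(TIER_WEIGHTS[f.tier] for f in fields)
    return weighted / total
```

Note that a single Tier 1 miss costs three times as much as a Tier 3 miss, so a case can fail the 0.85 capability threshold on deal-critical fields alone.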
1.4 Amendment Documents
When the input document is an amendment notice (not the original agreement):
- The extraction output MUST flag that it is processing an amendment, not a primary agreement.
- Fields being amended MUST be listed as "amended" with: original value, new value, and effective date.
- Fields not amended MUST be marked "unchanged — refer to original agreement" (not re-extracted from the original).
- If the effective date is not stated, the system MUST flag it as unclear (not silently assume the signing date).
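The amendment rules above can be sketched as an output shape. The keys, status strings, and helper name below are illustrative assumptions; only the required behaviours (amendment flag, original/new/effective-date triple, "unchanged" marker, and the refusal to assume an unstated effective date) come from the contract:

```python
def amendment_field(original, new, effective_date=None):
    """Hypothetical builder for an amended-field entry per section 1.4."""
    entry = {"status": "amended", "original_value": original, "new_value": new}
    # Never silently assume the signing date: flag an unstated effective date.
    entry["effective_date"] = effective_date if effective_date else "UNCLEAR: not stated"
    return entry

output = {
    "document_type": "amendment",   # mandatory flag: this is not a primary agreement
    "Margin / Spread": amendment_field("250 bps", "275 bps", "2026-01-15"),
    "Maturity Date": amendment_field("2027-06-30", "2029-06-30"),
    # Unamended fields are marked, not re-extracted from the original.
    "Governing Law": {"status": "unchanged — refer to original agreement"},
}
```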
1.5 Multi-Tranche Documents
When the document describes multiple tranches:
- Each tranche must be extracted separately.
- Shared fields (e.g., Borrower, Governing Law) must be extracted once and noted as shared.
- Tranche-specific fields (Amount, Margin, Maturity, Reference Rate) must be extracted per tranche with explicit tranche labels.
- If a field applies to some tranches but not others, the per-tranche scope must be stated.
1.6 Confidence Scoring
The system assigns a field-level confidence score (LOW / MEDIUM / HIGH) based on signal clarity. HITL review is triggered according to the thresholds below:
| Confidence | HITL trigger | Interpretation |
|---|---|---|
| HIGH (≥ 0.90) | No — auto-accept if extraction rules pass | Model is confident; spot-check only |
| MEDIUM (≥ 0.70 and < 0.90) | Flag for review | Model is uncertain; reviewer should verify |
| LOW (< 0.70) | Required — block export until reviewed | Model is not confident; field may be illegible, ambiguous, or absent |
For degraded/OCR documents: it is correct behavior to output LOW confidence and trigger HITL. It is failure behavior to output HIGH confidence on an illegible field.
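The routing rules above reduce to a small decision function. The thresholds are taken from the table in 1.6; the function and action names are illustrative:

```python
def route(confidence: float):
    """Map a field-level confidence score to (band, HITL action) per 1.6."""
    if confidence >= 0.90:
        return ("HIGH", "auto-accept")        # spot-check only
    if confidence >= 0.70:
        return ("MEDIUM", "flag-for-review")  # reviewer should verify
    return ("LOW", "block-export")            # review required before export
```

For a degraded scan, `route(0.4)` correctly blocks export; emitting a HIGH score for the same field would be the failure mode described above.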
2. Covenant Monitoring
2.1 Outcome Definitions
| Label | Definition |
|---|---|
| Correct | Covenant identified by name, type classified correctly per taxonomy below, threshold extracted verbatim from source, testing frequency stated correctly (quarterly, semi-annual, annual), and effective date correct (or stated as matching original agreement). |
| Partial | Covenant found but threshold is ambiguous or expressed as a range when the document states a single value; OR testing period inferred without explicit source citation; OR type classification is adjacent (e.g., "maintenance" instead of "financial"). |
| Fail | Covenant missed entirely; threshold extracted with wrong value; type misclassified as a different taxonomy class; testing frequency wrong by more than one period level (e.g., quarterly reported as annual). |
2.2 Covenant Type Taxonomy
All covenants must be classified into exactly one of the following four types. If a covenant straddles two types, classify it by its primary obligation and note the secondary type.
| Type | Definition | Examples |
|---|---|---|
| Financial | Obligation expressed as a quantitative ratio or threshold the borrower must maintain or not breach | Leverage Ratio ≤ 4.0×, Interest Coverage Ratio ≥ 2.5×, DSCR ≥ 1.2×, Net Debt / EBITDA |
| Information | Obligation to deliver documents, financial statements, or notices to the facility agent or lenders within a specified period | Annual audited accounts within 120 days of fiscal year end, quarterly management accounts within 60 days |
| Negative | Prohibition on taking certain actions without lender consent | No additional financial indebtedness above threshold, no disposal of material assets without consent, no change of business |
| Positive | Affirmative obligation to maintain a condition or take an action | Maintain adequate insurance, maintain compliance with applicable laws, maintain existing business |
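The exactly-one-primary-class rule, plus the secondary note for straddling covenants, can be encoded as a small validator. Class names come from the taxonomy above; the return shape and function name are assumptions:

```python
# The four taxonomy classes from section 2.2.
COVENANT_TYPES = {"Financial", "Information", "Negative", "Positive"}

def classify(primary, secondary=None):
    """Validate a covenant classification: one primary class, optional
    secondary note when the covenant straddles two types."""
    if primary not in COVENANT_TYPES:
        raise ValueError(f"unknown covenant type: {primary}")
    if secondary is not None and secondary not in COVENANT_TYPES:
        raise ValueError(f"unknown secondary type: {secondary}")
    return {"primary": primary, "secondary_note": secondary}
```

A negative covenant with a quantitative cap, for example, would classify as `classify("Negative", "Financial")`.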
2.3 Edge Case Handling — MANDATORY FLAGS
The following conditions MUST be explicitly flagged in the output. Silently passing over them is scored as Fail for the relevant covenant.
| Edge case | Required output |
|---|---|
| Waiver | State: (a) what covenant is waived, (b) the original threshold, (c) the waived threshold or relaxed condition, (d) the waiver period or effective testing date range, (e) explicit flag that this is temporary and the underlying covenant survives the waiver |
| Carve-out | List each carve-out from a negative covenant with its scope and any cap (e.g., "permitted financial indebtedness: subsidiary debt up to USD 25,000,000 in aggregate") |
| Grace period | State the number of grace period days and the triggering events that initiate the grace period |
| Cross-default / Cross-acceleration | Flag if present and extract applicable threshold |
3. Benchmark Terms Validation (BMT)
3.1 Outcome Definitions
| Label | Definition |
|---|---|
| Correct | A deviation from the APLMA or LMA benchmark is correctly identified. The output states: (a) the deal term value, (b) the benchmark reference value, (c) the quantified gap (in bps, percentage points, or qualitative description), and (d) the correct benchmark publication or clause reference. |
| True negative | No benchmark deviation is found AND the output confirms this by citing at least one applicable benchmark reference that agrees with the deal term. A true negative is not silence — it requires affirmative confirmation. |
| Fail | Deviation missed; incorrect benchmark cited (e.g., LMA standard applied to an APLMA deal); jurisdiction mismatch; deviation flagged for a term that is market-standard (false positive without evidence); gap quantification wrong by more than 20 bps. |
3.2 Jurisdiction Matching
Benchmark selection is mandatory and must match the deal's governing law and market:
| Deal jurisdiction / market | Required benchmark |
|---|---|
| Singapore law, APAC market | APLMA (Asia Pacific Loan Market Association) |
| English law, European market | LMA (Loan Market Association) |
| New York law, US market | LSTA (Loan Syndications and Trading Association) |
| Mixed / unclear | Flag jurisdiction ambiguity; do not apply a default benchmark without disclosure |
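The selection rule above is deliberately a closed mapping with no fallback. A minimal sketch, assuming illustrative key and flag names (the jurisdiction/benchmark pairs themselves come from the table):

```python
# Mapping from (governing law, market) to required benchmark, per section 3.2.
BENCHMARK_BY_MARKET = {
    ("Singapore law", "APAC"): "APLMA",
    ("English law", "European"): "LMA",
    ("New York law", "US"): "LSTA",
}

def select_benchmark(governing_law, market):
    benchmark = BENCHMARK_BY_MARKET.get((governing_law, market))
    if benchmark is None:
        # Mixed/unclear: flag the ambiguity; never silently apply a default.
        return {"benchmark": None, "flag": "jurisdiction-ambiguity"}
    return {"benchmark": benchmark, "flag": None}
```

Note that a mismatched pair such as English law in a US market also falls through to the ambiguity flag rather than guessing between LMA and LSTA.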
3.3 Novel or Non-Standard Structures
When a deal's structure has no corresponding benchmark term (e.g., delayed-draw term loan with unusual commitment fee, bespoke amortisation schedule):
- The output must state that benchmark comparison is inapplicable or limited for that specific term.
- Standard terms within the same deal must still be evaluated against applicable benchmarks.
- The system must not false-flag a non-standard structure as a "deviation" when no benchmark comparator exists.
4. Document Q&A
4.1 Outcome Definitions
| Label | Definition |
|---|---|
| Correct | Answer is factually accurate and directly supported by the cited text. Scope is appropriate (answers only what is in the document). Uncertainty is explicitly acknowledged where the source text is genuinely ambiguous. |
| Grounded refusal | Information requested is not present in the document. Output explicitly states: "The document does not contain information about [topic]." Does not hallucinate a response. Does not provide a generic disclaimer without specifying what is absent. |
| Fail | Factually wrong answer; citation provided does not support the answer when reviewed; out-of-scope claim made with false confidence (e.g., inferring information not present); refusal to answer a question that is clearly answerable from document text; vague non-answer when a specific grounded answer exists. |
4.2 Scope Rules
| Rule | Requirement |
|---|---|
| Document-only answers | All answers must be derived exclusively from the referenced source document. No background knowledge about APLMA standards, market norms, or borrower financials may be imported unless the document itself references them. |
| Uncertainty acknowledgement | If source text uses qualified language (e.g., "as may be determined," "subject to market disruption"), the answer must reflect that qualification. |
| Citation requirement | Every substantive answer must cite the source document by name, clause/section reference, and verbatim quote of the supporting text. |
| Multi-clause synthesis | For questions requiring reading multiple clauses, all relevant sections must be cited. Synthesised answers must be traceable to each component. |
| Grounded refusal format | Must use the format: "The document does not contain information about [X]. If this information is required, it may be found in [Y — only if plausible]." Do not say "I don't know." |
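The grounded-refusal format lends itself to an automated check in the eval harness. The regex and function name below are assumptions; the mandatory opening phrase is taken verbatim from the table above:

```python
import re

# A grounded refusal must name the absent topic; a bare "I don't know" or a
# generic disclaimer must not pass.
REFUSAL_PATTERN = re.compile(
    r"^The document does not contain information about .+\."
)

def is_grounded_refusal(answer: str) -> bool:
    return bool(REFUSAL_PATTERN.match(answer))
```

The optional "it may be found in [Y]" tail is allowed through because the pattern only anchors the opening sentence.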
5. Cross-Cutting Provenance Standard
This standard applies to all capability outputs without exception.
5.1 Required Provenance Elements
Every extracted field, covenant identification, BMT comparison, and Q&A answer must include all three elements:
| Element | Requirement |
|---|---|
| Source document name | Exact filename as submitted (e.g., acme-corp-facility-agreement.pdf) |
| Page / section reference | Page number AND clause/section identifier (e.g., Page 14, Clause 5.1(a)) |
| Verbatim quote | Exact text from the source document, sufficient to support the output. Must be findable by a human reviewer searching the source PDF. |
5.2 Provenance Failure Modes
| Failure | Classification |
|---|---|
| Citation present but points to wrong page | Partial (provenance score 0.5) |
| Citation present, page correct, but quote paraphrased not verbatim | Partial (provenance score 0.5) |
| Citation absent entirely | Fail (provenance score 0.0) |
| Quote is verbatim but does not support the claimed extracted value | Fail (provenance score 0.0) |
| Quote text cannot be found in source PDF by human reviewer | Hallucination (case score 0.0) |
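The failure-mode table above is an ordered decision procedure: the hallucination check dominates, then citation absence and unsupportive quotes, then the two Partial conditions. A sketch with illustrative argument names (scores and classifications follow the table):

```python
def provenance_score(citation_present, page_correct, quote_verbatim,
                     quote_supports_value, quote_found_in_source):
    """Classify a single output's provenance per section 5.2.
    Returns (score, classification)."""
    if citation_present and not quote_found_in_source:
        return (0.0, "Hallucination")   # zeroes the entire case (1.1)
    if not citation_present:
        return (0.0, "Fail")
    if not quote_supports_value:
        return (0.0, "Fail")
    if not page_correct or not quote_verbatim:
        return (0.5, "Partial")
    return (1.0, "Correct")
```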
6. Passing Thresholds
These thresholds apply to all production-grade evaluations. Failing to meet any threshold requires a remediation sprint before release.
| Scope | Threshold | Notes |
|---|---|---|
| Per-capability score | ≥ 0.85 | Weighted average across all cases in capability set |
| Tier 1 field score (per field) | ≥ 0.70 | Any single Tier 1 field below 0.70 blocks release |
| Hallucination rate | 0.0 | Zero tolerance — any hallucination requires root cause analysis before re-run |
| Provenance completeness | ≥ 0.90 | Proportion of outputs with full, correct provenance |
| Grounded refusal accuracy | ≥ 0.95 | Out-of-scope questions must be correctly refused |
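Because release is blocked by the first failing gate, a CI harness would typically evaluate every gate and report all failures at once. A minimal sketch; metric names are illustrative, threshold values come from the table (the per-field Tier 1 gate is represented here by the minimum across Tier 1 fields):

```python
# Floors from section 6; hallucination is handled separately (zero tolerance).
THRESHOLDS = {
    "capability_score": 0.85,
    "tier1_field_min": 0.70,        # minimum over all Tier 1 field scores
    "provenance_completeness": 0.90,
    "grounded_refusal_accuracy": 0.95,
}

def failed_gates(metrics: dict) -> list[str]:
    """Return the list of failed gates; an empty list means release may proceed."""
    failures = [name for name, floor in THRESHOLDS.items()
                if metrics.get(name, 0.0) < floor]
    if metrics.get("hallucination_rate", 1.0) > 0.0:
        failures.append("hallucination_rate")   # any nonzero rate blocks release
    return failures
```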
7. Version History
| Version | Date | Change |
|---|---|---|
| 1.0 | 2026-03-25 | Initial release — covers Loan Onboarding, Covenant Monitoring, BMT, Document Q&A |