Eval Standards — AI Output Quality Contract

This document is the product contract for AI output quality. It is the authoritative definition of what "correct" means for every Smartflow AI capability. All evaluation cases, scoring rubrics, and CI acceptance gates derive from this document.

Audience: Product, QA, Engineering.

Normative document

Do not modify scoring thresholds, tier weights, or hallucination policy without a product review and a version bump of this file.


1. Loan Onboarding Extraction

1.1 Outcome Definitions

  • Correct: Field value matches the source document exactly (character-for-character for dates, ISIN-style identifiers, and numeric amounts; case-insensitive for party names). Citation points to the correct page and section.
  • Semantic match: Value is informationally equivalent but formatted differently (e.g., USD 500,000,000 vs. $500M vs. USD 500m). The underlying meaning and magnitude are identical. Citation is correct and precise.
  • Partial: Value is partially correct (e.g., a facility amount extracted without its currency, or the wrong tranche amount captured from a multi-tranche structure), OR the citation is imprecise (correct section but wrong page number). Either condition independently triggers Partial.
  • Fail: Value is factually wrong, a required field is not extracted despite being clearly present in the document, or the citation points to a location that does not contain the claimed value.
  • Hallucination (zero tolerance): The extracted value is not present anywhere in the source document (fabricated value), OR the citation references a location that does not exist in the document (fabricated citation). Any hallucination in a case sets the entire case score to 0.0, regardless of other field scores.
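The zero-tolerance override can be expressed as a short, non-normative sketch (the function name is illustrative, and an unweighted mean stands in for the tier-weighted aggregation defined in Section 1.3):

```python
def score_case(field_scores, hallucination_flags):
    """Aggregate per-field scores into a case score.

    Enforces the zero-tolerance rule: any hallucinated value or
    citation zeroes the entire case, regardless of the other
    field scores. (Unweighted mean shown for simplicity only;
    the tier weights in Section 1.3 apply in real scoring.)
    """
    if any(hallucination_flags):
        return 0.0
    return sum(field_scores) / len(field_scores)
```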

1.2 Required Fields

All 16 fields below are required for every extraction case. A required field that is genuinely absent from the document and correctly reported as absent scores 1.0 for that field. A field silently omitted from the extraction output when the source document contains it scores 0.0.

  • Borrower: Legal entity name as stated in the signature block or recitals
  • Guarantors: List of all guarantors; an empty list is acceptable if none are stated
  • Facility Agent: Administrative/facility agent legal name
  • Facility Amount: Numeric value and currency
  • Currency: ISO 4217 code (USD, SGD, HKD, EUR, etc.)
  • Facility Type: Term loan / revolving credit facility / delayed-draw / other (verbatim if non-standard)
  • Tenor: Period as expressed (e.g., 36 months, 5 years)
  • Maturity Date: Exact date in ISO 8601 format (YYYY-MM-DD)
  • Margin / Spread: Numeric value in basis points or percentage per annum
  • Reference Rate: SOFR / SONIA / EURIBOR / HIBOR / Term SOFR + tenor; state if fixed
  • Commitment Fee: Percentage per annum on undrawn amounts; N/A if its absence is confirmed in the document
  • Repayment Schedule: Bullet / amortising / other, with frequency and dates if stated
  • Governing Law: Jurisdiction (e.g., English law, Singapore law)
  • Conditions Precedent: Summary list of CP items; not required to list every sub-clause
  • Material Adverse Change (MAC) clause: Y / N (binary); no interpretation of the MAC trigger required
  • Negative Pledge: Y / N (binary); no paraphrase required

1.3 Field Tiers and Weights

Tier weights apply in aggregated case scoring (see Extraction Accuracy Rubric).

  • Tier 1 — Deal-critical (3.0×): Borrower, Facility Amount, Currency, Maturity Date, Margin/Spread
  • Tier 2 — Important (1.5×): Governing Law, Repayment Schedule, Reference Rate
  • Tier 3 — Supporting (1.0×): all remaining required fields (Guarantors, Facility Agent, Facility Type, Tenor, Commitment Fee, Conditions Precedent, MAC clause, Negative Pledge)
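As a non-normative illustration, the tier multipliers can be applied as a weighted average of per-field scores. The exact aggregation is defined in the Extraction Accuracy Rubric; this sketch assumes a weighted mean, and all names are illustrative:

```python
TIER_WEIGHTS = {
    # Tier 1 — deal-critical (3.0x)
    "Borrower": 3.0, "Facility Amount": 3.0, "Currency": 3.0,
    "Maturity Date": 3.0, "Margin/Spread": 3.0,
    # Tier 2 — important (1.5x)
    "Governing Law": 1.5, "Repayment Schedule": 1.5, "Reference Rate": 1.5,
    # Tier 3 — supporting (1.0x)
    "Guarantors": 1.0, "Facility Agent": 1.0, "Facility Type": 1.0,
    "Tenor": 1.0, "Commitment Fee": 1.0, "Conditions Precedent": 1.0,
    "MAC clause": 1.0, "Negative Pledge": 1.0,
}

def weighted_case_score(field_scores):
    """Weighted average of per-field scores using tier multipliers."""
    total = sum(TIER_WEIGHTS[f] * s for f, s in field_scores.items())
    weight = sum(TIER_WEIGHTS[f] for f in field_scores)
    return total / weight
```

Under this reading, a single wrong Tier 1 field drags the case score down three times harder than a wrong Tier 3 field.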

1.4 Amendment Documents

When the input document is an amendment notice (not the original agreement):

  • The extraction output MUST flag that it is processing an amendment, not a primary agreement.
  • Fields being amended MUST be listed as "amended" with: original value, new value, and effective date.
  • Fields not amended MUST be marked "unchanged — refer to original agreement" (not re-extracted from the original).
  • If the effective date is not stated, the system MUST flag it as unclear (not silently assume the signing date).
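The amendment rules above imply an output shape along these lines. This is a hypothetical illustration only; key names and values are not prescribed by this contract:

```python
# Hypothetical amendment output satisfying the Section 1.4 rules.
amendment_output = {
    "document_kind": "amendment",  # MUST flag amendment vs. primary agreement
    "amended_fields": [
        {
            "field": "Margin / Spread",
            "original_value": "225 bps",
            "new_value": "250 bps",
            "effective_date": None,          # not stated in the notice
            "effective_date_unclear": True,  # MUST flag; never assume signing date
        }
    ],
    "unchanged_fields": {
        "Governing Law": "unchanged — refer to original agreement",
    },
}
```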

1.5 Multi-Tranche Documents

When the document describes multiple tranches:

  • Each tranche must be extracted separately.
  • Shared fields (e.g., Borrower, Governing Law) must be extracted once and noted as shared.
  • Tranche-specific fields (Amount, Margin, Maturity, Reference Rate) must be extracted per tranche with explicit tranche labels.
  • If a field applies to some tranches but not others, the per-tranche scope must be stated.
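For illustration, a multi-tranche extraction satisfying the rules above could be shaped as follows (all deal values and key names are hypothetical, not part of the contract):

```python
# Hypothetical multi-tranche output: shared fields extracted once,
# tranche-specific fields extracted per tranche with explicit labels.
multi_tranche_output = {
    "shared": {
        "Borrower": "Acme Corp Pte Ltd",
        "Governing Law": "Singapore law",
    },
    "tranches": [
        {"label": "Tranche A", "Facility Amount": "USD 300,000,000",
         "Margin / Spread": "175 bps", "Maturity Date": "2029-06-30",
         "Reference Rate": "Term SOFR 3M"},
        {"label": "Tranche B", "Facility Amount": "USD 200,000,000",
         "Margin / Spread": "225 bps", "Maturity Date": "2031-06-30",
         "Reference Rate": "Term SOFR 3M"},
    ],
}
```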

1.6 Confidence Scoring

The system assigns a field-level confidence score (LOW / MEDIUM / HIGH) based on signal clarity. HITL (human-in-the-loop) review is triggered by threshold:

  • HIGH (≥ 0.90): no HITL trigger; auto-accept if extraction rules pass. The model is confident; spot-check only.
  • MEDIUM (0.70–0.89): flag for review. The model is uncertain; a reviewer should verify.
  • LOW (< 0.70): HITL required; block export until reviewed. The model is not confident; the field may be illegible, ambiguous, or absent.

For degraded or OCR-scanned documents, outputting LOW confidence and triggering HITL is correct behavior. Outputting HIGH confidence on an illegible field is failure behavior.
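The routing rule above reduces to a simple threshold check; a minimal sketch (the function name and action strings are illustrative):

```python
def hitl_action(confidence):
    """Map a field-level confidence score to its HITL disposition
    per the Section 1.6 thresholds."""
    if confidence >= 0.90:
        return "auto-accept"       # HIGH: spot-check only
    if confidence >= 0.70:
        return "flag-for-review"   # MEDIUM: reviewer should verify
    return "block-export"          # LOW: review required before export
```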


2. Covenant Monitoring

2.1 Outcome Definitions

  • Correct: Covenant identified by name, type classified correctly per the taxonomy below, threshold extracted verbatim from the source, testing frequency stated correctly (quarterly, semi-annual, annual), and effective date correct (or stated as matching the original agreement).
  • Partial: Covenant found but the threshold is ambiguous or expressed as a range when the document states a single value; OR the testing period is inferred without an explicit source citation; OR the type classification is adjacent (e.g., "maintenance" instead of "financial").
  • Fail: Covenant missed entirely; threshold extracted with the wrong value; type misclassified into a different taxonomy class; testing frequency wrong by more than one period level (e.g., quarterly reported as annual).

2.2 Covenant Type Taxonomy

All covenants must be classified into exactly one of the following four types. If a covenant straddles two types, classify it by its primary obligation and note the secondary type.

  • Financial: Obligation expressed as a quantitative ratio or threshold the borrower must maintain or not breach. Examples: Leverage Ratio ≤ 4.0×, Interest Coverage Ratio ≥ 2.5×, DSCR ≥ 1.2×, Net Debt / EBITDA.
  • Information: Obligation to deliver documents, financial statements, or notices to the facility agent or lenders within a specified period. Examples: annual audited accounts within 120 days of fiscal year end, quarterly management accounts within 60 days.
  • Negative: Prohibition on taking certain actions without lender consent. Examples: no additional financial indebtedness above a threshold, no disposal of material assets without consent, no change of business.
  • Positive: Affirmative obligation to maintain a condition or take an action. Examples: maintain adequate insurance, maintain compliance with applicable laws, maintain the existing business.

2.3 Edge Case Handling — MANDATORY FLAGS

The following conditions MUST be explicitly flagged in the output. Silently passing over them is scored as Fail for the relevant covenant.

  • Waiver: state (a) which covenant is waived, (b) the original threshold, (c) the waived threshold or relaxed condition, (d) the waiver period or effective testing date range, and (e) an explicit flag that the waiver is temporary and the underlying covenant survives it.
  • Carve-out: list each carve-out from a negative covenant with its scope and any cap (e.g., "permitted financial indebtedness: subsidiary debt up to USD 25,000,000 in aggregate").
  • Grace period: state the number of grace-period days and the triggering events that initiate the grace period.
  • Cross-default / cross-acceleration: flag if present and extract the applicable threshold.

3. Benchmark Terms Validation (BMT)

3.1 Outcome Definitions

  • Correct: A deviation from the APLMA or LMA benchmark is correctly identified. The output states (a) the deal term value, (b) the benchmark reference value, (c) the quantified gap (in bps, percentage points, or a qualitative description), and (d) the correct benchmark publication or clause reference.
  • True negative: No benchmark deviation is found AND the output confirms this by citing at least one applicable benchmark reference that agrees with the deal term. A true negative is not silence; it requires affirmative confirmation.
  • Fail: Deviation missed; incorrect benchmark cited (e.g., an LMA standard applied to an APLMA deal); jurisdiction mismatch; deviation flagged for a term that is market-standard (a false positive without evidence); benchmark quantification wrong by more than 20 bps.

3.2 Jurisdiction Matching

Benchmark selection is mandatory and must match the deal's governing law and market:

  • Singapore law, APAC market: APLMA (Asia Pacific Loan Market Association)
  • English law, European market: LMA (Loan Market Association)
  • New York law, US market: LSTA (Loan Syndications and Trading Association)
  • Mixed / unclear: flag the jurisdiction ambiguity; do not apply a default benchmark without disclosure
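The selection rule can be sketched as a lookup that refuses to default silently (mapping keys and return shape are illustrative, not part of the contract):

```python
BENCHMARK_BY_MARKET = {
    ("Singapore law", "APAC"): "APLMA",
    ("English law", "Europe"): "LMA",
    ("New York law", "US"): "LSTA",
}

def select_benchmark(governing_law, market):
    """Return the required benchmark, or flag ambiguity.
    A default benchmark is never applied without disclosure."""
    benchmark = BENCHMARK_BY_MARKET.get((governing_law, market))
    if benchmark is None:
        return {"benchmark": None, "flag": "jurisdiction ambiguity"}
    return {"benchmark": benchmark, "flag": None}
```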

3.3 Novel or Non-Standard Structures

When a deal's structure has no corresponding benchmark term (e.g., delayed-draw term loan with unusual commitment fee, bespoke amortisation schedule):

  • The output must state that benchmark comparison is inapplicable or limited for that specific term.
  • Standard terms within the same deal must still be evaluated against applicable benchmarks.
  • The system must not false-flag a non-standard structure as a "deviation" when no benchmark comparator exists.

4. Document Q&A

4.1 Outcome Definitions

  • Correct: Answer is factually accurate and directly supported by the cited text. Scope is appropriate (answers only what is in the document). Uncertainty is explicitly acknowledged where the source text is genuinely ambiguous.
  • Grounded refusal: The requested information is not present in the document. The output explicitly states: "The document does not contain information about [topic]." It does not hallucinate a response and does not provide a generic disclaimer without specifying what is absent.
  • Fail: Factually wrong answer; the cited text does not support the answer when reviewed; an out-of-scope claim made with false confidence (e.g., inferring information not present); refusal to answer a question that is clearly answerable from the document text; a vague non-answer when a specific grounded answer exists.

4.2 Scope Rules

  • Document-only answers: All answers must be derived exclusively from the referenced source document. No background knowledge about APLMA standards, market norms, or borrower financials may be imported unless the document itself references them.
  • Uncertainty acknowledgement: If the source text uses qualified language (e.g., "as may be determined," "subject to market disruption"), the answer must reflect that qualification.
  • Citation requirement: Every substantive answer must cite the source document by name, clause/section reference, and a verbatim quote of the supporting text.
  • Multi-clause synthesis: For questions that require reading multiple clauses, all relevant sections must be cited. Synthesised answers must be traceable to each component.
  • Grounded refusal format: Must use the format: "The document does not contain information about [X]. If this information is required, it may be found in [Y — only if plausible]." Do not say "I don't know."
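A minimal format check for grounded refusals might look like the sketch below. It verifies only the required prefix and the "I don't know" ban, not the optional "[Y]" clause; the function name and exact prefix handling are assumptions for illustration:

```python
REFUSAL_PREFIX = "The document does not contain information about"

def is_grounded_refusal(answer):
    """Reject bare 'I don't know' responses and require that the
    refusal names the specific missing topic after the prefix."""
    if "i don't know" in answer.lower():
        return False
    return answer.startswith(REFUSAL_PREFIX + " ")
```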

5. Cross-Cutting Provenance Standard

This standard applies to all capability outputs without exception.

5.1 Required Provenance Elements

Every extracted field, covenant identification, BMT comparison, and Q&A answer must include all three elements:

  • Source document name: Exact filename as submitted (e.g., acme-corp-facility-agreement.pdf)
  • Page / section reference: Page number AND clause/section identifier (e.g., Page 14, Clause 5.1(a))
  • Verbatim quote: Exact text from the source document, sufficient to support the output. Must be findable by a human reviewer searching the source PDF.

5.2 Provenance Failure Modes

  • Citation present but points to the wrong page: Partial (provenance score 0.5)
  • Citation present and page correct, but quote paraphrased rather than verbatim: Partial (provenance score 0.5)
  • Citation absent entirely: Fail (provenance score 0.0)
  • Quote is verbatim but does not support the claimed extracted value: Fail (provenance score 0.0)
  • Quote text cannot be found in the source PDF by a human reviewer: Hallucination (case score 0.0)
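The failure-mode table reduces to an ordered sequence of checks, hallucination first. A non-normative sketch (argument names are illustrative):

```python
def provenance_score(citation_present, page_correct, quote_verbatim,
                     quote_supports_value, quote_found_in_source):
    """Map the provenance failure modes to a (label, score) pair.
    Hallucination zeroes the whole case per Section 1.1."""
    if not quote_found_in_source:
        return ("hallucination", 0.0)
    if not citation_present or not quote_supports_value:
        return ("fail", 0.0)
    if not page_correct or not quote_verbatim:
        return ("partial", 0.5)
    return ("correct", 1.0)
```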

6. Passing Thresholds

These thresholds apply to all production-grade evaluations. Falling below any threshold requires a remediation sprint before release.

  • Per-capability score ≥ 0.85: weighted average across all cases in the capability set
  • Tier 1 field score (per field) ≥ 0.70: any single Tier 1 field below 0.70 blocks release
  • Hallucination rate = 0.0: zero tolerance; any hallucination requires root cause analysis before a re-run
  • Provenance completeness ≥ 0.90: proportion of outputs with full, correct provenance
  • Grounded refusal accuracy ≥ 0.95: out-of-scope questions must be correctly refused
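A CI acceptance gate over these thresholds can be sketched as follows (metric key names are illustrative; the normative values are the ones in the table above):

```python
THRESHOLDS = {
    "per_capability_score": 0.85,
    "tier1_field_score_min": 0.70,
    "provenance_completeness": 0.90,
    "grounded_refusal_accuracy": 0.95,
}

def release_gate(metrics, hallucination_count):
    """Return the list of failed gates; an empty list means the run may ship.
    A missing metric is treated as 0.0 and therefore fails its gate."""
    failures = [name for name, floor in THRESHOLDS.items()
                if metrics.get(name, 0.0) < floor]
    if hallucination_count > 0:
        failures.append("hallucination_rate")  # zero tolerance
    return failures
```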

7. Version History

  • 1.0 (2026-03-25): Initial release — covers Loan Onboarding, Covenant Monitoring, BMT, Document Q&A