Eval Standards — AI Output Quality Contract
This document is the product contract for AI output quality. It is the authoritative definition of what "correct" means for every Smartflow AI capability. All evaluation cases, scoring rubrics, and CI acceptance gates derive from this document.
Shared audience: Product, QA, Engineering.
Do not modify scoring thresholds, tier weights, or hallucination policy without a product review and a version bump of this file.
1. Loan Onboarding Extraction
1.1 Outcome Definitions
| Label | Definition |
|---|---|
| Correct | Field value matches the source document exactly (character-for-character for dates, ISIN-style identifiers, and numeric amounts; case-insensitive for party names). Citation points to the correct page and section. |
| Semantic match | Value is informationally equivalent but formatted differently (e.g., USD 500,000,000 vs. $500M vs. USD 500m). The underlying meaning and magnitude are identical. Citation is correct and precise. |
| Partial | Value is partially correct (e.g., extracting a facility amount without currency, or capturing the wrong tranche amount from a multi-tranche structure), OR citation is imprecise (correct section but wrong page number). Either condition independently triggers Partial. |
| Fail | Value is factually wrong, a required field is not extracted when clearly present in the document, or the citation is to a location that does not contain the claimed value. |
| Hallucination — zero tolerance | The extracted value is not present anywhere in the source document (fabricated value), OR the citation references a location that does not exist in the document (fabricated citation). Any hallucination in a case sets the entire case score to 0.0 regardless of other field scores. |
1.2 Required Fields
All 16 fields below are required for every extraction case. A field that is genuinely absent from the source document and correctly reported as absent scores 1.0 for that field. A field silently omitted from the extraction output when the source document contains it scores 0.0.
| Field | Notes |
|---|---|
| Borrower | Legal entity name as stated on the signature block or recitals |
| Guarantors | List of all guarantors; empty list acceptable if none stated |
| Facility Agent | Administrative/facility agent legal name |
| Facility Amount | Numeric value and currency |
| Currency | ISO 4217 code (USD, SGD, HKD, EUR, etc.) |
| Facility Type | Term loan / revolving credit facility / delayed-draw / other (verbatim if non-standard) |
| Tenor | Period expressed (e.g., 36 months, 5 years) |
| Maturity Date | Exact date in ISO 8601 format (YYYY-MM-DD) |
| Margin / Spread | Numeric value in basis points or percentage per annum |
| Reference Rate | SOFR / SONIA / EURIBOR / HIBOR / Term SOFR + tenor; state if fixed |
| Commitment Fee | Percentage per annum on undrawn amounts; N/A if absence confirmed in document |
| Repayment Schedule | Bullet / amortising / other — with frequency and dates if stated |
| Governing Law | Jurisdiction (e.g., English law, Singapore law) |
| Conditions Precedent | Summary list of CP items; not required to list every sub-clause |
| Material Adverse Change (MAC) clause | Y / N — binary; no interpretation of MAC trigger required |
| Negative Pledge | Y / N — binary; no paraphrase required |
1.3 Field Tiers and Weights
Tier weights apply in aggregated case scoring (see Extraction Accuracy Rubric).
| Tier | Fields | Weight multiplier |
|---|---|---|
| Tier 1 — Deal-critical | Borrower, Facility Amount, Currency, Maturity Date, Margin/Spread | 3.0× |
| Tier 2 — Important | Governing Law, Repayment Schedule, Reference Rate | 1.5× |
| Tier 3 — Supporting | All remaining required fields (Guarantors, Facility Agent, Facility Type, Tenor, Commitment Fee, Conditions Precedent, MAC clause, Negative Pledge) | 1.0× |
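The tier weighting and the zero-tolerance hallucination rule combine into a single case score. The following is a minimal sketch of that aggregation; the `FieldResult` shape and field names are illustrative, not part of the contract, but the weights and the hallucination override mirror sections 1.1 and 1.3:

```python
from dataclasses import dataclass

# Weight multipliers from the tier table in 1.3 (assumed keyed by tier number).
TIER_WEIGHTS = {1: 3.0, 2: 1.5, 3: 1.0}

@dataclass
class FieldResult:
    name: str
    tier: int           # 1, 2, or 3 per section 1.3
    score: float        # field-level score in [0.0, 1.0]
    hallucinated: bool = False

def case_score(fields: list[FieldResult]) -> float:
    # Zero-tolerance rule (1.1): any hallucination zeroes the entire case,
    # regardless of how the other fields scored.
    if any(f.hallucinated for f in fields):
        return 0.0
    weighted = sum(TIER_WEIGHTS[f.tier] * f.score for f in fields)
    total = sum(TIER_WEIGHTS[f.tier] for f in fields)
    return weighted / total
```

Note that a single Tier 1 miss costs three times as much as a Tier 3 miss, so a case can fail the 0.85 capability threshold on deal-critical fields alone.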
1.4 Amendment Documents
When the input document is an amendment notice (not the original agreement):
- The extraction output MUST flag that it is processing an amendment, not a primary agreement.
- Fields being amended MUST be listed as "amended" with: original value, new value, and effective date.
- Fields not amended MUST be marked "unchanged — refer to original agreement" (not re-extracted from the original).
- If the effective date is not stated, the system MUST flag it as unclear (not silently assume the signing date).
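The amendment rules above can be sketched as an output shape. The keys, status strings, and helper name below are illustrative assumptions; only the required behaviours (amendment flag, original/new/effective-date triple, "unchanged" marker, and the refusal to assume an unstated effective date) come from the contract:

```python
def amendment_field(original, new, effective_date=None):
    """Hypothetical builder for an amended-field entry per section 1.4."""
    entry = {"status": "amended", "original_value": original, "new_value": new}
    # Never silently assume the signing date: flag an unstated effective date.
    entry["effective_date"] = effective_date if effective_date else "UNCLEAR: not stated"
    return entry

output = {
    "document_type": "amendment",   # mandatory flag: this is not a primary agreement
    "Margin / Spread": amendment_field("250 bps", "275 bps", "2026-01-15"),
    "Maturity Date": amendment_field("2027-06-30", "2029-06-30"),
    # Unamended fields are marked, not re-extracted from the original.
    "Governing Law": {"status": "unchanged — refer to original agreement"},
}
```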
1.5 Multi-Tranche Documents
When the document describes multiple tranches:
- Each tranche must be extracted separately.
- Shared fields (e.g., Borrower, Governing Law) must be extracted once and noted as shared.
- Tranche-specific fields (Amount, Margin, Maturity, Reference Rate) must be extracted per tranche with explicit tranche labels.
- If a field applies to some tranches but not others, the per-tranche scope must be stated.
1.6 Confidence Scoring
The system assigns a field-level confidence score (LOW / MEDIUM / HIGH) based on signal clarity. HITL review is triggered according to the thresholds below:
| Confidence | HITL trigger | Interpretation |
|---|---|---|
| HIGH (≥ 0.90) | No — auto-accept if extraction rules pass | Model is confident; spot-check only |
| MEDIUM (≥ 0.70 and < 0.90) | Flag for review | Model is uncertain; reviewer should verify |
| LOW (< 0.70) | Required — block export until reviewed | Model is not confident; field may be illegible, ambiguous, or absent |
For degraded/OCR documents: it is correct behavior to output LOW confidence and trigger HITL. It is failure behavior to output HIGH confidence on an illegible field.
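The routing rules above reduce to a small decision function. The thresholds are taken from the table in 1.6; the function and action names are illustrative:

```python
def route(confidence: float):
    """Map a field-level confidence score to (band, HITL action) per 1.6."""
    if confidence >= 0.90:
        return ("HIGH", "auto-accept")        # spot-check only
    if confidence >= 0.70:
        return ("MEDIUM", "flag-for-review")  # reviewer should verify
    return ("LOW", "block-export")            # review required before export
```

For a degraded scan, `route(0.4)` correctly blocks export; emitting a HIGH score for the same field would be the failure mode described above.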
2. Covenant Monitoring
2.1 Outcome Definitions
| Label | Definition |
|---|---|
| Correct | Covenant identified by name, type classified correctly per taxonomy below, threshold extracted verbatim from source, testing frequency stated correctly (quarterly, semi-annual, annual), and effective date correct (or stated as matching original agreement). |
| Partial | Covenant found but threshold is ambiguous or expressed as a range when the document states a single value; OR testing period inferred without explicit source citation; OR type classification is adjacent (e.g., "maintenance" instead of "financial"). |
| Fail | Covenant missed entirely; threshold extracted with wrong value; type misclassified as a different taxonomy class; testing frequency wrong by more than one period level (e.g., quarterly reported as annual). |
2.2 Covenant Type Taxonomy
All covenants must be classified into exactly one of the following four types. If a covenant straddles two types, classify it by its primary obligation and note the secondary type.
| Type | Definition | Examples |
|---|---|---|
| Financial | Obligation expressed as a quantitative ratio or threshold the borrower must maintain or not breach | Leverage Ratio ≤ 4.0×, Interest Coverage Ratio ≥ 2.5×, DSCR ≥ 1.2×, Net Debt / EBITDA |
| Information | Obligation to deliver documents, financial statements, or notices to the facility agent or lenders within a specified period | Annual audited accounts within 120 days of fiscal year end, quarterly management accounts within 60 days |
| Negative | Prohibition on taking certain actions without lender consent | No additional financial indebtedness above threshold, no disposal of material assets without consent, no change of business |
| Positive | Affirmative obligation to maintain a condition or take an action | Maintain adequate insurance, maintain compliance with applicable laws, maintain existing business |
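The exactly-one-primary-class rule, plus the secondary note for straddling covenants, can be encoded as a small validator. Class names come from the taxonomy above; the return shape and function name are assumptions:

```python
# The four taxonomy classes from section 2.2.
COVENANT_TYPES = {"Financial", "Information", "Negative", "Positive"}

def classify(primary, secondary=None):
    """Validate a covenant classification: one primary class, optional
    secondary note when the covenant straddles two types."""
    if primary not in COVENANT_TYPES:
        raise ValueError(f"unknown covenant type: {primary}")
    if secondary is not None and secondary not in COVENANT_TYPES:
        raise ValueError(f"unknown secondary type: {secondary}")
    return {"primary": primary, "secondary_note": secondary}
```

A negative covenant with a quantitative cap, for example, would classify as `classify("Negative", "Financial")`.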
2.3 Edge Case Handling — MANDATORY FLAGS
The following conditions MUST be explicitly flagged in the output. Silently passing over them is scored as Fail for the relevant covenant.
| Edge case | Required output |
|---|---|
| Waiver | State: (a) what covenant is waived, (b) the original threshold, (c) the waived threshold or relaxed condition, (d) the waiver period or effective testing date range, (e) explicit flag that this is temporary and the underlying covenant survives the waiver |
| Carve-out | List each carve-out from a negative covenant with its scope and any cap (e.g., "permitted financial indebtedness: subsidiary debt up to USD 25,000,000 in aggregate") |
| Grace period | State the number of grace period days and the triggering events that initiate the grace period |
| Cross-default / Cross-acceleration | Flag if present and extract applicable threshold |
3. Benchmark Terms Validation (BMT)
3.1 Outcome Definitions
| Label | Definition |
|---|---|
| Correct | A deviation from the APLMA or LMA benchmark is correctly identified. The output states: (a) the deal term value, (b) the benchmark reference value, (c) the quantified gap (in bps, percentage points, or qualitative description), and (d) the correct benchmark publication or clause reference. |
| True negative | No benchmark deviation is found AND the output confirms this by citing at least one applicable benchmark reference that agrees with the deal term. A true negative is not silence — it requires affirmative confirmation. |
| Fail | Deviation missed; incorrect benchmark cited (e.g., LMA standard applied to an APLMA deal); jurisdiction mismatch; deviation flagged for a term that is market-standard (false positive without evidence); gap quantification wrong by more than 20 bps. |
3.2 Jurisdiction Matching
Benchmark selection is mandatory and must match the deal's governing law and market:
| Deal jurisdiction / market | Required benchmark |
|---|---|
| Singapore law, APAC market | APLMA (Asia Pacific Loan Market Association) |
| English law, European market | LMA (Loan Market Association) |
| New York law, US market | LSTA (Loan Syndications and Trading Association) |
| Mixed / unclear | Flag jurisdiction ambiguity; do not apply a default benchmark without disclosure |
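The selection rule above is deliberately a closed mapping with no fallback. A minimal sketch, assuming illustrative key and flag names (the jurisdiction/benchmark pairs themselves come from the table):

```python
# Mapping from (governing law, market) to required benchmark, per section 3.2.
BENCHMARK_BY_MARKET = {
    ("Singapore law", "APAC"): "APLMA",
    ("English law", "European"): "LMA",
    ("New York law", "US"): "LSTA",
}

def select_benchmark(governing_law, market):
    benchmark = BENCHMARK_BY_MARKET.get((governing_law, market))
    if benchmark is None:
        # Mixed/unclear: flag the ambiguity; never silently apply a default.
        return {"benchmark": None, "flag": "jurisdiction-ambiguity"}
    return {"benchmark": benchmark, "flag": None}
```

Note that a mismatched pair such as English law in a US market also falls through to the ambiguity flag rather than guessing between LMA and LSTA.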
3.3 Novel or Non-Standard Structures
When a deal's structure has no corresponding benchmark term (e.g., delayed-draw term loan with unusual commitment fee, bespoke amortisation schedule):
- The output must state that benchmark comparison is inapplicable or limited for that specific term.
- Standard terms within the same deal must still be evaluated against applicable benchmarks.
- The system must not false-flag a non-standard structure as a "deviation" when no benchmark comparator exists.
4. Document Q&A
4.1 Outcome Definitions
| Label | Definition |
|---|---|
| Correct | Answer is factually accurate and directly supported by the cited text. Scope is appropriate (answers only what is in the document). Uncertainty is explicitly acknowledged where the source text is genuinely ambiguous. |
| Grounded refusal | Information requested is not present in the document. Output explicitly states: "The document does not contain information about [topic]." Does not hallucinate a response. Does not provide a generic disclaimer without specifying what is absent. |
| Fail | Factually wrong answer; citation provided does not support the answer when reviewed; out-of-scope claim made with false confidence (e.g., inferring information not present); refusal to answer a question that is clearly answerable from document text; vague non-answer when a specific grounded answer exists. |
4.2 Scope Rules
| Rule | Requirement |
|---|---|
| Document-only answers | All answers must be derived exclusively from the referenced source document. No background knowledge about APLMA standards, market norms, or borrower financials may be imported unless the document itself references them. |
| Uncertainty acknowledgement | If source text uses qualified language (e.g., "as may be determined," "subject to market disruption"), the answer must reflect that qualification. |
| Citation requirement | Every substantive answer must cite the source document by name, clause/section reference, and verbatim quote of the supporting text. |
| Multi-clause synthesis | For questions requiring reading multiple clauses, all relevant sections must be cited. Synthesised answers must be traceable to each component. |
| Grounded refusal format | Must use the format: "The document does not contain information about [X]. If this information is required, it may be found in [Y — only if plausible]." Do not say "I don't know." |
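The grounded-refusal format lends itself to an automated check in the eval harness. The regex and function name below are assumptions; the mandatory opening phrase is taken verbatim from the table above:

```python
import re

# A grounded refusal must name the absent topic; a bare "I don't know" or a
# generic disclaimer must not pass.
REFUSAL_PATTERN = re.compile(
    r"^The document does not contain information about .+\."
)

def is_grounded_refusal(answer: str) -> bool:
    return bool(REFUSAL_PATTERN.match(answer))
```

The optional "it may be found in [Y]" tail is allowed through because the pattern only anchors the opening sentence.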
5. Cross-Cutting Provenance Standard
This standard applies to all capability outputs without exception.
5.1 Required Provenance Elements
Every extracted field, covenant identification, BMT comparison, and Q&A answer must include all three elements:
| Element | Requirement |
|---|---|
| Source document name | Exact filename as submitted (e.g., acme-corp-facility-agreement.pdf) |
| Page / section reference | Page number AND clause/section identifier (e.g., Page 14, Clause 5.1(a)) |
| Verbatim quote | Exact text from the source document, sufficient to support the output. Must be findable by a human reviewer searching the source PDF. |
5.2 Provenance Failure Modes
| Failure | Classification |
|---|---|
| Citation present but points to wrong page | Partial (provenance score 0.5) |
| Citation present, page correct, but quote paraphrased not verbatim | Partial (provenance score 0.5) |
| Citation absent entirely | Fail (provenance score 0.0) |
| Quote is verbatim but does not support the claimed extracted value | Fail (provenance score 0.0) |
| Quote text cannot be found in source PDF by human reviewer | Hallucination (case score 0.0) |
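The failure-mode table above is an ordered decision procedure: the hallucination check dominates, then citation absence and unsupportive quotes, then the two Partial conditions. A sketch with illustrative argument names (scores and classifications follow the table):

```python
def provenance_score(citation_present, page_correct, quote_verbatim,
                     quote_supports_value, quote_found_in_source):
    """Classify a single output's provenance per section 5.2.
    Returns (score, classification)."""
    if citation_present and not quote_found_in_source:
        return (0.0, "Hallucination")   # zeroes the entire case (1.1)
    if not citation_present:
        return (0.0, "Fail")
    if not quote_supports_value:
        return (0.0, "Fail")
    if not page_correct or not quote_verbatim:
        return (0.5, "Partial")
    return (1.0, "Correct")
```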
6. Passing Thresholds
These thresholds apply to all production-grade evaluations. Failing to meet any threshold requires a remediation sprint before release.
| Scope | Threshold | Notes |
|---|---|---|
| Per-capability score | ≥ 0.85 | Weighted average across all cases in capability set |
| Tier 1 field score (per field) | ≥ 0.70 | Any single Tier 1 field below 0.70 blocks release |
| Hallucination rate | 0.0 | Zero tolerance — any hallucination requires root cause analysis before re-run |
| Provenance completeness | ≥ 0.90 | Proportion of outputs with full, correct provenance |
| Grounded refusal accuracy | ≥ 0.95 | Out-of-scope questions must be correctly refused |
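Because release is blocked by the first failing gate, a CI harness would typically evaluate every gate and report all failures at once. A minimal sketch; metric names are illustrative, threshold values come from the table (the per-field Tier 1 gate is represented here by the minimum across Tier 1 fields):

```python
# Floors from section 6; hallucination is handled separately (zero tolerance).
THRESHOLDS = {
    "capability_score": 0.85,
    "tier1_field_min": 0.70,        # minimum over all Tier 1 field scores
    "provenance_completeness": 0.90,
    "grounded_refusal_accuracy": 0.95,
}

def failed_gates(metrics: dict) -> list[str]:
    """Return the list of failed gates; an empty list means release may proceed."""
    failures = [name for name, floor in THRESHOLDS.items()
                if metrics.get(name, 0.0) < floor]
    if metrics.get("hallucination_rate", 1.0) > 0.0:
        failures.append("hallucination_rate")   # any nonzero rate blocks release
    return failures
```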
7. Version History
| Version | Date | Change |
|---|---|---|
| 1.0 | 2026-03-25 | Initial release — covers Loan Onboarding, Covenant Monitoring, BMT, Document Q&A |