Q&A Answer Quality Rubric

This rubric applies to all Document Q&A evaluations. It scores each answer across five independent dimensions. The composite score is a weighted average. Provenance is scored separately using the Provenance Quality Rubric.

Q&A outcome definitions (Correct / Grounded refusal / Fail) are in EVAL-STANDARDS.md Section 4.


1. Scoring Dimensions

| Dimension | Weight | Description |
|---|---|---|
| D1 — Factual accuracy | 35% | Is the answer correct per the source document? |
| D2 — Citation support | 25% | Is the answer supported by cited source text? |
| D3 — Scope appropriateness | 20% | Does the answer stay within document scope, or does it import external knowledge inappropriately? |
| D4 — Uncertainty handling | 10% | Are qualifications and ambiguities in the source text reflected in the answer? |
| D5 — Grounded refusal quality | 10% | For out-of-scope questions: is the refusal specific, grounded, and correctly formatted? |

D5 applicability

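D5 is scored on its full scale only when the question cannot be answered from the document (e.g., QA-004 type). For in-scope questions that are answered, D5 = 1.0 by default (correct non-refusal). The one exception is a false refusal: refusing to answer an in-scope question scores D5 = 0.0.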


2. Dimension Scoring Rules

D1 — Factual Accuracy

| Result | Score |
|---|---|
| Answer is fully correct per ground truth — all claimed facts match source | 1.0 |
| Answer is mostly correct — one claim is imprecise but not wrong (e.g., "approximately 5 years" vs. exactly 60 months) | 0.75 |
| Answer is partially correct — some correct facts and some incorrect facts in the same response | 0.50 |
| Answer is mostly wrong — the main claim is incorrect, with some incidental correct detail | 0.25 |
| Answer is factually wrong | 0.0 |
| Answer provides a credit rating, market rate, or external fact not in the document | 0.0 (scope violation, treated as factual error) |

D2 — Citation Support

| Result | Score |
|---|---|
| All substantive claims in the answer are supported by cited source text; citations are correct and precise | 1.0 |
| All claims cited, but one citation is imprecise (correct section, wrong page or clause offset) | 0.75 |
| Main claim cited correctly; supporting context claims not cited | 0.50 |
| Citation present but it does not support the specific claim it is assigned to | 0.25 |
| No citations provided for any claim | 0.0 |
| Citation provided but references text not found in source document | 0.0 (triggers hallucination review) |

For multi-clause answers, score each citation independently; D2 = arithmetic mean of citation scores.
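The averaging rule above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `d2_score` is hypothetical, and returning 0.0 for an empty list is an assumption taken from the "No citations provided for any claim" row.

```python
from statistics import fmean


def d2_score(citation_scores: list[float]) -> float:
    """D2 for a multi-clause answer: arithmetic mean of per-citation scores."""
    if not citation_scores:
        # Assumption: no citations at all maps to the 0.0 row of the D2 table.
        return 0.0
    return fmean(citation_scores)


# Hypothetical answer with three citations: one precise (1.0), one with an
# imprecise offset (0.75), and one that does not support its claim (0.25).
print(round(d2_score([1.0, 0.75, 0.25]), 4))
```

A citation scored 0.0 because it references text not found in the source would still need the separate hallucination review; this sketch only does the arithmetic.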

D3 — Scope Appropriateness

| Result | Score |
|---|---|
| Answer is derived entirely from the referenced source document; no external knowledge imported | 1.0 |
| Answer primarily source-based; one incidental reference to general market context that does not affect the factual claim (e.g., "as is standard for APLMA deals, which this agreement follows") | 0.75 |
| Answer mixes document facts with stated or implied external knowledge (e.g., citing typical market practice as if it applies to this deal) | 0.50 |
| Answer is substantially based on external knowledge with document quotes added as decoration | 0.25 |
| Answer is based entirely on external knowledge with no document grounding | 0.0 |
| Answer provides a claim expressly contradicted by the source document | 0.0 |

D4 — Uncertainty Handling

This dimension applies when the source text itself is qualified, conditional, or ambiguous.

| Result | Score |
|---|---|
| Qualifications in source text are accurately reflected in the answer (e.g., "subject to Market Disruption Event," "as may be determined by the Facility Agent") | 1.0 |
| Qualification present in source; answer notes uncertainty exists but does not specify the qualification | 0.75 |
| Qualification present in source; answer asserts certainty without acknowledging the qualification | 0.25 |
| Qualification present in source that materially affects the answer; answer ignores it and states the un-qualified version as fact | 0.0 |
| No qualification present in source text; answer treats matter as certain | 1.0 (correct confidence) |

D5 — Grounded Refusal Quality

This dimension applies only when the question is out of scope (e.g., QA-004 type).

| Result | Score |
|---|---|
| Refusal explicitly states what is absent: "The document does not contain information about [X]." Does not hallucinate. Optionally suggests where to find the information. | 1.0 |
| Refusal present but non-specific: "This information is not available" without naming what is absent | 0.50 |
| Says "I don't know" without document-scoped explanation | 0.25 |
| Provides the requested information — no refusal — even though information is not in the document | 0.0 (hallucination) |
| Refuses to answer a question that is in scope (false refusal) | 0.0 |

Grounded refusal format requirement:
A valid grounded refusal must use the structure: "The document does not contain information about [specific topic]. If this information is required, it may be found in [plausible external source — only if applicable]."

Saying "I don't know" or "I cannot answer" without document-scoped specificity does not qualify as a grounded refusal.
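A surface-level check of the required structure might look like the sketch below. The function name and the regular expression are assumptions made for illustration. The check only tests whether the answer opens with the required sentence and names a topic; it cannot tell whether the refusal is truthful or free of hallucination, and a topic containing a period (such as an abbreviation) would be cut short.

```python
import re

# Assumed pattern: the required opening sentence followed by a non-empty topic.
_REFUSAL_PATTERN = re.compile(
    r"^The document does not contain information about (?P<topic>[^.]+)\."
)


def has_grounded_refusal_format(answer: str) -> bool:
    """True if the answer opens with the required sentence and names a topic."""
    match = _REFUSAL_PATTERN.match(answer.strip())
    return match is not None and match.group("topic").strip() != ""


print(has_grounded_refusal_format(
    "The document does not contain information about the borrower's credit rating."
))
print(has_grounded_refusal_format("I don't know."))
```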


3. Composite Case Score

$$\text{Case Score} = 0.35 \times D1 + 0.25 \times D2 + 0.20 \times D3 + 0.10 \times D4 + 0.10 \times D5$$

Worked example — QA-002 (partial result)

| Dimension | Score | Weight | Weighted |
|---|---|---|---|
| D1 — Factual accuracy | 0.75 (main condition correct; break cost nuance omitted) | 0.35 | 0.2625 |
| D2 — Citation support | 1.0 (all cited clauses correct) | 0.25 | 0.25 |
| D3 — Scope appropriateness | 1.0 (document-only answer) | 0.20 | 0.20 |
| D4 — Uncertainty handling | 1.0 (no qualifications in source for this answer) | 0.10 | 0.10 |
| D5 — Not applicable (in-scope question) | 1.0 (default) | 0.10 | 0.10 |

Case Score = 0.2625 + 0.25 + 0.20 + 0.10 + 0.10 = 0.9125
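The composite formula and the QA-002 arithmetic can be reproduced with a short sketch. The weights come from the Scoring Dimensions table; the names `WEIGHTS` and `case_score` are illustrative, and the input validation is an added assumption rather than part of the rubric.

```python
# Weights from the Scoring Dimensions table (they sum to 1.0).
WEIGHTS = {"D1": 0.35, "D2": 0.25, "D3": 0.20, "D4": 0.10, "D5": 0.10}


def case_score(scores: dict[str, float]) -> float:
    """Weighted sum of the five dimension scores, each expected in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    for dim in WEIGHTS:
        if not 0.0 <= scores[dim] <= 1.0:
            raise ValueError(f"{dim} must be between 0 and 1, got {scores[dim]}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


# Dimension scores from the QA-002 worked example.
qa_002 = {"D1": 0.75, "D2": 1.0, "D3": 1.0, "D4": 1.0, "D5": 1.0}
print(round(case_score(qa_002), 4))  # 0.9125
```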


4. Hallucination Override for Q&A

Any factual claim made with false confidence that is not present in the source document — including fabricated quotations, invented clause references, or external facts stated as document facts — sets the entire case score to 0.0.

This override applies regardless of how many other claims in the same response are correct.
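In code, the override is a single guard applied after the composite is computed. The sketch below assumes the hallucination decision arrives as a boolean from a reviewer or an upstream check; the function and parameter names are hypothetical.

```python
def final_case_score(composite: float, hallucination_detected: bool) -> float:
    """Return 0.0 when the override fires, otherwise the composite unchanged."""
    return 0.0 if hallucination_detected else composite


# A case that would otherwise score 0.9125 is zeroed by a single hallucinated claim.
print(final_case_score(0.9125, hallucination_detected=True))   # 0.0
print(final_case_score(0.9125, hallucination_detected=False))  # 0.9125
```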


5. Per-Capability Score

$$\text{Document Q\&A Score} = \frac{\sum_c \text{Case Score}_c}{N_{\text{cases}}}$$

Minimum passing threshold: 0.85 per capability (consistent with EVAL-STANDARDS.md Section 6).

Additional required threshold: grounded refusal accuracy ≥ 0.95 across all out-of-scope test cases in the capability set.
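A minimal sketch of the capability-level check follows, assuming the case scores already reflect any overrides and that grounded refusal accuracy is supplied as a precomputed number. The function and constant names are illustrative.

```python
from statistics import fmean

PASS_THRESHOLD = 0.85              # minimum per-capability score
REFUSAL_ACCURACY_THRESHOLD = 0.95  # minimum grounded refusal accuracy


def capability_score(case_scores: list[float]) -> float:
    """Arithmetic mean of the case scores in one capability set."""
    if not case_scores:
        raise ValueError("capability set has no cases")
    return fmean(case_scores)


def capability_passes(case_scores: list[float], refusal_accuracy: float) -> bool:
    """Both thresholds must hold for the capability to pass."""
    return (
        capability_score(case_scores) >= PASS_THRESHOLD
        and refusal_accuracy >= REFUSAL_ACCURACY_THRESHOLD
    )


# Hypothetical set of four cases; the last one has been zeroed.
scores = [0.9125, 1.0, 0.95, 0.0]
print(round(capability_score(scores), 4))
print(capability_passes(scores, refusal_accuracy=1.0))
```

In this hypothetical set, one zeroed case out of four pulls the mean below 0.85, so the capability fails even though the other three cases score well.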


6. Grounded Refusal Accuracy Rate

Track separately from the composite case score:

$$\text{Grounded Refusal Accuracy} = \frac{\text{Out-of-scope questions correctly refused}}{\text{Total out-of-scope questions}}$$

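A "correct refusal" requires D5 ≥ 0.75. Because the D5 scale in Section 2 has no value between 0.50 and 1.0, only a specific, grounded refusal (D5 = 1.0) currently meets this bar. A non-specific refusal (0.50), an unexplained "I don't know" (0.25), or a hallucinated answer (0.0) counts as not correctly refused.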

Minimum threshold: 0.95 — meaning no more than 1 in 20 out-of-scope questions may result in a non-grounded response.
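The rate can be computed directly from the D5 scores of the out-of-scope cases. This sketch assumes the input list contains only out-of-scope questions; the names are illustrative.

```python
CORRECT_REFUSAL_MIN_D5 = 0.75  # a refusal counts as correct when D5 >= 0.75
MIN_REFUSAL_ACCURACY = 0.95


def grounded_refusal_accuracy(out_of_scope_d5: list[float]) -> float:
    """Fraction of out-of-scope questions whose D5 meets the correct-refusal bar."""
    if not out_of_scope_d5:
        raise ValueError("no out-of-scope questions in the capability set")
    correct = sum(1 for d5 in out_of_scope_d5 if d5 >= CORRECT_REFUSAL_MIN_D5)
    return correct / len(out_of_scope_d5)


# Hypothetical set of 20 out-of-scope questions: 19 grounded refusals and
# one non-specific refusal (D5 = 0.50).
d5_scores = [1.0] * 19 + [0.50]
rate = grounded_refusal_accuracy(d5_scores)
print(rate, rate >= MIN_REFUSAL_ACCURACY)
```

With 19 of 20 correct, the rate sits exactly at the threshold; a second miss would drop it to 0.90 and fail the requirement.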