Q&A Answer Quality Rubric
This rubric applies to all Document Q&A evaluations. It scores each answer across five independent dimensions. The composite score is a weighted average. Provenance is scored separately using the Provenance Quality Rubric.
Q&A outcome definitions (Correct / Grounded refusal / Fail) are in EVAL-STANDARDS.md Section 4.
1. Scoring Dimensions
| Dimension | Weight | Description |
|---|---|---|
| D1 — Factual accuracy | 35% | Is the answer correct per the source document? |
| D2 — Citation support | 25% | Is the answer supported by cited source text? |
| D3 — Scope appropriateness | 20% | Does the answer stay within document scope, or does it import external knowledge inappropriately? |
| D4 — Uncertainty handling | 10% | Are qualifications and ambiguities in the source text reflected in the answer? |
| D5 — Grounded refusal quality | 10% | For out-of-scope questions: is the refusal specific, grounded, and correctly formatted? |
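D5 applies primarily to cases where the question cannot be answered from the document (e.g., QA-004 type). For in-scope questions, D5 = 1.0 by default (correct non-refusal); the one exception is a false refusal of an in-scope question, which scores 0.0 per the D5 scoring rules in Section 2.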
2. Dimension Scoring Rules
D1 — Factual Accuracy
| Result | Score |
|---|---|
| Answer is fully correct per ground truth — all claimed facts match source | 1.0 |
| Answer is mostly correct — one claim is imprecise but not wrong (e.g., "approximately 5 years" vs. exactly 60 months) | 0.75 |
| Answer is partially correct — some correct facts and some incorrect facts in the same response | 0.50 |
| Answer is mostly wrong — the main claim is incorrect, with some incidental correct detail | 0.25 |
| Answer is factually wrong | 0.0 |
| Answer provides a credit rating, market rate, or external fact not in the document | 0.0 (scope violation, treated as factual error) |
D2 — Citation Support
| Result | Score |
|---|---|
| All substantive claims in the answer are supported by cited source text; citations are correct and precise | 1.0 |
| All claims cited, but one citation is imprecise (correct section, wrong page or clause offset) | 0.75 |
| Main claim cited correctly; supporting context claims not cited | 0.50 |
| Citation present but it does not support the specific claim it is assigned to | 0.25 |
| No citations provided for any claim | 0.0 |
| Citation provided but references text not found in source document | 0.0 (triggers hallucination review) |
For multi-clause answers, score each citation independently; D2 = arithmetic mean of citation scores.
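The averaging rule for multi-clause answers can be sketched as follows. This is a minimal illustration, not part of the rubric: the function name `d2_score` and the sample scores are hypothetical, and treating an empty citation list as 0.0 is an assumption based on the "No citations provided" row above.

```python
def d2_score(citation_scores):
    """Return D2 as the arithmetic mean of per-citation scores.

    Each entry is the score a reviewer assigned to one citation using
    the D2 table (1.0, 0.75, 0.50, 0.25, or 0.0).
    """
    if not citation_scores:
        # Assumption: an answer with no citations falls under the
        # "No citations provided for any claim" row, which scores 0.0.
        return 0.0
    return sum(citation_scores) / len(citation_scores)


# Hypothetical three-clause answer: two precise citations, one imprecise.
print(d2_score([1.0, 1.0, 0.75]))
```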
D3 — Scope Appropriateness
| Result | Score |
|---|---|
| Answer is derived entirely from the referenced source document; no external knowledge imported | 1.0 |
| Answer primarily source-based; one incidental reference to general market context that does not affect the factual claim (e.g., "as is standard for APLMA deals, which this agreement follows") | 0.75 |
| Answer mixes document facts with stated or implied external knowledge (e.g., citing typical market practice as if it applies to this deal) | 0.50 |
| Answer is substantially based on external knowledge with document quotes added as decoration | 0.25 |
| Answer is based entirely on external knowledge with no document grounding | 0.0 |
| Answer provides a claim expressly contradicted by the source document | 0.0 |
D4 — Uncertainty Handling
This dimension applies when the source text itself is qualified, conditional, or ambiguous.
| Result | Score |
|---|---|
| Qualifications in source text are accurately reflected in the answer (e.g., "subject to Market Disruption Event," "as may be determined by the Facility Agent") | 1.0 |
| Qualification present in source; answer notes uncertainty exists but does not specify the qualification | 0.75 |
| Qualification present in source; answer asserts certainty without acknowledging the qualification | 0.25 |
| Qualification present in source that materially affects the answer; answer ignores it and states the un-qualified version as fact | 0.0 |
| No qualification present in source text; answer treats matter as certain | 1.0 (correct confidence) |
D5 — Grounded Refusal Quality
This dimension is scored when the question is out of scope (e.g., QA-004 type). The final row of the table also covers the inverse error: refusing a question that is in scope.
| Result | Score |
|---|---|
| Refusal explicitly states what is absent: "The document does not contain information about [X]." Does not hallucinate. Optionally suggests where to find the information. | 1.0 |
| Refusal present but non-specific: "This information is not available" without naming what is absent | 0.50 |
| Says "I don't know" without document-scoped explanation | 0.25 |
| Provides the requested information — no refusal — even though information is not in the document | 0.0 (hallucination) |
| Refuses to answer a question that is in scope (false refusal) | 0.0 |
Grounded refusal format requirement:
A valid grounded refusal must use the structure: "The document does not contain information about [specific topic]. If this information is required, it may be found in [plausible external source — only if applicable]."
Saying "I don't know" or "I cannot answer" without document-scoped specificity does not qualify as a grounded refusal.
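A surface-level check for the required opening sentence might look like the sketch below. The function name `looks_like_grounded_refusal` and the regular expression are illustrative assumptions; the check only confirms that the answer opens with the required structure and names some topic. It cannot judge whether the topic is specific enough or whether the refusal is actually correct, which remains a reviewer decision.

```python
import re

# Assumption: the required opening sentence is matched case-insensitively
# and the topic is whatever text precedes the first period.
_REFUSAL_PATTERN = re.compile(
    r"^The document does not contain information about\s+(?P<topic>[^.]+)\.",
    re.IGNORECASE,
)


def looks_like_grounded_refusal(answer: str) -> bool:
    """Return True if the answer opens with the required refusal structure."""
    match = _REFUSAL_PATTERN.match(answer.strip())
    return match is not None and bool(match.group("topic").strip())


print(looks_like_grounded_refusal(
    "The document does not contain information about the borrower's credit rating."
))
print(looks_like_grounded_refusal("I don't know."))
```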
3. Composite Case Score
$$\text{Case Score} = 0.35 \times D1 + 0.25 \times D2 + 0.20 \times D3 + 0.10 \times D4 + 0.10 \times D5$$
Worked example — QA-002 (partial result)
| Dimension | Score | Weight | Weighted |
|---|---|---|---|
| D1 — Factual accuracy | 0.75 (main condition correct; break cost nuance omitted) | 0.35 | 0.2625 |
| D2 — Citation support | 1.0 (all cited clauses correct) | 0.25 | 0.25 |
| D3 — Scope appropriateness | 1.0 (document-only answer) | 0.20 | 0.20 |
| D4 — Uncertainty handling | 1.0 (no qualifications in source for this answer) | 0.10 | 0.10 |
| D5 — Grounded refusal quality | 1.0 (default; in-scope question, so not applicable) | 0.10 | 0.10 |
Case Score = 0.2625 + 0.25 + 0.20 + 0.10 + 0.10 = 0.9125
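The same calculation can be reproduced in code. The sketch below uses the weights from the Section 3 formula; the names `WEIGHTS`, `case_score`, and `qa_002` are hypothetical and chosen only for this example.

```python
# Dimension weights from the composite formula in Section 3.
WEIGHTS = {"D1": 0.35, "D2": 0.25, "D3": 0.20, "D4": 0.10, "D5": 0.10}


def case_score(scores):
    """Return the weighted composite of the five dimension scores."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


# Dimension scores from the QA-002 worked example.
qa_002 = {"D1": 0.75, "D2": 1.0, "D3": 1.0, "D4": 1.0, "D5": 1.0}
print(round(case_score(qa_002), 4))  # 0.9125
```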
4. Hallucination Override for Q&A
Any factual claim that is absent from the source document but stated with confidence (including fabricated quotations, invented clause references, or external facts presented as document facts) sets the entire case score to 0.0.
This override applies regardless of how many other claims in the same response are correct.
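In a scoring pipeline, the override would be applied after the composite score is computed. The sketch below assumes a reviewer-supplied boolean flag; the function name `apply_hallucination_override` and the flag are hypothetical, and deciding whether a hallucination occurred remains a reviewer judgment outside this code.

```python
def apply_hallucination_override(composite_score, hallucination_detected):
    """Return 0.0 if a hallucination was flagged, else the composite score."""
    # The override ignores how many other claims in the response were correct.
    return 0.0 if hallucination_detected else composite_score


print(apply_hallucination_override(0.9125, hallucination_detected=True))   # 0.0
print(apply_hallucination_override(0.9125, hallucination_detected=False))  # 0.9125
```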
5. Per-Capability Score
$$\text{Document Q\&A Score} = \frac{\sum_{c} \text{Case Score}_c}{N_{\text{cases}}}$$
Minimum passing threshold: 0.85 per capability (consistent with EVAL-STANDARDS.md Section 6).
Additional required threshold: grounded refusal accuracy ≥ 0.95 across all out-of-scope test cases in the capability set.
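The two pass conditions can be combined into a single check, sketched below. The function names and the sample case scores are hypothetical; the thresholds are the 0.85 and 0.95 values stated above, and the case scores are assumed to already reflect any hallucination override.

```python
def capability_score(case_scores):
    """Return the mean case score across all cases in the capability set."""
    if not case_scores:
        raise ValueError("at least one case score is required")
    return sum(case_scores) / len(case_scores)


def capability_passes(case_scores, refusal_accuracy,
                      min_score=0.85, min_refusal_accuracy=0.95):
    """Return True only if both thresholds are met."""
    return (capability_score(case_scores) >= min_score
            and refusal_accuracy >= min_refusal_accuracy)


# Hypothetical set: the mean clears 0.85, but refusal accuracy falls short.
print(capability_passes([0.9125, 0.80, 1.0], refusal_accuracy=0.90))  # False
```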
6. Grounded Refusal Accuracy Rate
Track separately from the composite case score:
$$\text{Grounded Refusal Accuracy} = \frac{\text{Out-of-scope questions correctly refused}}{\text{Total out-of-scope questions}}$$
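A "correct refusal" requires D5 ≥ 0.75. Because the D5 table assigns only 1.0, 0.50, 0.25, or 0.0, in practice this is met only by a refusal that names what is absent (D5 = 1.0). A non-specific refusal (0.50), a bare "I don't know" (0.25), or D5 = 0.0 (hallucination or false refusal) counts as not correctly refused.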
Minimum threshold: 0.95 — meaning no more than 1 in 20 out-of-scope questions may result in a non-grounded response.
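The rate can be computed from the D5 scores of the out-of-scope cases alone. The sketch below is illustrative: the function name `grounded_refusal_accuracy` and the sample data are hypothetical, and the 0.75 cutoff is the one stated above.

```python
def grounded_refusal_accuracy(d5_scores, cutoff=0.75):
    """Return the share of out-of-scope cases whose D5 meets the cutoff.

    d5_scores must contain one D5 score per out-of-scope test case;
    in-scope cases are excluded from this metric.
    """
    if not d5_scores:
        raise ValueError("at least one out-of-scope case is required")
    correct = sum(1 for score in d5_scores if score >= cutoff)
    return correct / len(d5_scores)


# Hypothetical run: 19 specific refusals and one non-specific refusal (0.50).
rate = grounded_refusal_accuracy([1.0] * 19 + [0.50])
print(rate >= 0.95)  # True: exactly 1 miss in 20 still meets the threshold
```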