Q&A Answer Quality Rubric

This rubric applies to all Document Q&A evaluations. It scores each answer across five independent dimensions. The composite score is a weighted average. Provenance is scored separately using the Provenance Quality Rubric.

Q&A outcome definitions (Correct / Grounded refusal / Fail) are in EVAL-STANDARDS.md Section 4.

1. Scoring Dimensions

Dimension	Weight	Description
D1 — Factual accuracy	35%	Is the answer correct per the source document?
D2 — Citation support	25%	Is the answer supported by cited source text?
D3 — Scope appropriateness	20%	Does the answer stay within document scope, or does it import external knowledge inappropriately?
D4 — Uncertainty handling	10%	Are qualifications and ambiguities in the source text reflected in the answer?
D5 — Grounded refusal quality	10%	For out-of-scope questions: is the refusal specific, grounded, and correctly formatted?

D5 applicability

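D5 applies only to cases where the question cannot be answered from the document (e.g., QA-004 type). For in-scope questions, D5 = 1.0 by default (correct non-refusal). The one exception is a false refusal of an in-scope question, which scores D5 = 0.0 (see the D5 table in Section 2).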

2. Dimension Scoring Rules

D1 — Factual Accuracy

Result	Score
Answer is fully correct per ground truth — all claimed facts match source	1.0
Answer is mostly correct — one claim is imprecise but not wrong (e.g., "approximately 5 years" vs. exactly 60 months)	0.75
Answer is partially correct — some correct facts and some incorrect facts in the same response	0.50
Answer is mostly wrong — the main claim is incorrect, with some incidental correct detail	0.25
Answer is factually wrong	0.0
Answer provides a credit rating, market rate, or external fact not in the document	0.0 (scope violation, treated as factual error)

D2 — Citation Support

Result	Score
All substantive claims in the answer are supported by cited source text; citations are correct and precise	1.0
All claims cited, but one citation is imprecise (correct section, wrong page or clause offset)	0.75
Main claim cited correctly; supporting context claims not cited	0.50
Citation present but it does not support the specific claim it is assigned to	0.25
No citations provided for any claim	0.0
Citation provided but references text not found in source document	0.0 (triggers hallucination review)

For multi-clause answers, score each citation independently; D2 = arithmetic mean of citation scores.
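The averaging rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `d2_score` and the example scores are hypothetical, and the empty-list case is mapped to 0.0 on the assumption that "no citations for any claim" is the intended reading.

```python
from statistics import mean

def d2_score(citation_scores):
    """Return D2 as the arithmetic mean of the per-citation scores."""
    if not citation_scores:
        # Assumption: an answer with no citations at all scores 0.0,
        # matching the "No citations provided for any claim" row.
        return 0.0
    return mean(citation_scores)

# Hypothetical three-clause answer: two precise citations, one imprecise.
print(d2_score([1.0, 1.0, 0.75]))
```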

D3 — Scope Appropriateness

Result	Score
Answer is derived entirely from the referenced source document; no external knowledge imported	1.0
Answer primarily source-based; one incidental reference to general market context that does not affect the factual claim (e.g., "as is standard for APLMA deals, which this agreement follows")	0.75
Answer mixes document facts with stated or implied external knowledge (e.g., citing typical market practice as if it applies to this deal)	0.50
Answer is substantially based on external knowledge with document quotes added as decoration	0.25
Answer is based entirely on external knowledge with no document grounding	0.0
Answer provides a claim expressly contradicted by the source document	0.0

D4 — Uncertainty Handling

This dimension applies when the source text itself is qualified, conditional, or ambiguous.

Result	Score
Qualifications in source text are accurately reflected in the answer (e.g., "subject to Market Disruption Event," "as may be determined by the Facility Agent")	1.0
Qualification present in source; answer notes uncertainty exists but does not specify the qualification	0.75
Qualification present in source; answer asserts certainty without acknowledging the qualification	0.25
Qualification present in source that materially affects the answer; answer ignores it and states the un-qualified version as fact	0.0
No qualification present in source text; answer treats matter as certain	1.0 (correct confidence)

D5 — Grounded Refusal Quality

This dimension applies only when the question is out of scope (e.g., QA-004 type).

Result	Score
Refusal explicitly states what is absent: "The document does not contain information about [X]." Does not hallucinate. Optionally suggests where to find the information.	1.0
Refusal present but non-specific: "This information is not available" without naming what is absent	0.50
Says "I don't know" without document-scoped explanation	0.25
Provides the requested information — no refusal — even though information is not in the document	0.0 (hallucination)
Refuses to answer a question that is in scope (false refusal)	0.0

Grounded refusal format requirement:
A valid grounded refusal must use the structure: "The document does not contain information about [specific topic]. If this information is required, it may be found in [plausible external source — only if applicable]."

Saying "I don't know" or "I cannot answer" without document-scoped specificity does not qualify as a grounded refusal.
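A format pre-check for the required structure might look like the sketch below. It only tests whether the answer opens with the mandated sentence and names a non-empty topic; it does not assign a D5 score and cannot tell whether the named topic is actually absent from the document. The function name and the regular expression are assumptions made for illustration.

```python
import re

# Assumed pattern: the required opening sentence followed by a non-empty topic.
_REFUSAL_RE = re.compile(
    r"^The document does not contain information about\s+(?P<topic>[^.]+)\."
)

def has_grounded_refusal_format(answer: str) -> bool:
    """Return True if the answer opens with the required refusal sentence."""
    return _REFUSAL_RE.match(answer.strip()) is not None

print(has_grounded_refusal_format(
    "The document does not contain information about the borrower's credit rating."
))
print(has_grounded_refusal_format("I don't know."))
```

A check like this is best treated as a filter ahead of human or model-based D5 scoring, since a correctly formatted refusal can still be a false refusal.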

3. Composite Case Score

$$\text{Case Score} = 0.35 \times D1 + 0.25 \times D2 + 0.20 \times D3 + 0.10 \times D4 + 0.10 \times D5$$

Worked example — QA-002 (partial result)

Dimension	Score	Weight	Weighted
D1 — Factual accuracy	0.75 (main condition correct; break cost nuance omitted)	0.35	0.2625
D2 — Citation support	1.0 (all cited clauses correct)	0.25	0.25
D3 — Scope appropriateness	1.0 (document-only answer)	0.20	0.20
D4 — Uncertainty handling	1.0 (no qualifications in source for this answer)	0.10	0.10
D5 — Grounded refusal quality	1.0 (default; in-scope question, so D5 is not applicable)	0.10	0.10

Case Score = 0.2625 + 0.25 + 0.20 + 0.10 + 0.10 = 0.9125
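The composite formula and the QA-002 arithmetic can be reproduced with a short sketch. The weights come from Section 1; the names `WEIGHTS`, `case_score`, and `qa_002` are illustrative, not part of any existing tooling.

```python
# Dimension weights from Section 1 (they sum to 1.0).
WEIGHTS = {"D1": 0.35, "D2": 0.25, "D3": 0.20, "D4": 0.10, "D5": 0.10}

def case_score(scores):
    """Return the weighted sum of the five dimension scores."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Dimension scores from the QA-002 worked example.
qa_002 = {"D1": 0.75, "D2": 1.0, "D3": 1.0, "D4": 1.0, "D5": 1.0}
print(round(case_score(qa_002), 4))  # 0.9125
```

Rounding to four decimal places avoids spurious floating-point digits when comparing against the hand-computed value.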

4. Hallucination Override for Q&A

Any factual claim made with false confidence that is not present in the source document — including fabricated quotations, invented clause references, or external facts stated as document facts — sets the entire case score to 0.0.

This override applies regardless of how many other claims in the same response are correct.
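In code, the override is a single guard applied after the composite score is computed. The sketch below assumes a boolean flag, here called `hallucination_detected`, supplied by a reviewer or an upstream check; how that flag is produced is outside this rubric.

```python
def apply_hallucination_override(case_score: float, hallucination_detected: bool) -> float:
    """Zero the case score when any hallucinated claim was found."""
    # The override ignores how many other claims in the response were correct.
    return 0.0 if hallucination_detected else case_score

print(apply_hallucination_override(0.9125, hallucination_detected=True))
print(apply_hallucination_override(0.9125, hallucination_detected=False))
```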

5. Per-Capability Score

$$\text{Document Q\&A Score} = \frac{\sum_c \text{Case Score}_c}{N_{\text{cases}}}$$

Minimum passing threshold: 0.85 per capability (consistent with EVAL-STANDARDS.md Section 6).

Additional required threshold: grounded refusal accuracy ≥ 0.95 across all out-of-scope test cases in the capability set.
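The two pass conditions can be checked together, as in the following sketch. It assumes the case scores already have any hallucination override applied and that the grounded refusal accuracy has been computed separately; the function and constant names are hypothetical.

```python
from statistics import mean

PASS_THRESHOLD = 0.85      # minimum per-capability score
REFUSAL_THRESHOLD = 0.95   # minimum grounded refusal accuracy

def capability_result(case_scores, refusal_accuracy):
    """Return (capability_score, passed) for one capability set."""
    if not case_scores:
        raise ValueError("at least one case score is required")
    capability_score = mean(case_scores)
    passed = (
        capability_score >= PASS_THRESHOLD
        and refusal_accuracy >= REFUSAL_THRESHOLD
    )
    return capability_score, passed

# Hypothetical run: one case zeroed by the hallucination override.
print(capability_result([0.9125, 1.0, 0.0, 0.95], refusal_accuracy=1.0))
```

Both conditions must hold: a high average does not compensate for a refusal accuracy below 0.95, and vice versa.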

6. Grounded Refusal Accuracy Rate

Track separately from the composite case score:

$$\text{Grounded Refusal Accuracy} = \frac{\text{Out-of-scope questions correctly refused}}{\text{Total out-of-scope questions}}$$

A "correct refusal" requires D5 ≥ 0.75. Because the D5 table assigns only 1.0, 0.50, 0.25, or 0.0, in practice this means D5 = 1.0: the refusal is present and names what is absent. D5 = 0.50 or 0.25 (generic non-answer) or D5 = 0.0 (hallucination or false refusal) counts as not correctly refused.

Minimum threshold: 0.95 — meaning no more than 1 in 20 out-of-scope questions may result in a non-grounded response.
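A minimal sketch of the rate calculation follows, assuming the input is the list of D5 scores for the out-of-scope cases only. The function name and the sample data are illustrative.

```python
def grounded_refusal_accuracy(d5_scores, min_d5=0.75):
    """Return the share of out-of-scope cases with D5 at or above min_d5."""
    if not d5_scores:
        raise ValueError("no out-of-scope cases to evaluate")
    correct = sum(1 for score in d5_scores if score >= min_d5)
    return correct / len(d5_scores)

# Hypothetical set of 20 out-of-scope cases: 19 specific refusals, 1 generic.
d5_scores = [1.0] * 19 + [0.50]
rate = grounded_refusal_accuracy(d5_scores)
print(rate, rate >= 0.95)
```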
