Verification Guide

Guidelines for verifying Response Dependent Tasks using assertion operators.


Each Response Dependent Task Can Have

  • State modification (verified by objective assertions, for example JSON path assertions)
  • Factual verification using FACTUAL_VERIFICATION operator
  • Reasoning quality assessment using REASONING_QUALITY operator
  • Information precision assessment using INFORMATION_PRECISION operator

FACTUAL_VERIFICATION

Assertion Object

{
  "operator": "FACTUAL_VERIFICATION",
  "description": "Verify correct retrieval of store prices and delivery times.",
  "expected_facts": [
    {
      "fact": "Safeway price is $5.89",
      "weight": 1
    },
    {
      "fact": "Gus's price is $3.29",
      "weight": 1
    },
    {
      "fact": "Safeway delivery time is 34 minutes",
      "weight": 1.1
    }
  ],
  "pass_threshold_percent": 80
}

Judge Prompts

  1. Integration of Assertion Details: The judge prompt dynamically incorporates all specific details provided in the assertion object (facts, reasoning aspects, etc.) to ensure the evaluation is tailored to the specific task.

  2. Scoring Focus: The prompt instructs the model to provide individual scores for each verified element. The backend system handles all subsequent aggregation and average calculations to ensure mathematical accuracy.

  3. Clarity and Precision Penalties: Prompts include explicit instructions to penalize responses that contain statements weakening or obscuring the clarity of facts or reasoning. Scores are reduced for vague qualifiers, hedging, or irrelevant information.

  4. Comprehensive Few-Shot Examples: To standardize scoring, the prompt includes examples across the full range of complexity and quality levels. Each example clearly demonstrates why a specific score was assigned, particularly illustrating the application of clarity and precision penalties.
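This guide does not spell out the aggregation mentioned in item 2. One plausible reading, shown here as a minimal sketch, treats each 1-5 score as a fraction of the maximum, takes the weight-averaged result as a percentage, and compares it with `pass_threshold_percent`. The function names, the normalization by 5, and the "meets or exceeds" comparison are assumptions, not the backend's documented behavior.

```python
import math

MAX_SCORE = 5  # top of the 1-5 scale used by the judge templates


def weighted_percent(scores, weights):
    """Weight-averaged score as a percentage of the maximum (assumed formula)."""
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and the same length")
    earned = sum(s * w for s, w in zip(scores, weights))
    possible = MAX_SCORE * sum(weights)
    return 100.0 * earned / possible


def passes(scores, weights, pass_threshold_percent):
    """Assumed pass rule: the weighted percentage meets or exceeds the threshold."""
    percent = weighted_percent(scores, weights)
    # Tolerate floating-point noise when the result lands exactly on the threshold.
    return percent > pass_threshold_percent or math.isclose(percent, pass_threshold_percent)


# Weights from the FACTUAL_VERIFICATION assertion above, with illustrative scores.
weights = [1, 1, 1.1]
scores = [5, 3, 4]
percent = weighted_percent(scores, weights)
print(round(percent, 1), passes(scores, weights, 80))
```

Under these assumptions the example works out to 12.4 / 15.5, or 80%, which sits exactly on the threshold. Whether a score on the boundary should pass is a policy choice that the sketch resolves in favor of passing.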


Enhanced Prompt Templates

1. Factual Judge Template

You are a factual accuracy evaluator. Your task is to score how accurately and clearly specific facts are presented in a model response.

TASK DESCRIPTION:
{{description}}

FACTS TO VERIFY:
{{#each expected_facts}}
{{@index}}. {{fact}}
{{/each}}

MODEL RESPONSE:
{{modelResponse}}

SCORING INSTRUCTIONS:
a) Score each fact from 1 to 5 based on accuracy and clarity
b) 5 = Fact stated perfectly and clearly
c) 4 = Fact stated accurately with minor clarity issues
d) 3 = Fact stated with some inaccuracy or significant clarity issues
e) 2 = Fact partially mentioned but unclear or mostly inaccurate
f) 1 = Fact missing, completely wrong, or so unclear as to be meaningless

CLARITY PENALTY RULE:
Reduce scores for any statements that weaken or obscure the clarity of facts, including:
- Vague qualifiers ("around", "approximately", "about", "roughly")
- Contradictory information within the response
- Hedging language that creates uncertainty ("I think", "it seems", "maybe")
- Buried facts within irrelevant information
- Ambiguous references that make facts unclear

FEW-SHOT EXAMPLES:

EXAMPLE 1 - Perfect Clarity (Score: 5)
Fact to verify: "Safeway price is $5.89"
Response: "The Clamato Tomato Juice at Safeway costs $5.89."
Score: 5
Reasoning: Exact price stated clearly and unambiguously.

EXAMPLE 2 - Minor Clarity Issue (Score: 4)
Fact to verify: "Safeway price is $5.89"
Response: "Safeway's price for the item is $5.89, which I found during my search."
Score: 4
Reasoning: Correct price, but "the item" is less specific than naming the product directly.

EXAMPLE 3 - Vague Qualifier (Score: 2)
Fact to verify: "Safeway price is $5.89"
Response: "Safeway costs around $6 or so for the Clamato juice."
Score: 2
Reasoning: Approximate price given instead of the exact amount. "Around" and "or so" weaken factual clarity.

EXAMPLE 4 - Contradictory Information (Score: 2)
Fact to verify: "Gus's price is $3.29"
Response: "Gus's Community Market has Gatorade for $3.29, though I also saw it listed as $3.49 in another section."
Score: 2
Reasoning: Correct price stated but immediately contradicted, creating confusion about the actual cost.

EXAMPLE 5 - Hedging Language (Score: 3)
Fact to verify: "Safeway delivery time is 34 minutes"
Response: "I think Safeway's delivery time was something like 34 minutes, if I remember correctly."
Score: 3
Reasoning: Correct time mentioned, but hedging language ("I think", "something like", "if I remember correctly") undermines confidence in the fact.

EXAMPLE 6 - Buried in Irrelevant Information (Score: 3)
Fact to verify: "Gus's price is $3.29"
Response: "Gus's Community Market has a nice atmosphere with friendly staff and good produce displays. I noticed they had Gatorade Thirst Quencher, which costs $3.29, alongside many other beverages and snacks in their refrigerated section."
Score: 3
Reasoning: Correct price stated, but buried in excessive irrelevant details about store atmosphere and layout.

EXAMPLE 7 - Ambiguous Reference (Score: 2)
Fact to verify: "Safeway delivery time is 34 minutes"
Response: "Both stores offer delivery. The second store I visited delivers in 34 minutes."
Score: 2
Reasoning: Correct time but unclear reference - "second store" requires the reader to infer this means Safeway.

EXAMPLE 8 - Missing Fact (Score: 1)
Fact to verify: "Safeway price is $5.89"
Response: "I compared prices at both stores and found significant differences."
Score: 1
Reasoning: No specific price information provided for Safeway.

EXAMPLE 9 - Completely Wrong (Score: 1)
Fact to verify: "Gus's delivery time is 39 minutes"
Response: "Gus's Community Market offers delivery in 25 minutes."
Score: 1
Reasoning: Factually incorrect delivery time stated.

EXAMPLE 10 - Mostly Accurate with Minor Issue (Score: 4)
Fact to verify: "Gus's delivery time is 39 minutes"
Response: "Gus's Community Market has a delivery time of approximately 39 minutes."
Score: 4
Reasoning: Correct time, but "approximately" slightly weakens the clarity of the exact fact.

RESPONSE FORMAT:
Return only a JSON object with this exact structure:
{
  "operator": "FACTUAL_VERIFICATION",
  "fact_scores": {
    "Safeway price": 5,
    "Gus price": 3,
    "Delivery time": 4
  },
  "error": null
}

Do not calculate averages or overall scores. Provide only individual fact scores as integers 1-5.
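Because the backend aggregates the scores, it can help to reject malformed judge output before doing so. The sketch below checks a response against the format above. The function name `parse_judge_response`, the `score_key` parameter, and the specific checks are illustrative assumptions rather than documented backend behavior.

```python
import json


def parse_judge_response(raw, operator, score_key):
    """Parse judge output and return its scores dict, or raise ValueError."""
    data = json.loads(raw)  # invalid JSON raises JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    if data.get("operator") != operator:
        raise ValueError(f"expected operator {operator!r}, got {data.get('operator')!r}")
    if data.get("error") is not None:
        raise ValueError(f"judge reported an error: {data['error']}")
    scores = data.get(score_key)
    if not isinstance(scores, dict) or not scores:
        raise ValueError(f"missing or empty {score_key!r}")
    for name, score in scores.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"score for {name!r} must be an integer from 1 to 5")
    return scores


raw = (
    '{"operator": "FACTUAL_VERIFICATION", '
    '"fact_scores": {"Safeway price": 5, "Gus price": 3, "Delivery time": 4}, '
    '"error": null}'
)
print(parse_judge_response(raw, "FACTUAL_VERIFICATION", "fact_scores"))
```

The same helper could be pointed at the other operators by passing `"aspect_scores"` or `"precision_scores"` as `score_key`.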

REASONING_QUALITY

Assertion Object

{
  "operator": "REASONING_QUALITY",
  "description": "Assess reasoning for comparing store prices and selecting best value.",
  "aspects": [
    {
      "aspect": "Reasoning correctly applies the rule (>10% difference = price-based recommendation).",
      "weight": 1
    },
    {
      "aspect": "Final recommendation aligns with reasoning and factual data.",
      "weight": 1.5
    }
  ],
  "pass_threshold_percent": 80
}

2. Reasoning Judge Template

You are a reasoning quality evaluator. Your task is to score how well the model demonstrates logical reasoning and draws valid conclusions.

TASK DESCRIPTION:
{{description}}

REASONING ASPECTS TO EVALUATE:
{{#each aspects}}
{{@index}}. {{aspect}}
{{/each}}

MODEL RESPONSE:
{{modelResponse}}

SCORING INSTRUCTIONS:
1. Score each reasoning aspect from 1 to 5 based on logical quality and validity
2. 5 = Reasoning is flawless, clearly explained, and perfectly applied
3. 4 = Reasoning is sound with minor gaps or clarity issues
4. 3 = Reasoning is generally correct but has some logical flaws or unclear steps
5. 2 = Reasoning is partially correct but contains significant logical errors
6. 1 = Reasoning is absent, fundamentally flawed, or completely incorrect

REASONING CLARITY PENALTY RULE:
Reduce scores for statements that weaken or obscure the reasoning process, including:
- Circular reasoning or logical fallacies
- Skipped logical steps without explanation
- Contradictory reasoning within the response
- Unsupported conclusions that don't follow from premises
- Vague or ambiguous reasoning that's hard to follow
- Incorrect application of given rules or criteria

FEW-SHOT EXAMPLES:

EXAMPLE 1 - Perfect Reasoning (Score: 5)
Aspect: "Reasoning correctly applies the rule (>10% difference = price-based recommendation)"
Response: "The price difference is 79% ((5.89-3.29)/3.29 × 100). Since 79% > 10%, I should recommend based on price, not delivery time."
Score: 5
Reasoning: Clear calculation, correct rule identification, proper application of logic.

EXAMPLE 2 - Sound with Minor Gap (Score: 4)
Aspect: "Final recommendation aligns with reasoning and factual data"
Response: "Since the price difference exceeds 10%, I recommend Gus's Community Market as it's the cheaper option at $3.29."
Score: 4
Reasoning: Correct recommendation and alignment, but doesn't explicitly show the percentage calculation.

EXAMPLE 3 - Generally Correct with Logical Flaw (Score: 3)
Aspect: "Reasoning correctly applies the rule (>10% difference = price-based recommendation)"
Response: "There's a big price difference between stores, much more than 10%, so I'll go with the cheaper store."
Score: 3
Reasoning: Understands the rule concept but lacks specific calculation and precise application.

EXAMPLE 4 - Partially Correct with Significant Error (Score: 2)
Aspect: "Final recommendation aligns with reasoning and factual data"
Response: "Since Safeway is faster (34 min vs 39 min) and the price difference is over 10%, I recommend Safeway for better value."
Score: 2
Reasoning: Correctly identifies delivery times and notes that the price difference exceeds 10%, but recommends the more expensive option, which contradicts the price-based rule.

EXAMPLE 5 - Contradictory Reasoning (Score: 2)
Aspect: "Reasoning correctly applies the rule (>10% difference = price-based recommendation)"
Response: "The price difference is significant at 79%, which means I should choose based on price. However, since delivery speed also matters, I'll factor that in too."
Score: 2
Reasoning: Correctly calculates and identifies the rule, but then contradicts it by adding delivery consideration.

EXAMPLE 6 - Skipped Logical Steps (Score: 3)
Aspect: "Reasoning correctly applies the rule (>10% difference = price-based recommendation)"
Response: "I need to compare prices. Gus's is cheaper, so I recommend Gus's."
Score: 3
Reasoning: Reaches the correct conclusion but skips the critical step of calculating the percentage difference and explicitly applying the >10% rule.

EXAMPLE 7 - Fundamentally Flawed (Score: 1)
Aspect: "Final recommendation aligns with reasoning and factual data"
Response: "I recommend Safeway because it has better customer service and store layout."
Score: 1
Reasoning: Recommendation based on irrelevant factors, completely ignoring price and delivery time data and task requirements.

EXAMPLE 8 - Absent Reasoning (Score: 1)
Aspect: "Reasoning correctly applies the rule (>10% difference = price-based recommendation)"
Response: "I compared both stores and made my choice."
Score: 1
Reasoning: No actual reasoning process shown, no rule application, no logical steps provided.

EXAMPLE 9 - Circular Reasoning (Score: 2)
Aspect: "Final recommendation aligns with reasoning and factual data"
Response: "I recommend Gus's because it's the better choice, and it's better because I recommend it based on the comparison."
Score: 2
Reasoning: Circular logic that doesn't actually provide valid reasoning for the recommendation.

EXAMPLE 10 - Minor Clarity Issue (Score: 4)
Aspect: "Reasoning correctly applies the rule (>10% difference = price-based recommendation)"
Response: "The prices are $3.29 and $5.89, which is definitely more than a 10% difference, so price should determine my recommendation."
Score: 4
Reasoning: Correct rule application and logic, but doesn't show the specific percentage calculation (79%).

RESPONSE FORMAT:
Return only a JSON object with this exact structure:
{
  "operator": "REASONING_QUALITY",
  "aspect_scores": {
    "Reasoning correctly applies the rule": 5,
    "Final recommendation aligns with reasoning": 3
  },
  "error": null
}

Do not calculate averages or overall scores. Provide only individual aspect scores as integers 1-5.

INFORMATION_PRECISION

Assertion Object

{
  "operator": "INFORMATION_PRECISION",
  "description": "Ensure the response includes only task relevant data and avoids hallucinations.",
  "expected_facts": [
    "Safeway price is $5.89",
    "Gus's price is $3.29",
    "Safeway delivery time is 34 minutes"
  ],
  "expected_reasonings": [
    "Reasoning to correctly apply the rule (>10% difference = price-based recommendation).",
    "Reasoning for final recommendation"
  ],
  "pass_threshold_percent": 80
}
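The Precision Judge Template that follows fills its two lists with `{{#each}}` loops. Assuming the placeholders are Handlebars-style, `{{@index}}` starts at 0, so the rendered numbering begins at `0.` rather than `1.`. A rough Python stand-in for that rendering is sketched below; the helper name `render_numbered` is hypothetical.

```python
def render_numbered(items, start=0):
    """Mimic `{{#each items}}{{@index}}. {{this}}{{/each}}`, one item per line."""
    return "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=start))


# Lists copied from the INFORMATION_PRECISION assertion object above.
assertion = {
    "expected_facts": [
        "Safeway price is $5.89",
        "Gus's price is $3.29",
        "Safeway delivery time is 34 minutes",
    ],
    "expected_reasonings": [
        "Reasoning to correctly apply the rule (>10% difference = price-based recommendation).",
        "Reasoning for final recommendation",
    ],
}

prompt_section = (
    "EXPECTED FACTS TO BE INCLUDED:\n"
    + render_numbered(assertion["expected_facts"])
    + "\n\nEXPECTED REASONING TO BE INCLUDED:\n"
    + render_numbered(assertion["expected_reasonings"])
)
print(prompt_section)
```

Passing `start=1` gives one-based numbering. In Handlebars itself that would typically need a custom helper, since the built-in `@index` has no inline offset.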

3. Precision Judge Template

You are an information precision evaluator. Your task is to score how precisely and accurately the model response addresses the task without including irrelevant information or hallucinations.

TASK DESCRIPTION:
{{description}}

EXPECTED FACTS TO BE INCLUDED:
{{#each expected_facts}}
{{@index}}. {{this}}
{{/each}}

EXPECTED REASONING TO BE INCLUDED:
{{#each expected_reasonings}}
{{@index}}. {{this}}
{{/each}}

MODEL RESPONSE:
{{modelResponse}}

SCORING INSTRUCTIONS:
1. Score each expected element from 1 to 5 based on precision and accuracy
2. 5 = Element present, accurate, and precisely stated without unnecessary elaboration
3. 4 = Element present and accurate with minor imprecision or slight irrelevant details
4. 3 = Element present but mixed with some irrelevant information or minor inaccuracies
5. 2 = Element partially present but obscured by significant irrelevant content or inaccuracies
6. 1 = Element missing, completely inaccurate, or so buried in irrelevant content as to be ineffective

PRECISION PENALTY RULES:
Reduce scores for any content that reduces information precision, including:
- Hallucinated or fabricated information not supported by available data
- Excessive irrelevant details that don't serve the task
- Speculative statements presented as facts
- Redundant or repetitive information
- Off-topic elaborations or tangential discussions
- Vague generalizations when specific information is required

FEW-SHOT EXAMPLES:

EXAMPLE 1 - Perfect Precision (Score: 5)
Expected: "Safeway price is $5.89"
Response: "Safeway's Clamato Tomato Juice costs $5.89."
Score: 5
Reasoning: Exact price stated clearly with the necessary product context only.

EXAMPLE 2 - Minor Imprecision (Score: 4)
Expected: "Reasoning for final recommendation"
Response: "I recommend Gus's Community Market because it's $2.60 cheaper ($5.89 - $3.29), and with a 79% price difference exceeding the 10% threshold, price takes priority over the 5-minute delivery advantage Safeway offers."
Score: 4
Reasoning: Complete reasoning is present with slight additional calculation detail that's relevant but not essential.

EXAMPLE 3 - Mixed with Irrelevant Content (Score: 3)
Expected: "Gus's price is $3.29"
Response: "Gus's Community Market, which has been serving the community for years with fresh produce and friendly staff, offers Gatorade Thirst Quencher for $3.29."
Score: 3
Reasoning: Correct price stated, but mixed with irrelevant historical and service information.

EXAMPLE 4 - Significant Irrelevant Content (Score: 2)
Expected: "Safeway delivery time is 34 minutes"
Response: "Safeway, a major grocery chain known for its wide selection of organic foods, premium deli section, and convenient pharmacy services, offers delivery in 34 minutes, which is competitive in today's fast-paced delivery market where customers expect quick service."
Score: 2
Reasoning: Correct delivery time is buried in excessive irrelevant information about store features and market context.

EXAMPLE 5 - Hallucinated Information (Score: 1)
Expected: "Reasoning to correctly apply the rule (>10% difference = price-based recommendation)"
Response: "Based on customer reviews and store ratings, plus my analysis of seasonal pricing trends and regional market conditions, I believe price should be the determining factor."
Score: 1
Reasoning: Contains fabricated information (reviews, ratings, trends) not available in the task and completely misses the actual >10% rule.

EXAMPLE 6 - Vague Generalization (Score: 2)
Expected: "Safeway price is $5.89"
Response: "Safeway has competitive pricing on its juice selection, with most items falling in the affordable range for budget-conscious shoppers."
Score: 2
Reasoning: Vague pricing generalization instead of specific required price information.

EXAMPLE 7 - Redundant Information (Score: 3)
Expected: "Gus's price is $3.29"
Response: "Gus's Community Market charges $3.29 for the Gatorade Thirst Quencher. This $3.29 price point at Gus's is quite reasonable. The cost of $3.29 makes it an attractive option."
Score: 3
Reasoning: Correct price stated but unnecessarily repeated three times, reducing precision through redundancy.

EXAMPLE 8 - Speculative Content (Score: 2)
Expected: "Safeway delivery time is 34 minutes"
Response: "Safeway delivers in 34 minutes, though this might vary depending on traffic conditions, driver availability, and order complexity, potentially ranging from 30-45 minutes in practice."
Score: 2
Reasoning: Correct base information, but mixed with speculative details not supported by available data.

EXAMPLE 9 - Missing Essential Element (Score: 1)
Expected: "Reasoning to correctly apply the rule (>10% difference = price-based recommendation)"
Response: "I've made my recommendation based on careful consideration of all factors."
Score: 1
Reasoning: No actual reasoning provided, just a vague statement about the consideration process.

EXAMPLE 10 - Off-topic Elaboration (Score: 2)
Expected: "Reasoning for final recommendation"
Response: "I recommend Gus's because it's cheaper. Speaking of groceries, it's interesting how delivery services have revolutionized shopping habits and changed consumer behavior patterns across different demographics."
Score: 2
Reasoning: Basic reasoning is present, but followed by a completely off-topic discussion about delivery services and demographics.

RESPONSE FORMAT:
Return only a JSON object with this exact structure:
{
  "operator": "INFORMATION_PRECISION",
  "precision_scores": {
    "Safeway price": 5,
    "Gus price": 4,
    "Safeway delivery time": 3,
    "Rule application reasoning": 4,
    "Final recommendation reasoning": 5
  },
  "error": null
}

Do not calculate averages or overall scores. Provide only individual precision scores as integers 1-5.