Verification Guide

Guidelines for verifying Response Dependent Tasks using assertion operators.

Each Response Dependent Task Can Have

State modification (verified by objective assertions, for example JSON path assertions)
Factual verification using the FACTUAL_VERIFICATION operator
Reasoning quality assessment using the REASONING_QUALITY operator
Information precision assessment using the INFORMATION_PRECISION operator

FACTUAL_VERIFICATION

Assertion Object

{
  "operator": "FACTUAL_VERIFICATION",
  "description": "Verify correct retrieval of store prices and delivery times.",
  "expected_facts": [
    {
      "fact": "Safeway price is $5.89",
      "weight": 1
    },
    {
      "fact": "Gus's price is $3.29",
      "weight": 1
    },
    {
      "fact": "Safeway delivery time is 34 minutes",
      "weight": 1.1
    }
  ],
  "pass_threshold_percent": 80
}

Judge Prompts

Insert the details from the assertion object into the appropriate places in the prompt.
Ask the judge model to return only a score for each fact. The backend calculates the average.
Instruct the judge to take into account any statements that weaken or obscure the clarity of the facts. If such statements are present, the score should be lower.
Add examples for each level of complexity, and explain for each one why that score was given. Include examples that cover the third point above (statements that weaken or obscure the facts).
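
The guide leaves the backend calculation unspecified, so the following is only a minimal sketch of one way it could work. It assumes a weighted mean of the 1-5 scores, expressed as a percentage of the maximum score and compared against `pass_threshold_percent`. The function names `weighted_percent` and `passes` are hypothetical, and the normalisation by the maximum score is an assumption, not something the guide states.

```python
from typing import Sequence


def weighted_percent(
    scores: Sequence[int],
    weights: Sequence[float],
    max_score: int = 5,
) -> float:
    """Weighted mean of per-item scores as a percentage of max_score.

    Assumption: the guide does not define this formula; dividing by
    max_score to obtain a percentage is one plausible reading.
    """
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal in length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(score * weight for score, weight in zip(scores, weights))
    return 100.0 * weighted_sum / (max_score * total_weight)


def passes(
    scores: Sequence[int],
    weights: Sequence[float],
    pass_threshold_percent: float,
) -> bool:
    """True when the weighted percentage reaches the threshold."""
    return weighted_percent(scores, weights) >= pass_threshold_percent


# Weights come from the assertion object above; the scores are made-up
# judge output, listed in the same order as expected_facts.
weights = [1, 1, 1.1]
print(passes([5, 4, 4], weights, 80))
print(passes([5, 3, 3], weights, 80))
```

Keeping the arithmetic in the backend means the judge only has to produce integers, which is easier to validate than a model-computed average.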

Enhanced Prompt Templates

1. Factual Judge Template

You are a factual accuracy evaluator. Your task is to score how accurately and clearly specific facts are presented in a model response.

TASK DESCRIPTION: {{description}}

FACTS TO VERIFY:

{{#each expected_facts}}
{{@index}}. {{fact}}
{{/each}}

MODEL RESPONSE: {{modelResponse}}

Scoring Instructions

Score	Description
5	Fact stated perfectly and clearly
4	Fact stated accurately with minor clarity issues
3	Fact stated with some inaccuracy or significant clarity issues
2	Fact partially mentioned but unclear or mostly inaccurate
1	Fact missing, completely wrong, or so unclear as to be meaningless

Clarity Penalty Rule

Reduce scores for any statements that weaken or obscure the clarity of facts, including:

Vague qualifiers ("around", "approximately", "about", "roughly")
Contradictory information within the response
Hedging language that creates uncertainty ("I think", "it seems", "maybe")
Buried facts within irrelevant information
Ambiguous references that make facts unclear

Few-Shot Examples

Example 1 - Perfect Clarity (Score: 5)

Fact to verify: "Safeway price is $5.89"

Response: "The Clamato Tomato Juice at Safeway costs $5.89."

Reasoning: Exact price stated clearly and unambiguously.

Example 2 - Minor Clarity Issue (Score: 4)

Fact to verify: "Safeway price is $5.89"

Response: "Safeway's price for the item is $5.89, which I found during my search."

Reasoning: Correct price, but "the item" is less specific than naming the product directly.

Example 3 - Vague Qualifier (Score: 2)

Fact to verify: "Safeway price is $5.89"

Response: "Safeway costs around $6 or so for the Clamato juice."

Reasoning: Approximate price given instead of the exact amount. "Around" and "or so" weaken factual clarity.

Example 4 - Contradictory Information (Score: 2)

Fact to verify: "Gus's price is $3.29"

Response: "Gus's Community Market has Gatorade for $3.29, though I also saw it listed as $3.49 in another section."

Reasoning: Correct price stated but immediately contradicted, creating confusion about the actual cost.

Example 5 - Hedging Language (Score: 3)

Fact to verify: "Safeway delivery time is 34 minutes"

Response: "I think Safeway's delivery time was something like 34 minutes, if I remember correctly."

Reasoning: Correct time mentioned, but hedging language ("I think", "something like", "if I remember") undermines confidence in the fact.

Example 6 - Missing Fact (Score: 1)

Fact to verify: "Safeway price is $5.89"

Response: "I compared prices at both stores and found significant differences."

Reasoning: No specific price information provided for Safeway.

Response Format

{
  "operator": "FACTUAL_VERIFICATION",
  "fact_scores": {
    "Safeway price": 5,
    "Gus price": 3,
    "Delivery time": 4
  },
  "error": null
}

Note

Do not calculate averages or overall scores. Provide only individual fact scores, as integers 1-5, in the response format above.
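
Because the judge reply is consumed by a backend, it may help to validate it before any scoring is done. The sketch below assumes the reply arrives as a JSON string in the response format shown above. `parse_fact_scores` is a hypothetical helper name, and treating a non-null `error` as a failure is an assumption.

```python
import json


def parse_fact_scores(raw: str) -> dict[str, int]:
    """Parse a FACTUAL_VERIFICATION judge reply and return its fact scores.

    Raises ValueError when the reply does not match the expected format.
    """
    reply = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(reply, dict):
        raise ValueError("judge reply must be a JSON object")
    if reply.get("operator") != "FACTUAL_VERIFICATION":
        raise ValueError("unexpected operator")
    if reply.get("error") is not None:
        raise ValueError(f"judge reported an error: {reply['error']}")

    scores = reply.get("fact_scores")
    if not isinstance(scores, dict) or not scores:
        raise ValueError("fact_scores must be a non-empty object")
    for label, score in scores.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            raise ValueError(f"score for {label!r} must be an integer")
        if not 1 <= score <= 5:
            raise ValueError(f"score for {label!r} must be between 1 and 5")
    return scores


raw_reply = (
    '{"operator": "FACTUAL_VERIFICATION", '
    '"fact_scores": {"Safeway price": 5, "Gus price": 3, "Delivery time": 4}, '
    '"error": null}'
)
print(parse_fact_scores(raw_reply))
```

Note that the keys in `fact_scores` are short labels, not the full fact strings from the assertion object, so a backend would still need some convention (for example, ordering) to match scores to weights.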

REASONING_QUALITY

Assertion Object

{
  "operator": "REASONING_QUALITY",
  "description": "Assess reasoning for comparing store prices and selecting best value.",
  "aspects": [
    {
      "aspect": "Reasoning correctly applies the rule (>10% difference = price-based recommendation).",
      "weight": 1
    },
    {
      "aspect": "Final recommendation aligns with reasoning and factual data.",
      "weight": 1.5
    }
  ],
  "pass_threshold_percent": 80
}
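
The first aspect names the >10% rule without spelling out the arithmetic. As a rough illustration, the sketch below assumes the difference is measured relative to the lower price, which is how the 79% figure in the few-shot examples is derived ((5.89-3.29)/3.29 × 100). The function names are hypothetical, and the guide does not say what the recommendation should be based on when the difference is 10% or less, so the sketch only reports whether the rule triggers.

```python
def percent_difference(price_a: float, price_b: float) -> float:
    """Price difference as a percentage of the lower of the two prices."""
    higher, lower = max(price_a, price_b), min(price_a, price_b)
    if lower <= 0:
        raise ValueError("prices must be positive")
    return (higher - lower) / lower * 100


def is_price_based(price_a: float, price_b: float, threshold_percent: float = 10.0) -> bool:
    """True when the difference exceeds the threshold (strictly greater)."""
    return percent_difference(price_a, price_b) > threshold_percent


# Prices from the factual assertion example: Safeway $5.89, Gus's $3.29.
print(round(percent_difference(5.89, 3.29)))
print(is_price_based(5.89, 3.29))
```

A judge does not run this code, but having the expected calculation written down makes it easier to write few-shot examples whose numbers are consistent with the rule.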

2. Reasoning Judge Template

You are a reasoning quality evaluator. Your task is to score how well the model demonstrates logical reasoning and draws valid conclusions.

TASK DESCRIPTION: {{description}}

REASONING ASPECTS TO EVALUATE:

{{#each aspects}}
{{@index}}. {{aspect}}
{{/each}}

MODEL RESPONSE: {{modelResponse}}

Scoring Instructions

Score	Description
5	Reasoning is flawless, clearly explained, and perfectly applied
4	Reasoning is sound with minor gaps or clarity issues
3	Reasoning is generally correct but has some logical flaws or unclear steps
2	Reasoning is partially correct but contains significant logical errors
1	Reasoning is absent, fundamentally flawed, or completely incorrect

Reasoning Clarity Penalty Rule

Reduce scores for statements that weaken or obscure the reasoning process, including:

Circular reasoning or logical fallacies
Skipped logical steps without explanation
Contradictory reasoning within the response
Unsupported conclusions that don't follow from premises
Vague or ambiguous reasoning that's hard to follow
Incorrect application of given rules or criteria

Few-Shot Examples

Example 1 - Perfect Reasoning (Score: 5)

Aspect: "Reasoning correctly applies the rule (>10% difference = price-based recommendation)"

Response: "The price difference is 79% ((5.89-3.29)/3.29 × 100). Since 79% > 10%, I should recommend based on price, not delivery time."

Reasoning: Clear calculation, correct rule identification, proper application of logic.

Example 2 - Sound with Minor Gap (Score: 4)

Aspect: "Final recommendation aligns with reasoning and factual data"

Response: "Since the price difference exceeds 10%, I recommend Gus's Community Market as it's the cheaper option at $3.29."

Reasoning: Correct recommendation and alignment, but doesn't explicitly show the percentage calculation.

Example 3 - Contradictory Reasoning (Score: 2)

Aspect: "Reasoning correctly applies the rule (>10% difference = price-based recommendation)"

Response: "The price difference is significant at 79%, which means I should choose based on price. However, since delivery speed also matters, I'll factor that in too."

Reasoning: Correctly calculates and identifies the rule, but then contradicts it by adding delivery consideration.

Example 4 - Fundamentally Flawed (Score: 1)

Aspect: "Final recommendation aligns with reasoning and factual data"

Response: "I recommend Safeway because it has better customer service and store layout."

Reasoning: Recommendation based on irrelevant factors, completely ignoring price and delivery time data and task requirements.

Response Format

{
  "operator": "REASONING_QUALITY",
  "aspect_scores": {
    "Reasoning correctly applies the rule": 5,
    "Final recommendation aligns with reasoning": 3
  },
  "error": null
}

Note

Do not calculate averages or overall scores. Provide only individual aspect scores as integers 1-5.

INFORMATION_PRECISION

Assertion Object

{
  "operator": "INFORMATION_PRECISION",
  "description": "Ensure the response includes only task relevant data and avoids hallucinations.",
  "expected_facts": [
    "Safeway price is $5.89",
    "Gus's price is $3.29",
    "Safeway delivery time is 34 minutes"
  ],
  "expected_reasonings": [
    "Reasoning to correctly apply the rule (>10% difference = price-based recommendation).",
    "Reasoning for final recommendation"
  ],
  "pass_threshold_percent": 80
}
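
Unlike the FACTUAL_VERIFICATION object, this one carries `expected_facts` and `expected_reasonings` as plain strings, which is why the precision template iterates with `{{this}}`. The sketch below approximates in Python how the two `{{#each}}` loops might expand. It assumes the templates are Handlebars, where `@index` is zero-based, so the numbering starts at 0. `numbered` and `render_precision_lists` are hypothetical names, and a real backend would presumably use a template engine instead.

```python
def numbered(items: list[str]) -> str:
    """Render items as 'index. text' lines, zero-based like Handlebars' @index."""
    return "\n".join(f"{index}. {item}" for index, item in enumerate(items))


def render_precision_lists(assertion: dict) -> str:
    """Expand the two list sections of the precision judge prompt."""
    return (
        "EXPECTED FACTS TO BE INCLUDED:\n\n"
        + numbered(assertion["expected_facts"])
        + "\n\nEXPECTED REASONING TO BE INCLUDED:\n\n"
        + numbered(assertion["expected_reasonings"])
    )


assertion = {
    "expected_facts": [
        "Safeway price is $5.89",
        "Gus's price is $3.29",
        "Safeway delivery time is 34 minutes",
    ],
    "expected_reasonings": [
        "Reasoning to correctly apply the rule (>10% difference = price-based recommendation).",
        "Reasoning for final recommendation",
    ],
}
print(render_precision_lists(assertion))
```

If one-based numbering is preferred in the prompt, the template would need a helper or a pre-computed index field, since `@index` alone starts from 0.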

3. Precision Judge Template

You are an information precision evaluator. Your task is to score how precisely and accurately the model response addresses the task without including irrelevant information or hallucinations.

TASK DESCRIPTION: {{description}}

EXPECTED FACTS TO BE INCLUDED:

{{#each expected_facts}}
{{@index}}. {{this}}
{{/each}}

EXPECTED REASONING TO BE INCLUDED:

{{#each expected_reasonings}}
{{@index}}. {{this}}
{{/each}}

MODEL RESPONSE: {{modelResponse}}

Scoring Instructions

Score	Description
5	Element present, accurate, and precisely stated without unnecessary elaboration
4	Element present and accurate with minor imprecision or slight irrelevant details
3	Element present but mixed with some irrelevant information or minor inaccuracies
2	Element partially present but obscured by significant irrelevant content or inaccuracies
1	Element missing, completely inaccurate, or so buried in irrelevant content as to be ineffective

Precision Penalty Rules

Reduce scores for any content that reduces information precision, including:

Hallucinated or fabricated information not supported by available data
Excessive irrelevant details that don't serve the task
Speculative statements presented as facts
Redundant or repetitive information
Off-topic elaborations or tangential discussions
Vague generalizations when specific information is required

Few-Shot Examples

Example 1 - Perfect Precision (Score: 5)

Expected: "Safeway price is $5.89"

Response: "Safeway's Clamato Tomato Juice costs $5.89."

Reasoning: Exact price stated clearly with the necessary product context only.

Example 2 - Minor Imprecision (Score: 4)

Expected: "Reasoning for final recommendation"

Response: "I recommend Gus's Community Market because it's $2.60 cheaper ($5.89 - $3.29), and with a 79% price difference exceeding the 10% threshold, price takes priority over the 5-minute delivery advantage Safeway offers."

Reasoning: Complete reasoning is present with slight additional calculation detail that's relevant but not essential.

Example 3 - Significant Irrelevant Content (Score: 2)

Expected: "Safeway delivery time is 34 minutes"

Response: "Safeway, a major grocery chain known for its wide selection of organic foods, premium deli section, and convenient pharmacy services, offers delivery in 34 minutes, which is competitive in today's fast-paced delivery market where customers expect quick service."

Reasoning: Correct delivery time is buried in excessive irrelevant information about store features and market context.

Example 4 - Hallucinated Information (Score: 1)

Expected: "Reasoning to correctly apply the rule (>10% difference = price-based recommendation)"

Response: "Based on customer reviews and store ratings, plus my analysis of seasonal pricing trends and regional market conditions, I believe price should be the determining factor."

Reasoning: Contains fabricated information (reviews, ratings, trends) not available in the task, completely missing the actual >10% rule.

Response Format

{
  "operator": "INFORMATION_PRECISION",
  "precision_scores": {
    "Safeway price": 5,
    "Gus price": 4,
    "Safeway delivery time": 3,
    "Rule application reasoning": 4,
    "Final recommendation reasoning": 5
  },
  "error": null
}

Note

Do not calculate averages or overall scores. Provide only individual precision scores as integers 1-5.