
Task and Verifier Testing Process

How to test a task end-to-end, interpret verification results, and raise bugs when issues are found.


Key Concepts

Before you begin testing, make sure you understand these terms.

Task Types

Tasks are categorized by how they are verified. For full details on each type, see the OTS Prompting Guidelines.

| Type | What the agent does | What the verifier checks |
| --- | --- | --- |
| NRD (Non Response Dependent) | Performs actions that change app state (e.g. add to cart, update a profile) | Whether the app state matches the expected outcome |
| RD (Response Dependent) | Analyzes data in the app and produces a text response (e.g. compare prices, recommend a product) | Whether the model response contains the correct information |
| Hybrid | Analyzes data and performs actions based on that analysis (e.g. find the cheapest item and add it to cart) | Both the app state and the model response |

Assertions

An assertion is a single check that the verifier runs against the app state or the model response. Each task has one or more assertions. When you run the verifier, every assertion is evaluated independently and reports pass or fail.

For example, a task that says "Add 2 items to cart" might have assertions like:

  • Cart contains item A
  • Cart contains item B
  • Cart total equals expected value
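The "evaluated independently" behavior can be sketched in a few lines. The `Assertion` structure, state fields, and expected total below are hypothetical illustrations, not the gym's actual verifier API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Assertion:
    name: str
    check: Callable[[dict], bool]  # runs against app state or model response

def run_verifier(state: dict, assertions: list[Assertion]) -> dict[str, bool]:
    # Every assertion is evaluated independently: one failure does not
    # stop the others from running, and each reports pass or fail.
    return {a.name: a.check(state) for a in assertions}

# Hypothetical app state after completing "Add 2 items to cart"
state = {"cart": ["item A", "item B"], "cart_total": 25.00}

assertions = [
    Assertion("Cart contains item A", lambda s: "item A" in s["cart"]),
    Assertion("Cart contains item B", lambda s: "item B" in s["cart"]),
    Assertion("Cart total equals expected value", lambda s: s["cart_total"] == 25.00),
]

results = run_verifier(state, assertions)
```

A run where only item A was added would report the first assertion as passing and the other two as failing, which is exactly the per-assertion detail you use when interpreting results later.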

Model Response

When agents are being trained, they interact with the gym and generate a response automatically. However, when manually testing, you (the human tester) need to play the role of the agent for tasks that require a response. This means:

  • For NRD tasks: You only perform the required actions in the app. No model response is needed; the verifier checks app state only.
  • For RD tasks: You analyze the content in the app as described in the prompt and provide a response that meets the prompt requirements perfectly.
  • For Hybrid tasks: You perform the required actions and provide a model response covering the analysis the prompt requested.

When is a model response needed?

Only RD and Hybrid tasks require a model response. NRD tasks are verified entirely through app state changes.
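The mapping from task type to what the verifier needs as input can be summarized as a small (hypothetical) helper, assuming the three type names above:

```python
def required_inputs(task_type: str) -> dict[str, bool]:
    """Which inputs a verifier run needs, by task type (NRD, RD, or Hybrid).

    NRD and Hybrid verifiers check app state; RD and Hybrid verifiers
    check the model response. Only NRD needs no response at all.
    """
    task_type = task_type.upper()
    return {
        "app_state": task_type in ("NRD", "HYBRID"),
        "model_response": task_type in ("RD", "HYBRID"),
    }
```

So `required_inputs("NRD")` reports that no model response is needed, while a Hybrid task needs both inputs.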


Testing Steps

1. Reset State

On /verify_raw, use Reset State. This clears the stored state and reloads, so you start from a clean run.

2. Pick a Task

On /verify_raw, choose the task you want to test.

3. Refresh the Landing Page

Reload the main app so it picks up the fresh state.

4. Perform the Task

In the app, follow the task instructions exactly as described in the task/prompt.

5. Provide a Model Response (only for RD and Hybrid tasks)

On /verify_raw, paste your response in the "Enter model response" field. Write the response as if you were an AI agent that just completed the task - include the facts, reasoning, or confirmations the prompt asks for.

6. Run the Verifier

On /verify_raw, find the task and run verification.

7. Interpret Results

  • All assertions pass: The task was performed correctly and the verifier's expected outcome matches the actual state/response.
  • Any assertion fails: Either the task was performed incorrectly (wrong steps, wrong data) or the verifier itself has an issue (expected state or assertions need to be fixed).
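Interpreting a run reduces to collecting the failed assertion names, since those are the starting point for deciding whether the run or the verifier is at fault. A minimal sketch, assuming results come back as a name-to-pass mapping (a hypothetical shape, not the /verify_raw output format):

```python
def interpret(results: dict[str, bool]) -> str:
    """Classify a verifier run from per-assertion pass/fail results."""
    failed = [name for name, passed in results.items() if not passed]
    if not failed:
        return "all assertions pass: task performed correctly"
    # A failure means either the task was performed incorrectly or the
    # verifier itself has an issue; the failed names tell you where to look.
    return "failed: " + ", ".join(failed)
```

For example, `interpret({"Cart contains item A": True, "Cart contains item B": False})` points you directly at the item-B assertion.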

Don't blindly trust the verifier

The goal of task and verifier testing is to detect issues in both tasks and verifiers. If you believe a certain assertion is failing incorrectly, raise a bug for it with the Task & Verifier label. Include screenshots showing the app state, the assertion that failed, and why you believe it should pass. See the Bug Raising Guide for how to create the ticket.


Positive Path Testing

A positive path (happy path) is a way to perform the task correctly. Most tasks have multiple valid positive paths.

For example, if a task says "Add items A, B, and C to the cart":

  • Adding in order A → B → C is one positive path
  • Adding in order C → A → B is another positive path
  • Adding them from different pages (search results vs. category page) is yet another

Test as many positive paths as you can

Different valid orderings, navigation routes, and interaction methods can reveal edge cases in the verifier. If a valid positive path causes an assertion to fail, that is a verifier bug.
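For ordering-based positive paths, the full set of valid orderings can be enumerated mechanically. A sketch for the hypothetical "Add items A, B, and C" task above:

```python
from itertools import permutations

# Hypothetical task: "Add items A, B, and C to the cart".
# Every ordering of the three add actions is a valid positive path,
# and each one should make all assertions pass.
items = ["A", "B", "C"]

positive_paths = [list(order) for order in permutations(items)]
# 3! = 6 orderings; combine each with different navigation routes
# (search results vs. category page) for broader coverage.
```

Six orderings is small enough to test exhaustively; for larger tasks, sample a few orderings plus the most unusual navigation route you can find.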


Negative Path Testing

A negative path is performing the task incorrectly on purpose. The verifier should fail for every negative path. If it passes, the verifier is not catching the mistake - that is a bug.

How to Think About Negative Paths

There are infinitely many possible negative paths, so focus on the ones that are close to the correct task - the kind of realistic mistake an agent could make.

| Strategy | Example |
| --- | --- |
| Wrong target | Sending an in-person message instead of an email |
| Wrong data | Adding the wrong product to the cart |
| Partial completion | Adding 2 of 3 required items |
| Wrong order of operations | Submitting a form before filling a required field |
| Similar but incorrect action | Editing a record instead of creating a new one |
| Off-by-one values | Setting quantity to 2 instead of 3 |

Negative Testing Checklist

  1. Swap the action type - Use a similar but wrong action (e.g. delete instead of archive)
  2. Use wrong data - Enter values that are close but incorrect (e.g. wrong contact, wrong amount)
  3. Skip a step - Omit one part of a multi-step task
  4. Perform extra actions - Do more than the task asks and check if the verifier still passes
  5. Target the wrong entity - Apply the action to a different record, page, or object
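Several checklist items can be generated systematically from the correct outcome. A sketch for a hypothetical cart task, covering "skip a step", "use wrong data", and "perform extra actions" (the helper and item names are illustrative, not gym API):

```python
def negative_variants(required_items: list[str], wrong_item: str) -> list[list[str]]:
    """Generate 'close but wrong' carts from the correct one.

    The verifier should FAIL for every cart this returns.
    """
    variants = []
    for i in range(len(required_items)):
        # Partial completion: omit one required item (skip a step).
        variants.append(required_items[:i] + required_items[i + 1:])
        # Wrong data: swap one required item for a near-miss.
        variants.append(required_items[:i] + [wrong_item] + required_items[i + 1:])
    # Extra actions: everything required plus something unasked-for.
    variants.append(required_items + [wrong_item])
    return variants

# Correct cart is A, B, C; D is a plausible wrong item.
carts = negative_variants(["A", "B", "C"], wrong_item="D")
```

Run the verifier once per variant; any variant that passes is a Task & Verifier bug.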

Rule of thumb

Think of a real mistake an agent could make when performing the original task. If the verifier does not catch that mistake, raise it as a Task & Verifier bug.


Verifying Task Tags

QAs should verify that tasks on the /verify_raw page are tagged with the correct type (RD, NRD, or Hybrid).

How to Check Tag Correctness

| Check | Expected behavior |
| --- | --- |
| RD and Hybrid tasks request model response as input | The /verify_raw page should show a model response input field for these tasks. The prompt should ask the agent to provide some information (e.g. a recommendation, comparison, or summary). |
| NRD and Hybrid tasks should fail when run with initial state | If you run the verifier without performing any actions (fresh state), NRD and Hybrid tasks must fail. If they pass on initial state, the verifier is not actually checking state changes; raise a Task & Verifier bug. |
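The second check is mechanical enough to express as a rule. A sketch, assuming you record each task's tagged type and whether its verifier passed on a fresh, untouched state (the helper is hypothetical, not gym API):

```python
def check_initial_state_behavior(task_type: str, passed_on_fresh_state: bool) -> str:
    """Flag verifiers that pass on a fresh state when they should not.

    NRD and Hybrid verifiers check state changes, so they must fail
    when the verifier runs before any actions have been performed.
    """
    if task_type.upper() in ("NRD", "HYBRID") and passed_on_fresh_state:
        return "BUG: verifier passed on initial state; raise a Task & Verifier ticket"
    return "ok"
```

An RD task passing on a fresh state is not flagged by this rule, since its verifier checks the model response rather than state changes.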

Incorrectly tagged tasks

If a task is tagged as the wrong type (e.g. an NRD task tagged as RD), raise a Task & Verifier bug with the correct tag and your reasoning.


Raising Bugs During Testing

When testing reveals an issue, raise it in the correct GitHub project with the appropriate label. See the Bug Raising Guide for full instructions.

| What you found | Label to use | Example |
| --- | --- | --- |
| A UI problem in the app (broken layout, wrong styling, missing element) | App | Button overlaps text on the settings page |
| Incorrect or missing data in the app | Data | Contact list shows wrong phone number |
| An assertion that fails when it should pass, passes when it should fail, or a task with wrong tags | Task & Verifier | Verifier passes even when wrong item is added to cart |

When in doubt

Refer to the Bug Raising Guide for details on required fields, screenshots, and how to set up the ticket correctly.