Task and Verifier Testing Process
How to test a task end-to-end, interpret verification results, and raise bugs when issues are found.
Key Concepts
Before you begin testing, make sure you understand these terms.
Task Types
Tasks are categorized by how they are verified. For full details on each type, see the OTS Prompting Guidelines.
| Type | What the agent does | What the verifier checks |
|---|---|---|
| NRD (Non Response Dependent) | Performs actions that change app state (e.g. add to cart, update a profile) | Whether the app state matches the expected outcome |
| RD (Response Dependent) | Analyzes data in the app and produces a text response (e.g. compare prices, recommend a product) | Whether the model response contains the correct information |
| Hybrid | Analyzes data and performs actions based on that analysis (e.g. find the cheapest item and add it to cart) | Both the app state and the model response |
Assertions
An assertion is a single check that the verifier runs against the app state or the model response. Each task has one or more assertions. When you run the verifier, every assertion is evaluated independently and reports pass or fail.
For example, a task that says "Add 2 items to cart" might have assertions like:
- Cart contains item A
- Cart contains item B
- Cart total equals expected value
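A minimal sketch of how independent assertions behave, using the cart example above. The state layout (`cart`, `total`), the expected total of 15.5, and the `run_assertions` helper are assumptions made for illustration; they are not the gym's actual schema or verifier code.

```python
from typing import Callable

# Hypothetical assertions for the "Add 2 items to cart" example.
# Each one is a single check against the app state.
ASSERTIONS: dict[str, Callable[[dict], bool]] = {
    "Cart contains item A": lambda s: "A" in s["cart"],
    "Cart contains item B": lambda s: "B" in s["cart"],
    "Cart total equals expected value": lambda s: abs(s["total"] - 15.5) < 1e-9,
}

def run_assertions(state: dict, assertions: dict[str, Callable[[dict], bool]]) -> dict[str, bool]:
    """Evaluate every assertion independently; one failure does not stop the rest."""
    results = {}
    for name, check in assertions.items():
        try:
            results[name] = bool(check(state))
        except Exception:
            # A check that crashes is reported as a failure instead of aborting the run.
            results[name] = False
    return results

# Assumed state after the tester added only item A (price 10.0).
state = {"cart": ["A"], "total": 10.0}
results = run_assertions(state, ASSERTIONS)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

Because each assertion reports on its own, a partially completed task shows exactly which checks failed rather than a single overall failure.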
Model Response
When agents are being trained, they interact with the gym and generate a response automatically. However, when manually testing, you (the human tester) need to play the role of the agent for tasks that require a response. This means:
- For NRD tasks: You only perform the required actions in the app. No model response is needed; the verifier checks app state only.
- For RD tasks: You analyze the content in the app as described in the prompt and provide a response that meets the prompt requirements perfectly.
- For Hybrid tasks: You perform the required actions and provide a model response covering the analysis the prompt requested.
When is a model response needed?
Only RD and Hybrid tasks require a model response. NRD tasks are verified entirely through app state changes.
Testing Steps
1. Reset State
On /verify_raw, use Reset State. This clears the stored state and reloads, so you start from a clean run.
2. Pick a Task
On /verify_raw, choose the task you want to test.
3. Refresh the Landing Page
Reload the main app. The app loads with the fresh state.
4. Perform the Task
In the app, follow the task instructions exactly as described in the task/prompt.
5. Provide a Model Response (only for RD and Hybrid tasks)
On /verify_raw, paste your response in the "Enter model response" field. Write the response as if you were an AI agent that just completed the task - include the facts, reasoning, or confirmations the prompt asks for.
6. Run the Verifier
On /verify_raw, find the task and run verification.
7. Interpret Results
- All assertions pass: The task was performed correctly and the verifier's expected outcome matches the result.
- Any assertion fails: Either the task was performed incorrectly (wrong steps, wrong data) or the verifier itself has an issue (expected state or assertions need to be fixed).
Don't blindly trust the verifier
The goal of task and verifier testing is to detect issues in both tasks and verifiers. If you believe a certain assertion is failing incorrectly, raise a bug for it with the Task & Verifier label. Include screenshots showing the app state, the assertion that failed, and why you believe it should pass. See the Bug Raising Guide for how to create the ticket.
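The interpretation step combines two inputs: what the verifier reported and whether you, the tester, are confident the task was performed correctly. The sketch below spells out the four combinations. The `triage` function and its messages are hypothetical and only illustrate the reasoning; they are not part of the /verify_raw page.

```python
def triage(results: dict[str, bool], performed_correctly: bool) -> str:
    """Suggest a next step from the assertion results and the tester's own judgement."""
    failed = [name for name, passed in results.items() if not passed]
    if performed_correctly and not failed:
        return "OK: correct run, all assertions pass"
    if performed_correctly and failed:
        # An assertion fails when it should pass.
        return "Possible Task & Verifier bug: correct run but failed " + ", ".join(failed)
    if not failed:
        # The verifier passes when it should fail.
        return "Possible Task & Verifier bug: incorrect run but all assertions pass"
    return "Expected failure: incorrect run, verifier caught it"

print(triage(
    {"Cart contains item A": True, "Cart total equals expected value": False},
    performed_correctly=True,
))
```

Before raising a bug for the second case, it is worth resetting the state and repeating the task once to rule out a tester mistake.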
Positive Path Testing
A positive path (happy path) is a way to perform the task correctly. Most tasks have multiple valid positive paths.
For example, if a task says "Add items A, B, and C to the cart":
- Adding in order A → B → C is one positive path
- Adding in order C → A → B is another positive path
- Adding them from different pages (search results vs. category page) is yet another
Test as many positive paths as you can
Different valid orderings, navigation routes, and interaction methods can reveal edge cases in the verifier. If a valid positive path causes an assertion to fail, that is a verifier bug.
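To see why ordering matters, the sketch below enumerates every order in which items A, B, and C can be added and runs two stand-in verifiers against each. Both verifiers (`verify_cart` and `order_sensitive_verify`) and the `perform_task` helper are hypothetical; in real testing you would perform each ordering by hand in the app and run the verifier on /verify_raw.

```python
from itertools import permutations

REQUIRED = ("A", "B", "C")

def perform_task(order):
    """Simulate adding items to an empty cart in the given order."""
    cart = []
    for item in order:
        cart.append(item)
    return cart

def verify_cart(cart):
    """Stand-in verifier that accepts the required items in any order."""
    return sorted(cart) == sorted(REQUIRED)

def order_sensitive_verify(cart):
    """Stand-in buggy verifier that only accepts the order A, B, C."""
    return cart == list(REQUIRED)

def failing_paths(verifier):
    """Return every valid ordering that the verifier wrongly rejects."""
    return [order for order in permutations(REQUIRED) if not verifier(perform_task(order))]

print(len(failing_paths(verify_cart)))             # 0
print(len(failing_paths(order_sensitive_verify)))  # 5
```

Testing only the A → B → C ordering would never expose the second verifier's bug, which is why several positive paths are worth trying.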
Negative Path Testing
A negative path is a way of performing the task incorrectly on purpose. The verifier should fail for every negative path. If it passes, the verifier is not catching the mistake, and that is a bug.
How to Think About Negative Paths
There are infinitely many negative paths, so focus on the ones that are close to the correct task: the kind of realistic mistake an agent could make.
| Strategy | Example |
|---|---|
| Wrong target | Sending an in-person message instead of an email |
| Wrong data | Adding the wrong product to the cart |
| Partial completion | Adding 2 of 3 required items |
| Wrong order of operations | Submitting a form before filling a required field |
| Similar but incorrect action | Editing a record instead of creating a new one |
| Off-by-one values | Setting quantity to 2 instead of 3 |
Negative Testing Checklist
- Swap the action type - Use a similar but wrong action (e.g. delete instead of archive)
- Use wrong data - Enter values that are close but incorrect (e.g. wrong contact, wrong amount)
- Skip a step - Omit one part of a multi-step task
- Perform extra actions - Do more than the task asks and check if the verifier still passes
- Target the wrong entity - Apply the action to a different record, page, or object
Rule of thumb
Think of a real mistake an agent could make when performing the original task. If the verifier does not catch that mistake, raise it as a Task & Verifier bug.
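The near-miss idea can be sketched by deriving incorrect cart states from the expected one and checking that a verifier rejects all of them. Everything here is illustrative: the expected cart, the `negative_variants` generator, and the two stand-in verifiers are assumptions, not the gym's real data or verifier logic.

```python
# Assumed expected outcome: item -> quantity.
EXPECTED = {"A": 1, "B": 3}

def negative_variants(expected):
    """Build cart states that are close to the expected one but wrong."""
    variants = {}
    # Partial completion: leave out one required item.
    for item in expected:
        variants[f"missing {item}"] = {k: v for k, v in expected.items() if k != item}
    # Off-by-one values: quantity one lower or one higher (never below 1).
    for item, qty in expected.items():
        for wrong_qty in (qty - 1, qty + 1):
            if wrong_qty >= 1:
                variants[f"{item} quantity {wrong_qty}"] = {**expected, item: wrong_qty}
    # Wrong data: a similar but incorrect product in place of the right one.
    for item, qty in expected.items():
        swapped = {k: v for k, v in expected.items() if k != item}
        swapped[f"{item}-lookalike"] = qty
        variants[f"{item} replaced by lookalike"] = swapped
    return variants

def strict_verify(cart):
    """Stand-in verifier that requires an exact match."""
    return cart == EXPECTED

def lenient_verify(cart):
    """Stand-in buggy verifier that only checks the items are present."""
    return all(item in cart for item in EXPECTED)

def escaped(verifier):
    """Names of negative paths the verifier wrongly accepts."""
    return [name for name, cart in negative_variants(EXPECTED).items() if verifier(cart)]

print(escaped(strict_verify))   # []
print(escaped(lenient_verify))  # ['A quantity 2', 'B quantity 2', 'B quantity 4']
```

Each name returned by `escaped` corresponds to a mistake the verifier missed, which is the situation the rule of thumb says to report.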
Verifying Task Tags
QAs should verify that tasks on the /verify_raw page are tagged with the correct type (RD, NRD, or Hybrid).
How to Check Tag Correctness
| Check | Expected behavior |
|---|---|
| RD and Hybrid tasks request a model response as input | The /verify_raw page should show a model response input field for these tasks. The prompt should ask the agent to provide some information (e.g. a recommendation, comparison, or summary). |
| NRD and Hybrid tasks fail when run on the initial state | If you run the verifier without performing any actions (fresh state), NRD and Hybrid tasks must fail. If they pass on the initial state, the verifier is not actually checking state changes; raise a Task & Verifier bug. |
Incorrectly tagged tasks
If a task is tagged as the wrong type (e.g. an NRD task tagged as RD), raise a Task & Verifier bug with the correct tag and your reasoning.
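The two tag checks can be written down as a small rule set. The sketch below is hypothetical: `TaskObservation` simply records what a tester sees on /verify_raw, and `tag_problems` applies the two rules from the table. Neither is an existing tool.

```python
from dataclasses import dataclass

@dataclass
class TaskObservation:
    tag: str                       # "RD", "NRD", or "Hybrid"
    shows_response_field: bool     # is the model response input shown for this task?
    passes_on_initial_state: bool  # does the verifier pass with no actions performed?

def tag_problems(obs: TaskObservation) -> list[str]:
    """Return the tag checks that this observation violates."""
    problems = []
    # RD and Hybrid tasks should request a model response as input.
    if obs.tag in {"RD", "Hybrid"} and not obs.shows_response_field:
        problems.append(f"{obs.tag} task does not request a model response")
    # NRD and Hybrid tasks must fail when nothing has been done in the app.
    if obs.tag in {"NRD", "Hybrid"} and obs.passes_on_initial_state:
        problems.append(f"{obs.tag} task passes on the initial state")
    return problems

print(tag_problems(TaskObservation("NRD", shows_response_field=False, passes_on_initial_state=True)))
print(tag_problems(TaskObservation("Hybrid", shows_response_field=True, passes_on_initial_state=False)))
```

Any non-empty result is a candidate for a Task & Verifier bug; the tester still decides whether the tag or the verifier is at fault.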
Raising Bugs During Testing
When testing reveals an issue, raise it in the correct GitHub project with the appropriate label. See the Bug Raising Guide for full instructions.
| What you found | Label to use | Example |
|---|---|---|
| A UI problem in the app (broken layout, wrong styling, missing element) | App | Button overlaps text on the settings page |
| Incorrect or missing data in the app | Data | Contact list shows wrong phone number |
| An assertion that fails when it should pass, passes when it should fail, or a task with wrong tags | Task & Verifier | Verifier passes even when wrong item is added to cart |
When in doubt
Refer to the Bug Raising Guide for details on required fields, screenshots, and how to set up the ticket correctly.