OTS Gym Specification
Overview
This document defines the required scope, architecture, workflow, and quality standards for OTS gyms. It is the single source of truth for all teams contributing to UI, data, tasks, verifiers, harness execution, and deployment.
Deliverables Checklist
| Deliverable | Notes and instructions |
|---|---|
| UI Scope document | Upload to rl_temp and add the link in OTS-2 / Sales Tracker [RLGymUI]. Format requirements are in the UI section. |
| Gym repo with Dockerfile | Dockerfile must meet the Docker requirements in this spec. |
| Harness reports (3) | Required: 1-iteration OpenAI (may contain bugs), 8-iteration OpenAI (may contain bugs), 8-iteration all three models (must be bug-free). Upload all reports to rl_temp and add links in OTS-2 / Sales Tracker [RLGymUI]. |
| Convert Excel sheet to Google Sheets | Ensure the shared source of truth is a Google Sheet. |
| Lite version task ids | Update column F in OTS-2 / Sales Tracker [RLGymUI] with 5 task ids, comma-separated with no spaces. Composition rules are in Tasks. |
| Docker images (tar files) | Not needed; generated by script. |
Architecture
- Single port: UI and backend are served from port `3000`. Backend APIs are under `/api/v1`.
- Stateless backend: All state is managed in `localStorage`. Any backend data must be treated as immutable (no mutation of saved data). OTS gyms do not support the `runid` concept. Verification relies on the `localStorage.json` snapshot.
- Time-sensitive features: Ensure time-dependent UI or data produces a consistent view for harness runs and verification.
UI Requirements
Security patching is mandatory
Certain React and Next.js versions have known vulnerabilities. Patch the gym using the approved guide before deployment. A previous Coolify server compromise was caused by this.
- Scope document: Must follow the same format as Gmail - User Journeys. Use status values `pending`, `partially`, `done`. Do not use sprints.
- Coverage: UI should cover at least 20–30% of the original gym UI features.
- QA: UI must be thoroughly tested by dedicated QAs and be bug-free.
- Time-sensitive UI: For features like restaurant open/close status (e.g., dashdoor), ensure a consistent harness-visible state or verification guidance.
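
One way to get a consistent harness-visible state is to derive "now" from a pinned reference instant instead of the wall clock. The sketch below is illustrative Python rather than gym code; `FIXED_NOW`, `is_open`, and the opening hours used with it are hypothetical names and values, and a gym would apply the same idea in its own stack.

```python
from datetime import datetime, time

# Hypothetical pinned instant (an assumption for illustration). A gym would
# choose its own value and inject it wherever the wall clock would be read.
FIXED_NOW = datetime(2025, 1, 15, 12, 30)


def is_open(opens: time, closes: time, now: datetime = FIXED_NOW) -> bool:
    """Return True if `now` falls within [opens, closes), including overnight hours."""
    current = now.time()
    if opens <= closes:
        return opens <= current < closes
    # Overnight window such as 22:00 to 02:00 wraps past midnight.
    return current >= opens or current < closes
```

Because every caller defaults to the same pinned instant, a restaurant that is open during one harness run is open during every run, which keeps verification results comparable.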
Data Requirements
- Data must be realistic and QA-validated (no duplicates such as two tickets with the same name).
- Data can live in a dedicated DB or in code. For large datasets, fetch via backend APIs/SQL to avoid long startup delays from full seed transfer.
Tasks and Assertions
Task volume and composition
- Task count: Minimum 30, maximum 40.
- Response-dependent tasks: At least 10.
- Lite version: Provide 5 task ids (1 easy, 1 medium, 3 difficult) based on harness results, not judgment. At least 2 must be response-dependent tasks.
- Optional archiving: Extra tasks can be moved to `archived-assertions.json` so they do not appear in `verify_raw`.
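
The volume and lite-version rules above can be checked mechanically before a tracker update. The following is a minimal sketch under stated assumptions: the spec does not say how response-dependent tasks are marked or how harness results map to difficulty labels, so both are passed in by the caller, and `check_task_set`, `check_lite_cell`, and the label strings are hypothetical.

```python
from collections import Counter

# Lite composition from this spec: 1 easy, 1 medium, 3 difficult.
# The label strings themselves are an assumption for illustration.
LITE_COMPOSITION = {"easy": 1, "medium": 1, "difficult": 3}


def check_task_set(task_ids, response_dependent_ids):
    """Return violations of the task count and response-dependent minimums."""
    problems = []
    if not 30 <= len(task_ids) <= 40:
        problems.append(f"task count is {len(task_ids)}, expected 30 to 40")
    dependent = sum(1 for task_id in task_ids if task_id in response_dependent_ids)
    if dependent < 10:
        problems.append(f"only {dependent} response-dependent tasks, expected at least 10")
    return problems


def check_lite_cell(cell, difficulty_by_id, response_dependent_ids):
    """Validate the comma-separated value destined for column F."""
    problems = []
    if " " in cell:
        problems.append("lite task ids must be comma-separated with no spaces")
    ids = [part.strip() for part in cell.split(",") if part.strip()]
    if len(ids) != 5:
        problems.append(f"expected 5 lite task ids, found {len(ids)}")
    # difficulty_by_id is assumed to be derived from harness results by the caller.
    levels = Counter(difficulty_by_id.get(task_id) for task_id in ids)
    for level, expected in LITE_COMPOSITION.items():
        if levels[level] != expected:
            problems.append(f"expected {expected} {level} task(s), found {levels[level]}")
    dependent = sum(1 for task_id in ids if task_id in response_dependent_ids)
    if dependent < 2:
        problems.append(f"only {dependent} response-dependent lite tasks, expected at least 2")
    return problems
```

Both functions return a list of human-readable problems, so an empty list means the selection satisfies the rules checked here.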
Task format
Each task should include only the fields below. Format reference: https://github.com/turing-rlgym/deskzen-1/blob/ots/data/assertions.json
| Field | Type | Notes |
|---|---|---|
| `task_id` | string | Unique id |
| `prompt` | string | Realistic prompt covering gym features |
| `capabilities` | array of strings | Use the same capability names as the UI scope document |
| `assertions` | array | JSONPath-based assertions (see Verifiers) |
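
The field table above translates directly into a structural check. This is a minimal sketch that assumes the tasks have already been loaded into a list of dicts; `validate_tasks` is a hypothetical helper, and it does not inspect the internal shape of each assertion, which is defined by the referenced format.

```python
ALLOWED_FIELDS = {"task_id", "prompt", "capabilities", "assertions"}


def validate_tasks(tasks):
    """Return a list of problems found in a list of task dicts."""
    problems = []
    seen_ids = set()
    for index, task in enumerate(tasks):
        label = task.get("task_id", f"task #{index}")
        extra = sorted(set(task) - ALLOWED_FIELDS)
        missing = sorted(ALLOWED_FIELDS - set(task))
        if extra:
            problems.append(f"{label}: unexpected fields {extra}")
        if missing:
            problems.append(f"{label}: missing fields {missing}")
        task_id = task.get("task_id")
        if isinstance(task_id, str):
            # Ids must be unique across the whole task list.
            if task_id in seen_ids:
                problems.append(f"{label}: duplicate task_id")
            seen_ids.add(task_id)
        elif "task_id" in task:
            problems.append(f"{label}: task_id must be a string")
        if "prompt" in task and not isinstance(task["prompt"], str):
            problems.append(f"{label}: prompt must be a string")
        capabilities = task.get("capabilities")
        if "capabilities" in task and not (
            isinstance(capabilities, list)
            and all(isinstance(item, str) for item in capabilities)
        ):
            problems.append(f"{label}: capabilities must be an array of strings")
        if "assertions" in task and not isinstance(task["assertions"], list):
            problems.append(f"{label}: assertions must be an array")
    return problems
```

A task that carries an extra key such as `difficulty` is reported as having an unexpected field, which matches the rule that tasks include only the four fields listed.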
Prompt guidance
- Prompts must be realistic and cover all available features.
- Difficulty field is not required (difficulty is computed from harness results).
Assertions guidance
- Include negative assertions to verify unchanged data (for example, check array lengths for entities that should not change).
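
To illustrate the idea behind a negative assertion, the sketch below compares array lengths between the seed state and the post-run snapshot for entities a task should not touch. It is plain Python rather than the JSONPath assertion format used by the gyms, and `unchanged_length_violations` and the keys in the usage example are hypothetical.

```python
def unchanged_length_violations(seed_state, snapshot, untouched_keys):
    """Report entities whose array length differs between seed and snapshot."""
    violations = []
    for key in untouched_keys:
        before = len(seed_state.get(key, []))
        after = len(snapshot.get(key, []))
        if before != after:
            violations.append(f"{key}: expected {before} items, found {after}")
    return violations


# Hypothetical example: the task should add one order and leave restaurants alone.
seed = {"orders": [{"id": 1}], "restaurants": [{"id": "r1"}, {"id": "r2"}]}
after_run = {"orders": [{"id": 1}, {"id": 2}], "restaurants": [{"id": "r1"}, {"id": "r2"}]}
restaurant_violations = unchanged_length_violations(seed, after_run, ["restaurants"])
```

In an actual task, the same intent would be expressed as an assertion on the relevant array length in the `localStorage.json` snapshot.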
Verifiers
Reference implementation
Dashdoor is the canonical integration reference: https://github.com/turing-rlgym/doordash-1/tree/ots
- Integrate `cua-gym-utils` for:
  - `get_actual_state` API
  - `get_expected_state` API
  - `verify_raw` page
- Implement the `/localStorage` endpoint so the harness can download browser state. Example: https://app.mira.rlgym.turing.com/localStorage
  Reference: https://github.com/turing-rlgym/mira/tree/main/app/localStorage
- Use JSONPath-based assertions (same format as in: https://github.com/turing-rlgym/doordash-1/blob/ots/data/assertions.json).
- Manually test each task and verify the checker passes. Run negative tests to ensure failures are detected (for example, add two items when only one is required).
- For response-dependent task verification, follow the Verification Guide.
GitHub Packages auth for cua-gym-utils
- Add `.npmrc` with `@turing-rlgym:registry=https://npm.pkg.github.com`
- Docker build must pass token: `docker build --build-arg GITHUB_TOKEN=$GITHUB_TOKEN`
- Use `.npmrc` only during build and remove it after `npm install` (do not commit secrets)
Git Workflow
All OTS changes must follow this workflow:
flowchart LR
featureBranch["[feature/bug/fix branches]"]
otsDev["[ots_dev] (PR review)"]
otsRelease["[ots_release] (deployment)"]
otsSync["[ots] (sync)"]
otsSync --> otsDev
featureBranch --> otsDev
otsDev --> otsRelease
otsRelease --> otsSync
- `ots_dev`: PR creation and review
- `ots_release`: deployment after PR approval
- `ots`: sync branch (kept aligned with `ots_dev` and `ots_release`)
CI/CD and Automation
- CI/CD is handled via GCP.
Docker
- See Architecture for the single-port requirement and `/api/v1` routing.
- Any heavy data loading should occur during image creation time to avoid long cold starts.
Security requirements
- Run containers as a non-root user.
- Use `.npmrc` only during build and remove it after install.
- Never commit secrets to git.
- Dockerfile reference: doordash-1/Dockerfile
Harness
- Run harness for all tasks (minimum 30 tasks).
- Step 1: 1-iteration OpenAI only. Analyze the report and fix any gym or harness issues.
- Step 2: 8-iteration OpenAI only. Analyze deeply, especially failures across all 8 iterations.
- Step 3: 8-iteration all three models. This report can be reviewed lightly, but investigate any task failing all 8 iterations across all models.
- Watch for Playwright crashes. These indicate UI elements that are human-usable but automation-unstable.
- Confirm failures are not caused by time-sensitive features.
- More than 75% of the prompts should be model-breaking (the percentage is shown in the harness report).
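
When reviewing the Step 3 report, the tasks that fail every iteration for every model are the ones that need investigation. The sketch below shows one way to find them. The harness report format is not described in this spec, so the input shape (`{model: {task_id: [pass/fail per iteration]}}`), the function names, and the model labels in the usage example are all assumptions.

```python
def pass_rates(results):
    """Map each task id to its pass fraction over all models and iterations."""
    totals = {}
    for per_task in results.values():
        for task_id, runs in per_task.items():
            passed, count = totals.get(task_id, (0, 0))
            # sum() counts True values because bool is a subclass of int.
            totals[task_id] = (passed + sum(runs), count + len(runs))
    return {task_id: passed / count for task_id, (passed, count) in totals.items() if count}


def failing_everywhere(results):
    """Return task ids that fail every iteration for every model."""
    return sorted(task_id for task_id, rate in pass_rates(results).items() if rate == 0)


# Hypothetical two-iteration results for two placeholder models.
example = {
    "model_a": {"t1": [True, False], "t2": [False, False]},
    "model_b": {"t1": [False, True], "t2": [False, False]},
}
suspects = failing_everywhere(example)
```

A task that appears in `suspects` may be genuinely hard, but it may also point to a broken verifier, a Playwright-unstable UI element, or a time-sensitive feature, so it is worth checking manually.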
Testing requirements
- Manual testing of each task before harness runs.
- Negative testing to verify expected failures.
- QA sign-off required before `ots_release` deployment.
Issue Management
- Log any data, UI, or harness issues in OTS Testing.
- Coordinate harness issues with Muhammad Raveed Ahmad from the harness team.
- UI and data issues are owned by the gym team.
- Each gym must have a GitHub Projects board (created via the repo Projects section) for issue tracking.
- One board per gym.
- Required columns: Backlog, Ready, In progress, In review, Done.
Team Structure (For Leads)
- Each gym must have a dedicated team with clear ownership.
- Leads managing multiple gyms must ensure each gym has its own team.
- Gym teams should sync daily to keep UI, data, prompts, and verifiers aligned.
- Horizontals (prompts, verifiers, UI, data) can sync less frequently (about twice a week).
- Focus teams on shipping a complete gym with frequent releases.