OTS Gym Specification

Overview

This document defines the required scope, architecture, workflow, and quality standards for OTS gyms. It is the single source of truth for all teams contributing to UI, data, tasks, verifiers, harness execution, and deployment.

Deliverables Checklist

Deliverable	Notes and instructions
UI Scope document	Upload to rl_temp and add the link in OTS-2 / Sales Tracker [RLGymUI]. Format requirements are in the UI section.
Gym repo with Dockerfile	Dockerfile must meet the Docker requirements in this spec.
Harness reports (3)	Required: 1-iteration OpenAI (may contain bugs), 8-iteration OpenAI (may contain bugs), 8-iteration all three models (must be bug-free). Upload all reports to rl_temp and add links in OTS-2 / Sales Tracker [RLGymUI].
Convert Excel sheet to Google Sheets	Ensure the shared source of truth is a Google Sheet.
Lite version task ids	Update column F in OTS-2 / Sales Tracker [RLGymUI] with 5 task ids, comma-separated with no spaces. Composition rules are in Tasks.
Docker images (tar files)	Not needed; generated by script.

Architecture

Single port: UI and backend are served from port 3000. Backend APIs are under /api/v1.
Stateless backend: All mutable state lives in the browser's localStorage. Backend data is immutable and must never be modified once saved. OTS gyms do not support the runid concept. Verification relies on the localStorage.json snapshot.
Time-sensitive features: Ensure time-dependent UI or data produces a consistent view for harness runs and verification.

UI Requirements

Security patching is mandatory

Certain React and Next.js versions have known vulnerabilities. Patch the gym using the approved guide before deployment. A previous Coolify server compromise was caused by an unpatched vulnerability of this kind.

Scope document: Must follow the same format as Gmail - User Journeys. Use status values pending, partially, done. Do not use sprints.
Coverage: The UI should cover at least 20% of the original gym's UI features, with 20–30% as the expected range.
QA: UI must be thoroughly tested by dedicated QAs and be bug-free.
Time-sensitive UI: For features like restaurant open/close status (e.g., Dashdoor), ensure the state visible to the harness is consistent across runs, or provide verification guidance that accounts for the time dependence.
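
One way to keep a time-dependent status consistent is to compute it from an injected clock value instead of the wall clock. This spec does not prescribe a mechanism, so the following is only a minimal sketch of the idea; `is_open`, `FROZEN_NOW`, and the opening hours are hypothetical names and values.

```python
from datetime import datetime, time


def is_open(opens, closes, now):
    """Return True if `now` falls within [opens, closes) on the same day."""
    return opens <= now.time() < closes


# Hypothetical pinned clock shared by harness runs and verification.
FROZEN_NOW = datetime(2025, 1, 15, 12, 0)

# The result depends only on the pinned clock, not on when the run happens.
print(is_open(time(9, 0), time(21, 0), FROZEN_NOW))
print(is_open(time(18, 0), time(23, 0), FROZEN_NOW))
```

Because the clock is passed in, every run at any real time of day sees the same open/closed status.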

Data Requirements

Data must be realistic and QA-validated, with no duplicates (for example, two tickets with the same name).
Data can live in a dedicated DB or in code. For large datasets, fetch data through backend APIs or SQL queries so that startup is not delayed by transferring the full seed.
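
The no-duplicates rule can be checked mechanically before QA review. The sketch below assumes seed records are dictionaries with a `name` field; the field name and the sample tickets are hypothetical.

```python
from collections import Counter


def find_duplicates(records, key="name"):
    """Return the sorted values of `key` that occur more than once."""
    counts = Counter(record[key] for record in records)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical seed data: two tickets share a name.
tickets = [
    {"id": 1, "name": "Login page error"},
    {"id": 2, "name": "Refund request"},
    {"id": 3, "name": "Login page error"},
]
print(find_duplicates(tickets))
```

An empty result means no two records share the checked field.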

Tasks and Assertions

Task volume and composition

Task count: Minimum 30, maximum 40.
Response-dependent tasks: At least 10.
Lite version: Provide 5 task ids (1 easy, 1 medium, 3 difficult), selected from harness results rather than by judgment. At least 2 must be response-dependent tasks.
Optional archiving: Extra tasks can be moved to archived-assertions.json so they do not appear in verify_raw.
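
The lite-version rules (5 ids, comma-separated with no spaces, 1 easy / 1 medium / 3 difficult, at least 2 response-dependent) can be checked before updating the tracker. This is a minimal sketch; the function name, the difficulty labels as strings, and the sample ids are assumptions for illustration, and the difficulty mapping is expected to come from harness results.

```python
from collections import Counter

# Required lite-version composition, taken from the rule above.
REQUIRED_MIX = {"easy": 1, "medium": 1, "difficult": 3}


def validate_lite_ids(cell, difficulty, response_dependent):
    """Return a list of rule violations for the tracker cell value.

    cell: comma-separated task ids as entered in the tracker.
    difficulty: task id -> "easy" / "medium" / "difficult".
    response_dependent: set of response-dependent task ids.
    """
    errors = []
    if " " in cell:
        errors.append("ids must be comma-separated with no spaces")
    ids = [part.strip() for part in cell.split(",") if part.strip()]
    if len(ids) != 5 or len(set(ids)) != 5:
        errors.append("exactly 5 distinct task ids are required")
    mix = Counter(difficulty.get(task_id, "unknown") for task_id in ids)
    if dict(mix) != REQUIRED_MIX:
        errors.append("composition must be 1 easy, 1 medium, 3 difficult")
    if len(set(ids) & set(response_dependent)) < 2:
        errors.append("at least 2 tasks must be response-dependent")
    return errors


# Hypothetical inputs for illustration.
DIFFICULTY = {
    "t01": "easy", "t02": "medium", "t03": "difficult",
    "t04": "difficult", "t05": "difficult", "t06": "easy",
}
RESPONSE_DEPENDENT = {"t03", "t04"}

print(validate_lite_ids("t01,t02,t03,t04,t05", DIFFICULTY, RESPONSE_DEPENDENT))
```

An empty list means the cell value satisfies every rule; otherwise each entry names the rule that was broken.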

Task format

Each task should include only the fields below. Format reference: https://github.com/turing-rlgym/deskzen-1/blob/ots/data/assertions.json

Field	Type	Notes
`task_id`	string	Unique id
`prompt`	string	Realistic prompt covering gym features
`capabilities`	array of strings	Use the same capability names as the UI scope document
`assertions`	array	JSONPath-based assertions (see Verifiers)
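
The field table above can be enforced with a small check run over each task entry. This is a sketch only: it validates field names and top-level types, not the contents of `assertions`, and the sample task values are hypothetical.

```python
# Allowed fields and their expected Python types, mirroring the table above.
ALLOWED_FIELDS = {
    "task_id": str,
    "prompt": str,
    "capabilities": list,
    "assertions": list,
}


def check_task(task):
    """Return a list of problems with one task entry (empty means OK)."""
    problems = []
    for name in sorted(set(task) - set(ALLOWED_FIELDS)):
        problems.append(f"unexpected field: {name}")
    for name, expected_type in ALLOWED_FIELDS.items():
        if name not in task:
            problems.append(f"missing field: {name}")
        elif not isinstance(task[name], expected_type):
            problems.append(f"{name} must be {expected_type.__name__}")
    capabilities = task.get("capabilities")
    if isinstance(capabilities, list) and not all(
        isinstance(item, str) for item in capabilities
    ):
        problems.append("capabilities must contain only strings")
    return problems


# Hypothetical task entry; the assertions list is left empty on purpose.
sample_task = {
    "task_id": "task-001",
    "prompt": "Add one item to the cart and place the order.",
    "capabilities": ["Cart", "Checkout"],
    "assertions": [],
}
print(check_task(sample_task))
print(check_task({**sample_task, "difficulty": "easy"}))
```

The second call reports the extra `difficulty` key, since tasks should carry only the four listed fields.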

Prompt guidance

Prompts must be realistic and cover all available features.
A difficulty field is not required; difficulty is computed from harness results.

Assertions guidance

Include negative assertions to verify unchanged data (for example, check array lengths for entities that should not change).
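
The idea behind a length-based negative assertion can be illustrated by comparing two state snapshots. The sketch below uses plain dotted paths in place of JSONPath and is not the assertion format used in assertions.json; the function names and snapshot contents are hypothetical.

```python
def get_path(state, dotted_path):
    """Resolve a dotted path such as "user.addresses" inside nested dicts."""
    node = state
    for key in dotted_path.split("."):
        node = node[key]
    return node


def changed_lengths(before, after, paths):
    """Return the paths whose array length differs between two snapshots."""
    return [
        path
        for path in paths
        if len(get_path(before, path)) != len(get_path(after, path))
    ]


# Hypothetical snapshots for a task that should only add one order.
before = {"orders": [{"id": 1}], "addresses": [{"id": "a"}, {"id": "b"}]}
after_ok = {"orders": [{"id": 1}, {"id": 2}], "addresses": [{"id": "a"}, {"id": "b"}]}
after_bad = {"orders": [{"id": 1}, {"id": 2}], "addresses": [{"id": "a"}, {"id": "b"}, {"id": "c"}]}

print(changed_lengths(before, after_ok, ["addresses"]))
print(changed_lengths(before, after_bad, ["addresses"]))
```

The first call returns an empty list because the addresses were left alone; the second flags `addresses`, which is the kind of unintended side effect a negative assertion is meant to catch.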

Verifiers

Reference implementation

Dashdoor is the canonical integration reference: https://github.com/turing-rlgym/doordash-1/tree/ots

Integrate cua-gym-utils for the get_actual_state API, the get_expected_state API, and the verify_raw page.
Implement the /localStorage endpoint so the harness can download browser state. Example: https://app.mira.rlgym.turing.com/localStorage
Reference: https://github.com/turing-rlgym/mira/tree/main/app/localStorage
Use JSONPath-based assertions in the same format as https://github.com/turing-rlgym/doordash-1/blob/ots/data/assertions.json.
Manually test each task and verify the checker passes. Run negative tests to ensure failures are detected (for example, add two items when only one is required).
For response-dependent task verification, follow the Verification Guide.

GitHub Packages auth for `cua-gym-utils`

Add an .npmrc file containing @turing-rlgym:registry=https://npm.pkg.github.com
The Docker build must receive the token as a build argument: docker build --build-arg GITHUB_TOKEN=$GITHUB_TOKEN
Use .npmrc only during the build and remove it after npm install (do not commit secrets).

Git Workflow

All OTS changes must follow this workflow:

flowchart LR
  featureBranch["[feature/bug/fix branches]"]
  otsDev["[ots_dev] (PR review)"]
  otsRelease["[ots_release] (deployment)"]
  otsSync["[ots] (sync)"]
  otsSync --> otsDev
  featureBranch --> otsDev
  otsDev --> otsRelease
  otsRelease --> otsSync

ots_dev: PR creation and review
ots_release: deployment after PR approval
ots: sync branch (kept aligned with ots_dev and ots_release)

CI/CD and Automation

CI/CD is handled via GCP.

Docker

See Architecture for the single-port requirement and /api/v1 routing.
Perform any heavy data loading at image build time to avoid long cold starts.

Security requirements

Run containers as a non-root user.
Use .npmrc only during build and remove it after install.
Never commit secrets to git.
Dockerfile reference: doordash-1/Dockerfile

Harness

Run the harness for all tasks (minimum 30 tasks).
Step 1: 1-iteration OpenAI only. Analyze the report and fix any gym or harness issues.
Step 2: 8-iteration OpenAI only. Analyze deeply, especially failures across all 8 iterations.
Step 3: 8-iteration all three models. This report can be reviewed lightly, but investigate any task failing all 8 iterations across all models.
Watch for Playwright crashes. These indicate UI elements that are usable by humans but unstable under automation.
Confirm failures are not caused by time-sensitive features.
More than 75% of the prompts should be model-breaking; the harness report shows this percentage.
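
The Step 3 check for tasks that fail all 8 iterations across all models can be sketched as below. The harness report format is not described in this spec, so the `results` structure (model name to task id to a list of per-iteration pass/fail booleans) and the model names are assumptions.

```python
def always_failing(results):
    """Return task ids that failed every iteration for every model.

    results: model name -> task id -> list of pass/fail booleans,
    one per iteration. Every model is assumed to contain every task.
    """
    task_ids = set()
    for per_task in results.values():
        task_ids.update(per_task)
    return sorted(
        task_id
        for task_id in task_ids
        if all(not any(per_task[task_id]) for per_task in results.values())
    )


# Hypothetical 8-iteration results for three models.
results = {
    "model_a": {"task-001": [True] * 3 + [False] * 5, "task-002": [False] * 8},
    "model_b": {"task-001": [False] * 8, "task-002": [False] * 8},
    "model_c": {"task-001": [False] * 8, "task-002": [False] * 8},
}
print(always_failing(results))
```

Tasks returned here are the ones to investigate first, since a task that no model ever passes may point to a gym, verifier, or time-sensitivity problem instead of genuine difficulty.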

Testing requirements

Manual testing of each task before harness runs.
Negative testing to verify expected failures.
QA sign-off required before ots_release deployment.

Issue Management

Log any data, UI, or harness issues in OTS Testing.
Coordinate harness issues with Muhammad Raveed Ahmad from the harness team.
UI and data issues are owned by the gym team.
Each gym must have a GitHub Projects board (created via the repo Projects section) for issue tracking.
One board per gym.
Required columns: Backlog, Ready, In progress, In review, Done.

Team Structure (For Leads)

Each gym must have a dedicated team with clear ownership.
Leads managing multiple gyms must ensure each gym has its own team.
Gym teams should sync daily to keep UI, data, prompts, and verifiers aligned.
Horizontals (prompts, verifiers, UI, data) can sync less frequently (about twice a week).
Focus teams on shipping a complete gym with frequent releases.