OTS Gym Specification

Overview

This document defines the required scope, architecture, workflow, and quality standards for OTS gyms. It is the single source of truth for all teams contributing to UI, data, tasks, verifiers, harness execution, and deployment.

Deliverables Checklist

| Deliverable | Notes and instructions |
| --- | --- |
| UI Scope document | Upload to rl_temp and add the link in OTS-2 / Sales Tracker [RLGymUI]. Format requirements are in the UI section. |
| Gym repo with Dockerfile | Dockerfile must meet the Docker requirements in this spec. |
| Harness reports (3) | Required: 1-iteration OpenAI (may contain bugs), 8-iteration OpenAI (may contain bugs), 8-iteration all three models (must be bug-free). Upload all reports to rl_temp and add links in OTS-2 / Sales Tracker [RLGymUI]. |
| Convert Excel sheet to Google Sheets | Ensure the shared source of truth is a Google Sheet. |
| Lite version task ids | Update column F in OTS-2 / Sales Tracker [RLGymUI] with 5 task ids, comma-separated with no spaces. Composition rules are in Tasks. |
| Docker images (tar files) | Not needed; generated by script. |

Architecture

  • Single port: UI and backend are served from port 3000. Backend APIs are under /api/v1.
  • Stateless backend: All state is managed in localStorage. Any backend data must be treated as immutable (no mutation of saved data). OTS gyms do not support the runid concept. Verification relies on the localStorage.json snapshot.
  • Time-sensitive features: Any time-dependent UI or data must present a consistent view across harness runs and during verification.

UI Requirements

Security patching is mandatory

Certain React and Next.js versions have known vulnerabilities. Patch the gym using the approved guide before deployment. These vulnerabilities caused a previous Coolify server compromise.

  • Scope document: Must follow the same format as Gmail - User Journeys. Use status values pending, partially, done. Do not use sprints.
  • Coverage: UI should cover at least 20% of the original gym UI features, with 20–30% as the expected range.
  • QA: UI must be thoroughly tested by dedicated QAs and be bug-free.
  • Time-sensitive UI: For features like restaurant open/close status (e.g., dashdoor), either make the state the harness sees consistent across runs or provide verification guidance that accounts for the time dependence.
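
One way to keep a time-dependent feature such as open/close status consistent is to compute it from an injected reference time instead of the system clock. The sketch below only illustrates that idea: this spec does not prescribe a mechanism, and is_open, FROZEN_NOW, and the opening hours are hypothetical names and values.

```python
from datetime import datetime, time

# Hypothetical fixed reference time used for harness runs (an assumption,
# not a value defined by this spec).
FROZEN_NOW = datetime(2025, 1, 6, 12, 0)


def is_open(opens: time, closes: time, now: datetime) -> bool:
    """Return True if `now` falls inside [opens, closes).

    Handles overnight hours (for example 18:00 to 02:00). The caller passes
    `now` explicitly, so the result does not depend on when the run happens.
    """
    current = now.time()
    if opens <= closes:
        return opens <= current < closes
    # Overnight window: open late in the day or early the next morning.
    return current >= opens or current < closes


# Same inputs always give the same status, regardless of wall-clock time.
lunch_status = is_open(time(9, 0), time(21, 0), FROZEN_NOW)
```

Because the reference time is an argument, the same check can run with the real clock in production and with a fixed value during harness runs and verification.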

Data Requirements

  • Data must be realistic and QA-validated (no duplicates such as two tickets with the same name).
  • Data can live in a dedicated DB or in code. For large datasets, fetch via backend APIs/SQL to avoid long startup delays from full seed transfer.

Tasks and Assertions

Task volume and composition

  • Task count: Minimum 30, maximum 40.
  • Response-dependent tasks: At least 10.
  • Lite version: Provide 5 task ids (1 easy, 1 medium, 3 difficult) based on harness results, not judgment. At least 2 must be response-dependent tasks.
  • Optional archiving: Extra tasks can be moved to archived-assertions.json so they do not appear in verify_raw.
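
The volume and lite-version rules above can be checked mechanically before the tracker is updated. The following is a minimal sketch, assuming each task is tracked in a working record with hypothetical `response_dependent` and `difficulty` keys; these keys are not part of the task format in assertions.json, and the difficulty values are assumed to come from harness results.

```python
from collections import Counter


def validate_task_set(tasks):
    """Return a list of problems with the full task set (empty if none)."""
    problems = []
    if not 30 <= len(tasks) <= 40:
        problems.append(f"task count {len(tasks)} is outside 30-40")
    dependent = sum(1 for t in tasks if t["response_dependent"])
    if dependent < 10:
        problems.append(f"only {dependent} response-dependent tasks, need at least 10")
    return problems


def validate_lite_selection(lite_tasks):
    """Check the lite version: 5 tasks, 1 easy / 1 medium / 3 difficult,
    at least 2 of them response-dependent."""
    problems = []
    if len(lite_tasks) != 5:
        problems.append(f"lite version has {len(lite_tasks)} tasks, need 5")
    mix = Counter(t["difficulty"] for t in lite_tasks)
    if (mix["easy"], mix["medium"], mix["difficult"]) != (1, 1, 3):
        problems.append("lite version must be 1 easy, 1 medium, 3 difficult")
    if sum(1 for t in lite_tasks if t["response_dependent"]) < 2:
        problems.append("lite version needs at least 2 response-dependent tasks")
    return problems


def format_lite_ids(lite_tasks):
    """Join task ids with commas and no spaces, as required for column F."""
    return ",".join(t["task_id"] for t in lite_tasks)
```

Each validator returns a list of messages instead of raising, so every problem can be reported in one pass.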

Task format

Each task should include only the fields below. Format reference: https://github.com/turing-rlgym/deskzen-1/blob/ots/data/assertions.json

| Field | Type | Notes |
| --- | --- | --- |
| task_id | string | Unique id |
| prompt | string | Realistic prompt covering gym features |
| capabilities | array of strings | Use the same capability names as the UI scope document |
| assertions | array | JSONPath-based assertions (see Verifiers) |
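
A structural check for the fields above might look like the sketch below. It only enforces what this section states (the four fields, their types, and unique ids); the shape of each entry in `assertions` is defined by the referenced assertions.json and is not checked here. The function names are illustrative.

```python
# Field names and types taken from the task format in this section.
ALLOWED_FIELDS = {
    "task_id": str,
    "prompt": str,
    "capabilities": list,
    "assertions": list,
}


def validate_task(task):
    """Return a list of structural problems for one task (empty if none)."""
    problems = []
    extra = set(task) - set(ALLOWED_FIELDS)
    missing = set(ALLOWED_FIELDS) - set(task)
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    for name, expected in ALLOWED_FIELDS.items():
        if name in task and not isinstance(task[name], expected):
            problems.append(f"{name} should be of type {expected.__name__}")
    capabilities = task.get("capabilities")
    if isinstance(capabilities, list) and not all(isinstance(c, str) for c in capabilities):
        problems.append("capabilities must contain only strings")
    return problems


def find_duplicate_ids(tasks):
    """Return the sorted task ids that appear more than once."""
    seen, duplicates = set(), set()
    for task in tasks:
        task_id = task.get("task_id")
        if task_id in seen:
            duplicates.add(task_id)
        seen.add(task_id)
    return sorted(duplicates)
```

Running this kind of check also catches a stray `difficulty` field, which the task format does not require.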

Prompt guidance

  • Prompts must be realistic and, taken together, cover all available features.
  • Difficulty field is not required (difficulty is computed from harness results).

Assertions guidance

  • Include negative assertions to verify unchanged data (for example, check array lengths for entities that should not change).
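
The idea behind a negative assertion can be illustrated by comparing array lengths between the seed state and the final snapshot for entities a task should not touch. This is a sketch of the concept only: real assertions use the JSONPath format from the referenced assertions.json, and the snapshot layout and the keys `tickets` and `customers` are hypothetical.

```python
def changed_lengths(before, after, keys):
    """Return the keys whose array length differs between two snapshots.

    `before` and `after` are assumed to be dicts parsed from a
    localStorage.json-style snapshot, mapping entity names to arrays.
    """
    return [k for k in keys if len(before.get(k, [])) != len(after.get(k, []))]


# Hypothetical example: the task should add nothing to either collection.
seed = {"tickets": [{"id": 1}, {"id": 2}], "customers": [{"id": 10}]}
final = {"tickets": [{"id": 1}, {"id": 2}, {"id": 3}], "customers": [{"id": 10}]}

# A non-empty result means data changed that should have stayed untouched.
violations = changed_lengths(seed, final, ["tickets", "customers"])
```

A length check is cheap but coarse: it detects additions and deletions, not in-place edits to an existing entry.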

Verifiers

Reference implementation

Dashdoor is the canonical integration reference: https://github.com/turing-rlgym/doordash-1/tree/ots

  • Integrate cua-gym-utils for:
      • get_actual_state API
      • get_expected_state API
      • verify_raw page
  • Implement the /localStorage endpoint so the harness can download browser state. Example: https://app.mira.rlgym.turing.com/localStorage
    Reference: https://github.com/turing-rlgym/mira/tree/main/app/localStorage
  • Use JSONPath-based assertions (same format as in: https://github.com/turing-rlgym/doordash-1/blob/ots/data/assertions.json).
  • Manually test each task and verify the checker passes. Run negative tests to ensure failures are detected (for example, add two items when only one is required).
  • For response-dependent task verification, follow the Verification Guide.

GitHub Packages auth for cua-gym-utils

  • Add .npmrc with @turing-rlgym:registry=https://npm.pkg.github.com
  • Docker build must pass the token as a build argument: docker build --build-arg GITHUB_TOKEN=$GITHUB_TOKEN .
  • Use .npmrc only during build and remove it after npm install (do not commit secrets)

Git Workflow

All OTS changes must follow this workflow:

flowchart LR
  featureBranch["[feature/bug/fix branches]"]
  otsDev["[ots_dev] (PR review)"]
  otsRelease["[ots_release] (deployment)"]
  otsSync["[ots] (sync)"]
  otsSync --> otsDev
  featureBranch --> otsDev
  otsDev --> otsRelease
  otsRelease --> otsSync

  • ots_dev: PR creation and review
  • ots_release: deployment after PR approval
  • ots: sync branch (kept aligned with ots_dev and ots_release)

CI/CD and Automation

  • CI/CD is handled via GCP.

Docker

  • See Architecture for the single-port requirement and /api/v1 routing.
  • Any heavy data loading should happen at image build time, not at container startup, to avoid long cold starts.

Security requirements

  • Run containers as a non-root user.
  • Use .npmrc only during build and remove it after install.
  • Never commit secrets to git.
  • Dockerfile reference: doordash-1/Dockerfile

Harness

  • Run harness for all tasks (minimum 30 tasks).
  • Step 1: 1-iteration OpenAI only. Analyze the report and fix any gym or harness issues.
  • Step 2: 8-iteration OpenAI only. Analyze the report in depth, especially tasks that fail in all 8 iterations.
  • Step 3: 8-iteration all three models. This report can be reviewed lightly, but investigate any task failing all 8 iterations across all models.
  • Watch for Playwright crashes. These indicate UI elements that are human-usable but automation-unstable.
  • Confirm failures are not caused by time-sensitive features.
  • More than 75% of the prompts should be model-breaking (the harness report shows this percentage).
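
Finding the tasks that fail every iteration for every model (the ones Step 3 asks to investigate) is straightforward once results are in a structured form. The sketch below assumes a hypothetical layout, task id to model name to a list of per-iteration pass/fail booleans; the actual harness report format may differ, and the model names are placeholders.

```python
def pass_rate(runs):
    """Fraction of iterations that passed (runs is a list of booleans)."""
    return sum(runs) / len(runs)


def always_failing(results):
    """Return sorted task ids that failed every iteration for every model.

    `results` maps task_id -> {model_name: [bool, ...]}; this layout is an
    assumption made for illustration.
    """
    return sorted(
        task_id
        for task_id, by_model in results.items()
        if all(not any(runs) for runs in by_model.values())
    )


# Hypothetical 8-iteration results for two tasks and two placeholder models.
results = {
    "task-01": {"model_a": [False] * 8, "model_b": [False] * 8},
    "task-02": {"model_a": [True] + [False] * 7, "model_b": [False] * 8},
}

to_investigate = always_failing(results)
```

A task that fails everywhere is more likely to point to a gym, verifier, or prompt issue than to model difficulty, which is why it deserves a closer look.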

Testing requirements

  • Manual testing of each task before harness runs.
  • Negative testing to verify expected failures.
  • QA sign-off required before ots_release deployment.

Issue Management

  • Log any data, UI, or harness issues in OTS Testing.
  • Coordinate harness issues with Muhammad Raveed Ahmad from the harness team.
  • UI and data issues are owned by the gym team.
  • Each gym must have a GitHub Projects board (created via the repo Projects section) for issue tracking.
  • One board per gym.
  • Required columns: Backlog, Ready, In progress, In review, Done.

Team Structure (For Leads)

  • Each gym must have a dedicated team with clear ownership.
  • Leads managing multiple gyms must ensure each gym has its own team.
  • Gym teams should sync daily to keep UI, data, prompts, and verifiers aligned.
  • Horizontals (prompts, verifiers, UI, data) can sync less frequently (about twice a week).
  • Focus teams on shipping a complete gym with frequent releases.