Writing Scenarios
Scenarios are the core of AXIS testing. Each scenario defines a task for the agent, criteria for judging success, and optional setup and teardown steps. Well-written scenarios produce consistent, meaningful scores.
Anatomy of a Scenario
A scenario is a JSON file in your configured scenarios directory. Here is a complete example:
```json
{
  "name": "Debug and fix a broken script",
  "prompt": "There is a JavaScript file at src/add.js that has a bug. Find it, fix it, and verify the fix by running the test.",
  "rubric": [
    { "check": "Agent identified the bug (subtraction instead of addition)", "weight": 0.3 },
    { "check": "Agent fixed the bug so add(a, b) returns a + b", "weight": 0.4 },
    { "check": "Agent ran the test and it passed", "weight": 0.3 }
  ],
  "setup": [
    { "action": "run_script", "command": "mkdir -p src && echo 'function add(a,b) { return a-b; }\nmodule.exports = { add };' > src/add.js" },
    { "action": "run_script", "command": "mkdir -p test && echo 'const {add} = require(\"../src/add\");\nconsole.log(add(2,3) === 5 ? \"PASS\" : \"FAIL\");' > test/add.test.js" }
  ],
  "teardown": [
    { "action": "run_script", "command": "rm -rf src test" }
  ]
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Human-readable title shown in reports and CLI output. |
| `prompt` | string | Yes | The task description sent to the agent. |
| `rubric` | string \| object[] | Yes | Success criteria: a plain string or an array of weighted checks. |
| `setup` | object[] | No | Lifecycle actions run before the agent starts. |
| `teardown` | object[] | No | Lifecycle actions run after scoring completes. |
| `agents` | string[] | No | Override which agents run this scenario. Defaults to all configured agents. |
| `skills` | string[] | No | Scenario-specific skills, merged with top-level and agent-level skills. |
| `mcp_servers` | object | No | Scenario-specific MCP servers, merged with top-level MCP servers (scenario wins on name conflict). |
| `variants` | object[] | No | Run multiple configurations of the same scenario. See Variants. |
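Only name, prompt, and rubric are required, so a minimal scenario can be very small. A sketch (the task and file name are illustrative):

```json
{
  "name": "Say hello",
  "prompt": "Create a file named hello.txt containing the single line 'Hello, AXIS'.",
  "rubric": "hello.txt exists and contains the line 'Hello, AXIS'"
}
```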
Writing Effective Prompts
The prompt is what the agent sees as its task. The quality of your prompt directly affects the consistency and usefulness of your scores.
- Be specific about what to do. "Fix the bug" is vague. "Find and fix the bug in `src/add.js` where subtraction is used instead of addition" gives the agent a clear target. Specific prompts produce more consistent results across runs (see the sketch after this list).
- Specify the expected output. If you want a file created, say what it should be called and what it should contain. If you want a test to pass, say which test command to run. Leaving the end state implicit forces the judge to guess what "success" means.
- Scope appropriately. A prompt that asks the agent to "set up a full CI/CD pipeline" tests many things at once and makes it hard to isolate what went wrong. Smaller, focused scenarios produce more actionable scores.
- Avoid giving the agent the answer. The point of testing is to observe how the agent discovers and solves the problem. If you tell the agent exactly which line to change, you are testing its ability to follow instructions, not its ability to debug.
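Putting these guidelines together, a prompt should name the file, the symptom, the expected behavior, and the verification step. A sketch of such a prompt field (the paths and commands here are hypothetical):

```json
"prompt": "The pagination helper in src/list.js returns 9 items per page instead of 10. Find and fix the off-by-one error, then verify the fix by running 'node test/list.test.js' and confirming it prints PASS."
```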
Designing Rubrics
The rubric defines what "success" means. A judge LLM reads the agent's full transcript and evaluates each check on a 0 to 10 scale. Well-designed rubrics produce scores that reflect real quality differences.
Start simple: string rubrics
The simplest rubric is a plain string. The judge reads the transcript and gives a single 0 to 10 score for how well the agent met the description.
```json
{
  "name": "Create an Express server",
  "prompt": "Create a working Express server that listens on port 3000.",
  "rubric": "The agent should create a working Express server on port 3000"
}
```

String rubrics work well for simple scenarios where you just want a holistic pass/fail judgment. The downside is that you get a single score with no visibility into what went right or wrong.
Add structure: check arrays
For more granular scoring, use an array of checks. Each check is evaluated independently, so you can see exactly which criteria the agent met and which it missed.
"rubric": [
{ "check": "Server starts on port 3000" },
{ "check": "GET / returns a 200 response" },
{ "check": "Server has error handling middleware" }
]
When no weight is specified, AXIS distributes weight equally across all checks.
In this example, each check is worth one-third of the Goal Achievement score.
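In other words, the three unweighted checks above behave the same as if the equal split were written out explicitly (weights rounded here for illustration):

```json
"rubric": [
  { "check": "Server starts on port 3000", "weight": 0.33 },
  { "check": "GET / returns a 200 response", "weight": 0.33 },
  { "check": "Server has error handling middleware", "weight": 0.33 }
]
```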
Control importance: weighted checks
Add weight to each check to control how much it contributes to the score. This is
the recommended approach for most scenarios because it lets you express which outcomes matter most.
"rubric": [
{ "check": "Server starts on port 3000", "weight": 0.4 },
{ "check": "GET / returns a 200 response", "weight": 0.3 },
{ "check": "Server has error handling middleware", "weight": 0.3 }
]
In this example, the server starting is weighted highest because it is the core outcome. You can also mix weighted and unweighted checks: AXIS distributes the remaining weight equally across any checks without an explicit weight.
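For example, if one check claims 0.5 and the other two omit weights, each unweighted check receives half of the remaining 0.5, or 0.25 apiece (a sketch illustrating the distribution rule):

```json
"rubric": [
  { "check": "Server starts on port 3000", "weight": 0.5 },
  { "check": "GET / returns a 200 response" },
  { "check": "Server has error handling middleware" }
]
```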
Writing good checks
- Make checks observable. The judge evaluates checks by reading the transcript and examining the workspace. "Agent understood the problem" is hard to verify. "Agent modified `src/add.js` to use addition" is concrete and observable.
- One assertion per check. "Agent fixed the bug and ran the test" is two things. If the agent fixes the bug but skips the test, it is unclear how to score this check. Split compound assertions into separate checks with their own weights.
- Weight by importance. The core outcome (did it work?) should carry more weight than peripheral concerns (did it clean up after itself?). If every check has equal weight, a cosmetic failure impacts the score as much as a functional failure.
Weak: "Agent did a good job" -subjective, hard for the judge to score consistently.
Weak: "Agent used git" -checks behavior, not outcome. What if git was unnecessary?
Strong: "File output.csv contains at least 10 rows of valid CSV data" -concrete, verifiable.
Strong: "The test suite passes with npm test" -clear success criterion.
Setup and Teardown
Setup actions run before the agent starts. Use them to create the starting state that the scenario depends on: files to edit, databases to seed, servers to start.
Teardown actions run after scoring completes. Use them to clean up resources that should not persist between runs.
"setup": [
{ "action": "run_script", "command": "mkdir -p /tmp/workspace" },
{ "action": "run_script", "command": "cp -r fixtures/project/* /tmp/workspace/" }
],
"teardown": [
{ "action": "run_script", "command": "rm -rf /tmp/workspace" }
] Each action runs sequentially with a 30-second timeout. Setup failures abort the job and mark it as failed. Teardown failures are logged but do not block subsequent jobs or affect scores.
Common patterns
- Create test fixtures: Use setup to write files that the agent will need to read, edit, or debug.
- Seed data: Populate a database or create configuration files the agent should work with (see the sketch after this list).
- Initialize a project: Clone a repo, install dependencies, or set up a specific project state.
- Clean up: Remove temp directories, stop background processes, or reset state in teardown.
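For example, the seed-data pattern might write a small fixture "database" during setup and remove it in teardown (the file names here are hypothetical):

```json
"setup": [
  { "action": "run_script", "command": "mkdir -p data" },
  { "action": "run_script", "command": "echo '{\"posts\": []}' > data/db.json" }
],
"teardown": [
  { "action": "run_script", "command": "rm -rf data" }
]
```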
Scenario Organization
The filename (without .json) becomes the scenario key used in reports, CLI commands,
and baseline comparisons. Nested directories create namespaced keys.
- `scenarios/hello-world.json` → key `hello-world`
- `scenarios/cms/create-post.json` → key `cms/create-post`
- `scenarios/api/auth/login.json` → key `api/auth/login`
Use directories to group related scenarios. Agents can be configured to run only specific groups using glob patterns in the agent configuration:
```json
{
  "adapter": "claude-code",
  "scenarios": ["cms/*", "api/*"]
}
```

Agent-specific scenarios
Use the `agents` field in a scenario to restrict which agents run it. This is useful
when a scenario depends on capabilities specific to one agent, or when you want to test
different agents on different tasks.
```json
{
  "name": "Use Claude Code MCP integration",
  "prompt": "Use the filesystem MCP server to list files in /tmp",
  "rubric": "Agent successfully used the MCP filesystem tool",
  "agents": ["claude-code"]
}
```

Variants
Variants let you run the same scenario under different configurations (different skills, MCP servers, prompts, or agent restrictions) without duplicating the entire scenario file.
When `variants` is defined, the base scenario becomes a template: only the variants execute, each inheriting all fields from the parent. To also run the unmodified scenario as a control, add a variant with no overrides (the baseline pattern), as sketched below.
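A minimal sketch of the baseline pattern: one empty variant as the control, plus one override to compare against (the skill path is hypothetical):

```json
"variants": [
  { "name": "baseline" },
  { "name": "with-helper-skill", "skills": ["./skills/helper"] }
]
```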
```json
{
  "name": "Create a blog post",
  "prompt": "Create a new blog post titled 'Hello World' on the CMS.",
  "rubric": [
    { "check": "Blog post was created successfully", "weight": 0.5 },
    { "check": "Title matches 'Hello World'", "weight": 0.5 }
  ],
  "variants": [
    {
      "name": "with-netlify-mcp",
      "mcp_servers": {
        "netlify": { "type": "http", "url": "https://mcp.netlify.com" }
      }
    },
    {
      "name": "with-custom-skill",
      "skills": ["./skills/blog-helper"]
    },
    {
      "name": "alt-prompt",
      "prompt": "Create a draft blog post titled 'Hello World' without publishing."
    }
  ]
}
```
This produces three scenario keys: `create-post@with-netlify-mcp`, `create-post@with-custom-skill`, and `create-post@alt-prompt`. Each variant inherits the parent's rubric, setup, and other fields, then applies its own overrides. The variant name is appended to the scenario key with an `@` separator.
Variant fields
Only `name` is required on a variant. All other fields are optional and inherit from the parent when omitted.
| Field | Type | Behavior |
|---|---|---|
| `name` | string | Required. Must match /^[a-zA-Z0-9_-]+$/. Used in the scenario key. |
| `prompt` | string | Replaces the parent prompt. |
| `rubric` | string \| object[] | Replaces the parent rubric. |
| `skills` | string[] | Replaces the parent's scenario-level skills (top-level and agent-level skills still merge in). |
| `mcp_servers` | object | Merged with the parent's scenario-level MCP servers (variant wins on name conflict). |
| `agents` | string[] | Replaces the parent agent restriction. |
| `setup` / `teardown` | object[] | Replaces the parent lifecycle actions. |
| `skip` | boolean | Overrides the parent skip flag. |
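For instance, a variant can set skip to keep an experimental configuration defined but inactive while the control keeps running (the names and skill path here are illustrative):

```json
"variants": [
  { "name": "stable" },
  { "name": "experimental", "skills": ["./skills/beta"], "skip": true }
]
```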
CLI filters and agent-level `scenarios` globs work with variant keys. Filtering by the base key matches all of its variants: `--scenario create-post` runs all three variants above. Use the full key to target a specific variant: `--scenario create-post@with-netlify-mcp`.
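Because agent-level `scenarios` globs accept variant keys, an agent can also be pinned to a single variant in its configuration. A sketch, assuming the create-post scenario above:

```json
{
  "adapter": "claude-code",
  "scenarios": ["create-post@with-netlify-mcp"]
}
```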
Example Scenarios
File creation: Generate a README from scratch with no setup required.
```json
{
  "name": "Create a README",
  "prompt": "Create a README.md file for a Node.js project called 'my-api'. Include a title, description, install instructions, and usage example.",
  "rubric": [
    { "check": "README.md exists" },
    { "check": "Contains a project title and description" },
    { "check": "Contains npm install instructions" },
    { "check": "Contains a usage or getting started example" }
  ]
}
```

Bug fix with verification: Find and fix a bug in a pre-seeded project, then confirm the test passes.
```json
{
  "name": "Fix failing test",
  "prompt": "The test in test/math.test.js is failing. Find the bug in src/math.js, fix it, and verify the test passes.",
  "rubric": [
    { "check": "Agent identified the root cause", "weight": 0.2 },
    { "check": "Agent fixed src/math.js correctly", "weight": 0.4 },
    { "check": "Agent ran the test and it passed", "weight": 0.4 }
  ],
  "setup": [
    { "action": "run_script", "command": "mkdir -p src test" },
    { "action": "run_script", "command": "echo 'exports.multiply = (a, b) => a + b;' > src/math.js" },
    { "action": "run_script", "command": "echo 'const {multiply} = require(\"../src/math\"); console.assert(multiply(3,4) === 12, \"Expected 12\"); console.log(\"PASS\");' > test/math.test.js" }
  ]
}
```

Multi-step with setup and teardown: Add a new API endpoint, write tests, and clean up.
```json
{
  "name": "Add API endpoint with tests",
  "prompt": "Add a GET /api/health endpoint to the Express app in src/app.js. It should return { status: 'ok', uptime: process.uptime() }. Write a test in test/health.test.js.",
  "rubric": [
    { "check": "GET /api/health endpoint exists", "weight": 0.3 },
    { "check": "Returns JSON with status and uptime", "weight": 0.3 },
    { "check": "Test file exists and covers the endpoint", "weight": 0.2 },
    { "check": "All tests pass", "weight": 0.2 }
  ],
  "setup": [
    { "action": "run_script", "command": "npm init -y && npm install express" },
    { "action": "run_script", "command": "mkdir -p src test" }
  ],
  "teardown": [
    { "action": "run_script", "command": "rm -rf node_modules package.json src test" }
  ]
}
```

Multi-variant scenario: Test the same task with different tool configurations.
```json
{
  "name": "Deploy a site",
  "prompt": "Deploy the project in the current directory to production.",
  "rubric": [
    { "check": "Site was deployed successfully", "weight": 0.5 },
    { "check": "Agent confirmed the deploy URL", "weight": 0.3 },
    { "check": "No errors in the deployment log", "weight": 0.2 }
  ],
  "setup": [
    { "action": "run_script", "command": "npm init -y && echo '<h1>Hello</h1>' > index.html" }
  ],
  "variants": [
    {
      "name": "baseline"
    },
    {
      "name": "with-mcp",
      "mcp_servers": {
        "netlify": { "type": "http", "url": "https://mcp.netlify.com" }
      }
    },
    {
      "name": "with-deploy-skill",
      "skills": ["./skills/deploy"]
    }
  ]
}
```
The first variant, `baseline`, has no overrides: it runs the scenario exactly as defined, giving you a control run to compare against. The other variants layer on different tool configurations. This produces three keys: `deploy@baseline`, `deploy@with-mcp`, and `deploy@with-deploy-skill`.