Writing Scenarios

Scenarios are the core of AXIS testing. Each scenario defines a task for the agent, criteria for judging success, and optional setup and teardown steps. Well-written scenarios produce consistent, meaningful scores.

Anatomy of a Scenario

A scenario is a JSON file in your configured scenarios directory. Here is a complete example:

{
  "name": "Debug and fix a broken script",
  "prompt": "There is a JavaScript file at src/add.js that has a bug. Find it, fix it, and verify the fix by running the test.",
  "rubric": [
    { "check": "Agent identified the bug (subtraction instead of addition)", "weight": 0.3 },
    { "check": "Agent fixed the bug so add(a, b) returns a + b", "weight": 0.4 },
    { "check": "Agent ran the test and it passed", "weight": 0.3 }
  ],
  "setup": [
    { "action": "run_script", "command": "mkdir -p src && echo 'function add(a,b) { return a-b; }\nmodule.exports = { add };' > src/add.js" },
    { "action": "run_script", "command": "mkdir -p test && echo 'const {add} = require(\"../src/add\");\nconsole.log(add(2,3) === 5 ? \"PASS\" : \"FAIL\");' > test/add.test.js" }
  ],
  "teardown": [
    { "action": "run_script", "command": "rm -rf src test" }
  ]
}
name (string, required): Human-readable title shown in reports and CLI output.
prompt (string, required): The task description sent to the agent.
rubric (string | object[], required): Success criteria, given as a plain string or an array of weighted checks.
setup (object[], optional): Lifecycle actions run before the agent starts.
teardown (object[], optional): Lifecycle actions run after scoring completes.
agents (string[], optional): Overrides which agents run this scenario. Defaults to all configured agents.
skills (string[], optional): Scenario-specific skills, merged with top-level and agent-level skills.
mcp_servers (object, optional): Scenario-specific MCP servers, merged with top-level MCP servers (the scenario wins on name conflicts).
variants (object[], optional): Runs multiple configurations of the same scenario. See Variants.

Writing Effective Prompts

The prompt is what the agent sees as its task. The quality of your prompt directly affects the consistency and usefulness of your scores.
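A useful rule of thumb, reflected in the examples throughout this page, is to name the artifact, the outcome, and the verification step. As an illustrative sketch (the wording is yours to choose), a prompt like "Improve the math module" gives the agent and the judge little to verify, while a prompt that pins all three down reads:

```json
{
  "prompt": "The test in test/math.test.js is failing. Find the bug in src/math.js, fix it, and verify the test passes."
}
```

A judge can score this consistently because the success condition, a passing test, is stated in the prompt itself.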

Designing Rubrics

The rubric defines what "success" means. A judge LLM reads the agent's full transcript and evaluates each check on a 0 to 10 scale. Well-designed rubrics produce scores that reflect real quality differences.

Start simple: string rubrics

The simplest rubric is a plain string. The judge reads the transcript and gives a single 0 to 10 score for how well the agent met the description.

{
  "name": "Create an Express server",
  "prompt": "Create a working Express server that listens on port 3000.",
  "rubric": "The agent should create a working Express server on port 3000"
}

String rubrics work well for simple scenarios where you just want a holistic pass/fail judgment. The downside is that you get a single score with no visibility into what went right or wrong.

Add structure: check arrays

For more granular scoring, use an array of checks. Each check is evaluated independently, so you can see exactly which criteria the agent met and which it missed.

"rubric": [
  { "check": "Server starts on port 3000" },
  { "check": "GET / returns a 200 response" },
  { "check": "Server has error handling middleware" }
]

When no weight is specified, AXIS distributes weight equally across all checks. In this example, each check is worth one-third of the Goal Achievement score.

Control importance: weighted checks

Add weight to each check to control how much it contributes to the score. This is the recommended approach for most scenarios because it lets you express which outcomes matter most.

"rubric": [
  { "check": "Server starts on port 3000", "weight": 0.4 },
  { "check": "GET / returns a 200 response", "weight": 0.3 },
  { "check": "Server has error handling middleware", "weight": 0.3 }
]

In this example, the server starting is weighted highest because it is the core outcome. You can also mix weighted and unweighted checks; AXIS distributes the remaining weight equally across any checks without an explicit weight.
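AXIS's actual aggregation code is not shown here, but the weight rules above can be sketched as follows; this is a minimal illustration with a function name and object shapes of my own choosing, assuming each check receives a 0 to 10 judge score:

```javascript
// Sketch (not AXIS's implementation): combine per-check judge scores
// (0-10) into one Goal Achievement score. Checks without an explicit
// weight share the remaining weight equally.
function combineRubricScores(checks) {
  const explicit = checks.filter((c) => typeof c.weight === "number");
  const implicit = checks.filter((c) => typeof c.weight !== "number");
  const used = explicit.reduce((sum, c) => sum + c.weight, 0);
  const share = implicit.length > 0 ? (1 - used) / implicit.length : 0;
  return checks.reduce(
    (total, c) =>
      total + (typeof c.weight === "number" ? c.weight : share) * c.score,
    0
  );
}

// Three unweighted checks scored 10, 10, 0: each is worth 1/3,
// so the combined score is 20/3 ≈ 6.67.
console.log(
  combineRubricScores([
    { check: "Server starts on port 3000", score: 10 },
    { check: "GET / returns a 200 response", score: 10 },
    { check: "Server has error handling middleware", score: 0 },
  ]).toFixed(2)
);
```

With a mix, a check weighted 0.4 keeps its weight and the two unweighted checks split the remaining 0.6 equally.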

Writing good checks

Strong vs Weak Checks

Weak: "Agent did a good job" (subjective; hard for the judge to score consistently).
Weak: "Agent used git" (checks behavior, not outcome; what if git was unnecessary?).
Strong: "File output.csv contains at least 10 rows of valid CSV data" (concrete and verifiable).
Strong: "The test suite passes with npm test" (a clear success criterion).
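Dropped into a rubric, the two strong checks above might look like this (the weights here are illustrative, not prescribed):

```json
"rubric": [
  { "check": "File output.csv contains at least 10 rows of valid CSV data", "weight": 0.6 },
  { "check": "The test suite passes with npm test", "weight": 0.4 }
]
```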

Setup and Teardown

Setup actions run before the agent starts. Use them to create the starting state that the scenario depends on: files to edit, databases to seed, servers to start.

Teardown actions run after scoring completes. Use them to clean up resources that should not persist between runs.

"setup": [
  { "action": "run_script", "command": "mkdir -p /tmp/workspace" },
  { "action": "run_script", "command": "cp -r fixtures/project/* /tmp/workspace/" }
],
"teardown": [
  { "action": "run_script", "command": "rm -rf /tmp/workspace" }
]
Lifecycle Details

Each action runs sequentially with a 30-second timeout. Setup failures abort the job and mark it as failed. Teardown failures are logged but do not block subsequent jobs or affect scores.
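Because each action gets its own 30-second timeout, one way to keep slow setups within budget is to split long steps into separate actions rather than chaining everything into a single command. A sketch (adjust to your project):

```json
"setup": [
  { "action": "run_script", "command": "npm init -y" },
  { "action": "run_script", "command": "npm install express" }
]
```

If either action fails, setup aborts and the job is marked failed, so the agent never starts against a half-built workspace.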

Common patterns

Scenario Organization

The filename (without .json) becomes the scenario key used in reports, CLI commands, and baseline comparisons. Nested directories create namespaced keys.
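For example, a layout like the following (the directory and file names are illustrative) produces the keys shown on the right:

```
scenarios/
  cms/
    create-post.json    →  key: cms/create-post
  api/
    health-check.json   →  key: api/health-check
```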

Use directories to group related scenarios. Agents can be configured to run only specific groups using glob patterns in the agent configuration:

{
  "adapter": "claude-code",
  "scenarios": ["cms/*", "api/*"]
}

Agent-specific scenarios

Use the agents field in a scenario to restrict which agents run it. This is useful when a scenario depends on capabilities specific to one agent, or when you want to test different agents on different tasks.

{
  "name": "Use Claude Code MCP integration",
  "prompt": "Use the filesystem MCP server to list files in /tmp",
  "rubric": "Agent successfully used the MCP filesystem tool",
  "agents": ["claude-code"]
}

Variants

Variants let you run the same scenario under different configurations (different skills, MCP servers, prompts, or agent restrictions) without duplicating the entire scenario file. When variants is defined, the base scenario becomes a template: only the variants execute, each inheriting all fields from the parent. To also run the unmodified scenario as a control, add a variant with no overrides (the baseline pattern).

{
  "name": "Create a blog post",
  "prompt": "Create a new blog post titled 'Hello World' on the CMS.",
  "rubric": [
    { "check": "Blog post was created successfully", "weight": 0.5 },
    { "check": "Title matches 'Hello World'", "weight": 0.5 }
  ],
  "variants": [
    {
      "name": "with-netlify-mcp",
      "mcp_servers": {
        "netlify": { "type": "http", "url": "https://mcp.netlify.com" }
      }
    },
    {
      "name": "with-custom-skill",
      "skills": ["./skills/blog-helper"]
    },
    {
      "name": "alt-prompt",
      "prompt": "Create a draft blog post titled 'Hello World' without publishing."
    }
  ]
}

This produces three scenario keys: create-post@with-netlify-mcp, create-post@with-custom-skill, and create-post@alt-prompt. Each variant inherits the parent's rubric, setup, and other fields, then applies its own overrides. The variant name is appended to the scenario key with an @ separator.

Variant fields

Only name is required on a variant. All other fields are optional and inherit from the parent when omitted.

name (string): Required. Must match /^[a-zA-Z0-9_-]+$/. Used in the scenario key.
prompt (string): Replaces the parent prompt.
rubric (string | object[]): Replaces the parent rubric.
skills (string[]): Replaces the parent's scenario-level skills (top-level and agent-level skills still merge in).
mcp_servers (object): Merged with the parent's scenario-level MCP servers (the variant wins on name conflicts).
agents (string[]): Replaces the parent agent restriction.
setup / teardown (object[]): Replaces the parent lifecycle actions.
skip (boolean): Overrides the parent skip flag.
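The replace-versus-merge distinction matters in practice. In this sketch (the cms server URL is hypothetical; the netlify one comes from the examples above), the variant's skills list replaces the parent's entirely, while its mcp_servers are merged on top:

```json
{
  "skills": ["./skills/base"],
  "mcp_servers": {
    "cms": { "type": "http", "url": "https://cms.example/mcp" }
  },
  "variants": [
    {
      "name": "alt-tools",
      "skills": ["./skills/alt"],
      "mcp_servers": {
        "netlify": { "type": "http", "url": "https://mcp.netlify.com" }
      }
    }
  ]
}
```

The alt-tools variant runs with only ./skills/alt (the parent's ./skills/base is dropped) but with both the cms and netlify MCP servers available.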
Filtering Variants

CLI filters and agent-level scenarios globs work with variant keys. Filtering by the base key matches all its variants: --scenario create-post runs all three variants above. Use the full key to target a specific variant: --scenario create-post@with-netlify-mcp.

Example Scenarios

File creation: generate a README from scratch with no setup required.
{
  "name": "Create a README",
  "prompt": "Create a README.md file for a Node.js project called 'my-api'. Include a title, description, install instructions, and usage example.",
  "rubric": [
    { "check": "README.md exists" },
    { "check": "Contains a project title and description" },
    { "check": "Contains npm install instructions" },
    { "check": "Contains a usage or getting started example" }
  ]
}
Bug fix with verification: find and fix a bug in a pre-seeded project, then confirm the test passes.
{
  "name": "Fix failing test",
  "prompt": "The test in test/math.test.js is failing. Find the bug in src/math.js, fix it, and verify the test passes.",
  "rubric": [
    { "check": "Agent identified the root cause", "weight": 0.2 },
    { "check": "Agent fixed src/math.js correctly", "weight": 0.4 },
    { "check": "Agent ran the test and it passed", "weight": 0.4 }
  ],
  "setup": [
    { "action": "run_script", "command": "mkdir -p src test" },
    { "action": "run_script", "command": "echo 'exports.multiply = (a, b) => a + b;' > src/math.js" },
    { "action": "run_script", "command": "echo 'const {multiply} = require(\"../src/math\"); if (multiply(3,4) !== 12) { console.error(\"FAIL: expected 12\"); process.exit(1); } console.log(\"PASS\");' > test/math.test.js" }
  ]
}
Multi-step with setup and teardown: add a new API endpoint, write tests, and clean up.
{
  "name": "Add API endpoint with tests",
  "prompt": "Add a GET /api/health endpoint to the Express app in src/app.js. It should return { status: 'ok', uptime: process.uptime() }. Write a test in test/health.test.js.",
  "rubric": [
    { "check": "GET /api/health endpoint exists", "weight": 0.3 },
    { "check": "Returns JSON with status and uptime", "weight": 0.3 },
    { "check": "Test file exists and covers the endpoint", "weight": 0.2 },
    { "check": "All tests pass", "weight": 0.2 }
  ],
  "setup": [
    { "action": "run_script", "command": "npm init -y && npm install express" },
    { "action": "run_script", "command": "mkdir -p src test" }
  ],
  "teardown": [
    { "action": "run_script", "command": "rm -rf node_modules package.json src test" }
  ]
}
Multi-variant scenario: test the same task with different tool configurations.
{
  "name": "Deploy a site",
  "prompt": "Deploy the project in the current directory to production.",
  "rubric": [
    { "check": "Site was deployed successfully", "weight": 0.5 },
    { "check": "Agent confirmed the deploy URL", "weight": 0.3 },
    { "check": "No errors in the deployment log", "weight": 0.2 }
  ],
  "setup": [
    { "action": "run_script", "command": "npm init -y && echo '<h1>Hello</h1>' > index.html" }
  ],
  "variants": [
    {
      "name": "baseline"
    },
    {
      "name": "with-mcp",
      "mcp_servers": {
        "netlify": { "type": "http", "url": "https://mcp.netlify.com" }
      }
    },
    {
      "name": "with-deploy-skill",
      "skills": ["./skills/deploy"]
    }
  ]
}

The first variant, baseline, has no overrides: it runs the scenario exactly as defined, giving you a control run to compare against. The other variants layer on different tool configurations. This produces three keys: deploy@baseline, deploy@with-mcp, and deploy@with-deploy-skill.