AXIS Scoring Framework
AXIS produces a composite 0 to 100 AXIS Result by evaluating four independent dimensions of agent performance. Each dimension captures a different aspect of how the agent interacted with your system, giving you a clear picture of where to improve.
The Four Dimensions

| Dimension | Default Weight | What It Measures |
|---|---|---|
| Goal Achievement | 0.4 | Whether the agent accomplished the task, judged against your scenario rubric |
| Environment | 0.2 | How well local tools and the runtime performed during execution |
| Service | 0.2 | How well external services and APIs performed |
| Agent | 0.2 | The quality of the agent's decisions across the entire execution |
Composite AXIS Result
The final AXIS Result is the weighted average of all four dimension scores.
AXIS Result = (Goal Achievement × 0.4) + (Environment × 0.2) + (Service × 0.2) + (Agent × 0.2)

All dimension scores are 0 to 100. The composite result is rounded to the nearest whole number. Weights are configurable in your axis.config.json and must sum to 1.0.
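As a concrete illustration, here is the composite calculation in TypeScript. The names are hypothetical, but the weights and rounding follow the defaults above:

```ts
// Illustrative sketch of the composite AXIS Result; names are hypothetical.
interface DimensionScores {
  goalAchievement: number; // 0-100
  environment: number;     // 0-100
  service: number;         // 0-100
  agent: number;           // 0-100
}

// Default weights; configurable in axis.config.json, must sum to 1.0.
const DEFAULT_WEIGHTS = {
  goalAchievement: 0.4,
  environment: 0.2,
  service: 0.2,
  agent: 0.2,
};

function axisResult(s: DimensionScores, w = DEFAULT_WEIGHTS): number {
  const weighted =
    s.goalAchievement * w.goalAchievement +
    s.environment * w.environment +
    s.service * w.service +
    s.agent * w.agent;
  return Math.round(weighted); // rounded to the nearest whole number
}

// axisResult({ goalAchievement: 85, environment: 90, service: 70, agent: 60 })
// => 0.4*85 + 0.2*90 + 0.2*70 + 0.2*60 = 34 + 18 + 14 + 12 = 78
```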
Goal Achievement
Goal Achievement is evaluated by a judge that reads the full agent transcript and compares the outcome against the rubric checks you defined in the scenario. Each check receives a score from 0 to 10, which is scaled to 0 to 100 and combined using the check weights.
This is the only dimension driven entirely by your rubric. The other three dimensions are calculated from the agent's interaction transcript using a separate evaluation pipeline.
Goal Achievement carries the highest default weight (40%) because the most fundamental question is whether the agent accomplished what you asked it to do. The other three dimensions measure how it got there.
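The check math can be sketched in a few lines. The RubricCheck shape below is hypothetical, not AXIS's actual scenario schema; it only illustrates the scale-then-weight combination described above:

```ts
// Hypothetical shape for a rubric check: a judge score from 0 to 10
// plus the relative weight you assigned in the scenario definition.
interface RubricCheck {
  score: number;  // judge score, 0-10
  weight: number; // relative check weight
}

// Scale each check to 0-100, then combine using the check weights.
function goalAchievement(checks: RubricCheck[]): number {
  const totalWeight = checks.reduce((sum, c) => sum + c.weight, 0);
  const weightedSum = checks.reduce(
    (sum, c) => sum + (c.score / 10) * 100 * c.weight,
    0,
  );
  return weightedSum / totalWeight;
}
```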
Environment
Environment measures how well local tools and the runtime performed during execution. It answers the question: did the tools the agent used work correctly and respond quickly?
This dimension covers OS, filesystem, and dev tooling interactions:
- Shell: bash, shell, terminal, exec.
- File ops: read, write, edit, glob, grep, cat, head, tail, find, ls, mkdir, rm, cp, mv.
- Version control: git.
- Package managers: npm, yarn, pip, cargo, go, brew, apt.
- Build and test: make, tsc, docker, kubectl, node, python.
Environment is scored on Success (0.7) and Speed (0.3) only. It does not evaluate whether the agent should have run a command - only whether the command itself executed correctly and promptly. A low Environment score points to tool failures or slow local operations, not poor agent decisions.
Service
Service measures how well external services and APIs performed. It answers the question: did the APIs and remote tools the agent called respond successfully and within reasonable time?
This dimension covers external APIs, MCP tools, network calls, and custom services. Any tool interaction not matching the environment or agent patterns is classified as a service interaction.
Like Environment, Service is scored on Success (0.7) and Speed (0.3) only. It evaluates execution quality, not decision quality. A low Service score means external services are failing or responding slowly - it does not reflect on the agent's choice to call them.
Agent
Agent measures the quality of the agent's decisions across the entire execution. It answers the question: did the agent make good choices about what to do and how to do it?
This dimension directly covers self-organization and metacognition interactions:
- Tool discovery: toolsearch, listtoolsets, list_tools.
- Task management: taskcreate, taskupdate, tasklist, todo_read, todo_write.
- Planning: enterplanmode, exitplanmode.
- User interaction: askuserquestion, askfollowupquestion.
- Skill invocation: skill.
Unlike Environment and Service, Agent is scored on all five signals: Success (0.1), Speed (0.1), Weight (0.2), Relevance (0.2), and Necessity (0.4). Necessity carries the most weight because the most fundamental question about decision quality is whether each interaction should have happened at all. The Agent dimension also evaluates all interactions across every category - it judges the agent's choice to invoke environment and service tools, not just agent-category tools.
A low Agent score means the agent is making poor decisions: unnecessary tool calls, bloated requests, irrelevant file reads, or unfocused reasoning.
Some interactions span categories. For example, running curl via bash is both an environment interaction (shell command) and a service interaction (network call). Environment tools that target agent-internal paths (like .claude/) are reclassified as agent interactions.
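A minimal sketch of this classification logic, with abridged pattern sets drawn from the lists above. The function name and set contents are illustrative; the real matching rules may differ in detail:

```ts
type Category = "environment" | "service" | "agent";

// Abridged pattern sets based on the tool lists above (hypothetical subset).
const ENVIRONMENT_TOOLS = new Set([
  "bash", "shell", "read", "write", "grep", "git", "npm", "make",
]);
const AGENT_TOOLS = new Set([
  "toolsearch", "taskcreate", "enterplanmode", "askuserquestion", "skill",
]);

function classify(toolName: string, targetPath?: string): Category {
  const name = toolName.toLowerCase();
  // Environment tools that target agent-internal paths are reclassified.
  if (ENVIRONMENT_TOOLS.has(name) && targetPath?.includes(".claude/")) {
    return "agent";
  }
  if (AGENT_TOOLS.has(name)) return "agent";
  if (ENVIRONMENT_TOOLS.has(name)) return "environment";
  // Anything matching neither pattern is treated as a service interaction.
  return "service";
}
```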
How Interactions Are Evaluated
Environment, Service, and Agent are all scored by analyzing tool interactions in the agent's transcript. Each interaction is classified into a category, then evaluated on a set of signals.
A key design decision in AXIS is separating execution quality from decision quality. When an agent runs npm install and it takes 10 seconds, there are two distinct questions:
- Was the execution good? Did npm succeed? Was it fast?
- Was the decision good? Was running npm necessary? Was it the right call?
Environment and Service answer the first question. Agent answers the second. This separation ensures scores are actionable: a low Environment score means your tools are failing or slow, while a low Agent score means the agent is making poor choices. Without this separation, a low score would be ambiguous - is the filesystem slow, or is the agent reading too many files?
Interaction Signals
Every tool interaction is evaluated on five signals.
| Signal | Method | What It Measures |
|---|---|---|
| Success | Judge | Did the interaction complete without errors? Were the results usable? |
| Speed | Heuristic | How long did the interaction take relative to expectations for its category? |
| Weight | Judge | Was the tool invocation right-sized? Did the agent request more or less than needed? |
| Relevance | Judge | Was the tool output relevant and useful for completing the task? |
| Necessity | Judge | Were the interactions in this category actually needed, or were they avoidable? |
Judge signals are evaluated by an LLM that reads the full content of each tool call and its result. Heuristic signals are computed deterministically from measured values like duration, with no LLM involved. Speed is always heuristic-based because duration is an objective measurement that does not benefit from LLM judgment.
Signal Weights by Dimension
Not all signals apply equally to every dimension. Environment and Service only evaluate execution quality, so they use just two signals. Agent evaluates decision quality across all five.
| Signal | Environment | Service | Agent |
|---|---|---|---|
| Success | 0.7 | 0.7 | 0.1 |
| Speed | 0.3 | 0.3 | 0.1 |
| Weight | - | - | 0.2 |
| Relevance | - | - | 0.2 |
| Necessity | - | - | 0.4 |
Why these weights?
Environment and Service use only Success (0.7) and Speed (0.3). These dimensions measure how well the tools and services performed, not whether the agent should have called them. Success is weighted higher because a failed command or erroring API is a more severe problem than a slow one. Weight, Relevance, and Necessity are zero because those are questions about agent decision-making, which belongs in the Agent dimension.
Agent uses all five signals, with Necessity carrying the most weight at 0.4. Necessity answers the most fundamental question about decision quality: should this interaction have happened at all? An agent that makes fewer, more targeted interactions is fundamentally better than one that shotguns commands hoping something works. Weight (0.2) and Relevance (0.2) refine the picture -were the calls right-sized, and was the output actually useful? Success and Speed are low (0.1 each) for the Agent dimension because execution outcomes are already captured by Environment and Service.
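The per-dimension combination follows directly from the table above. In this sketch the type names are hypothetical, but the weights mirror the defaults:

```ts
// Per-signal scores for one dimension, each on a 0-100 scale.
interface SignalScores {
  success: number;   // judge-evaluated
  speed: number;     // heuristic, from measured duration
  weight: number;    // judge-evaluated
  relevance: number; // judge-evaluated
  necessity: number; // judge-evaluated
}

// Signal weights per dimension, mirroring the table above.
const SIGNAL_WEIGHTS: Record<"environment" | "service" | "agent", SignalScores> = {
  environment: { success: 0.7, speed: 0.3, weight: 0,   relevance: 0,   necessity: 0 },
  service:     { success: 0.7, speed: 0.3, weight: 0,   relevance: 0,   necessity: 0 },
  agent:       { success: 0.1, speed: 0.1, weight: 0.2, relevance: 0.2, necessity: 0.4 },
};

function dimensionScore(
  category: "environment" | "service" | "agent",
  s: SignalScores,
): number {
  const w = SIGNAL_WEIGHTS[category];
  return (
    s.success * w.success +
    s.speed * w.speed +
    s.weight * w.weight +
    s.relevance * w.relevance +
    s.necessity * w.necessity
  );
}
```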
Speed Thresholds
Speed scores are based on how long each interaction took. Thresholds vary by category because different types of operations have different expected durations. A file read that takes 2 seconds is slow; an API call that takes 2 seconds is normal.
| Category | Excellent | Good | Fair | Slow | Very Slow |
|---|---|---|---|---|---|
| Environment | ≤500ms | ≤2s | ≤5s | ≤10s | >10s |
| Service | ≤2s | ≤5s | ≤10s | ≤25s | >25s |
| Agent | ≤2s | ≤5s | ≤15s | ≤30s | >30s |
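A sketch of how a measured duration might map to a speed score under these thresholds. The band boundaries come from the table above, but the per-band scores are illustrative assumptions; unknown durations score 1.0, as described under Missing or invalid timestamps below:

```ts
// Threshold boundaries (ms) from the table above: Excellent, Good, Fair, Slow.
const SPEED_THRESHOLDS_MS = {
  environment: [500, 2_000, 5_000, 10_000],
  service:     [2_000, 5_000, 10_000, 25_000],
  agent:       [2_000, 5_000, 15_000, 30_000],
} as const;

// Example band scores on a 0-1 scale (assumed values, not AXIS's exact ones):
// Excellent, Good, Fair, Slow, Very Slow.
const BAND_SCORES = [1.0, 0.85, 0.6, 0.35, 0.1];

function speedScore(
  category: keyof typeof SPEED_THRESHOLDS_MS,
  durationMs?: number,
): number {
  // Unknown durations score 1.0: assume fast rather than penalize missing data.
  if (durationMs === undefined) return 1.0;
  const bands = SPEED_THRESHOLDS_MS[category];
  const band = bands.findIndex((limit) => durationMs <= limit);
  return band === -1 ? BAND_SCORES[4] : BAND_SCORES[band];
}
```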
How Speed Is Measured
Speed scores are computed deterministically from timestamps in the agent's output stream - no LLM judgment is involved. Each agent CLI (Claude Code, Codex, Gemini) emits timestamped transcript entries as it runs. AXIS uses these timestamps to measure how long each interaction took.
Tool interactions
For tool calls, the duration is the time between the tool_use entry and its paired tool_result entry:
durationMs = timestamp(tool_result) − timestamp(tool_use)
Tool pairing is deterministic. AXIS first matches entries by tool ID (when the agent provides one), then falls back to positional pairing - each unpaired tool_use is matched with the next unpaired tool_result in sequence. This means the measured duration includes everything that happens between the agent sending a tool request and receiving the result: SDK roundtrips, sandbox setup, process spawning, and the actual operation.
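A sketch of the two-pass pairing described above, using a hypothetical entry shape:

```ts
// Hypothetical transcript entry shape.
interface Entry {
  type: "tool_use" | "tool_result";
  toolId?: string; // present when the agent CLI provides one
  timestampMs: number;
}

// Pair each tool_use with its tool_result: match by tool ID first,
// then fall back to positional pairing in sequence order.
function pairTools(entries: Entry[]): Array<{ use: Entry; result: Entry }> {
  const uses = entries.filter((e) => e.type === "tool_use");
  const results = entries.filter((e) => e.type === "tool_result");
  const pairs: Array<{ use: Entry; result: Entry }> = [];
  const taken = new Set<Entry>();

  // Pass 1: deterministic matching by tool ID.
  for (const use of uses) {
    if (!use.toolId) continue;
    const match = results.find((r) => r.toolId === use.toolId && !taken.has(r));
    if (match) {
      pairs.push({ use, result: match });
      taken.add(match);
    }
  }

  // Pass 2: positional fallback - next unpaired result in sequence.
  const pairedUses = new Set(pairs.map((p) => p.use));
  for (const use of uses) {
    if (pairedUses.has(use)) continue;
    const next = results.find(
      (r) => !taken.has(r) && r.timestampMs >= use.timestampMs,
    );
    if (next) {
      pairs.push({ use, result: next });
      taken.add(next);
    }
  }
  return pairs;
}

// durationMs for each pair is then result.timestampMs - use.timestampMs.
```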
Agent thinking
Assistant (thinking/reasoning) entries do not have explicit start and end timestamps in the same way tool calls do. Instead, AXIS infers the duration from the gap between the start of the thinking block and the start of the next interaction:
durationMs = startMs(next interaction) − startMs(current thinking block)

Consecutive assistant entries are merged into a single interaction before this calculation. If a thinking block is the last interaction in the transcript, its duration remains unknown.
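A minimal sketch of this gap-based inference, assuming consecutive assistant entries have already been merged into single thinking blocks:

```ts
// Hypothetical interaction shape after merging consecutive assistant entries.
interface Interaction {
  kind: "thinking" | "tool";
  startMs: number;
  durationMs?: number;
}

// Infer each thinking block's duration from the gap to the next interaction.
function inferThinkingDurations(interactions: Interaction[]): void {
  for (let i = 0; i < interactions.length; i++) {
    const current = interactions[i];
    if (current.kind !== "thinking") continue;
    const next = interactions[i + 1];
    // The last interaction has no successor, so its duration stays unknown.
    current.durationMs = next ? next.startMs - current.startMs : undefined;
  }
}
```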
Missing or invalid timestamps
When timestamps are missing, empty, or invalid (not parseable as ISO 8601), the duration is recorded as unknown. Interactions with unknown durations receive a perfect speed score of 1.0 - AXIS assumes the interaction was fast rather than penalizing for missing data. This means speed scoring is conservative: it only lowers scores when it has concrete evidence of slow performance.
What the duration includes
The measured duration reflects wall-clock time as reported by the agent CLI, which includes:
- The actual operation (file read, API call, shell command, etc.)
- Agent framework overhead (SDK serialization, sandbox setup)
- Process spawning time for shell commands
- Network latency for API calls
The speed thresholds above are intentionally generous to account for this overhead. A file read scored as "Excellent" at ≤500ms leaves room for the agent framework to serialize the request, execute the read, and return the result.
Score Calibration
Raw signal scores are mapped to a 0 to 100 scale using a log-normal S-curve rather than linear scaling.
Why an S-curve?
Linear scaling would mean going from 80 to 90 is exactly as hard as going from 20 to 30. That does not match reality. Fixing obvious problems - broken commands, missing files, erroring APIs - is straightforward. But squeezing out the last few points of efficiency, eliminating every unnecessary interaction and optimizing every call, requires real sophistication. The S-curve reflects this: easy gains at the bottom, diminishing returns at the top.
- A score of 50 represents median performance for that dimension.
- Improving from 20 to 50 is relatively easy (fixing obvious problems).
- Improving from 80 to 95 requires significant quality gains.
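The exact curve parameters are not documented here, so the following is only a shape sketch: a log-normal CDF with its median pinned so that a median-quality raw signal maps to 50. Both the median and sigma values are assumptions:

```ts
// Shape sketch of log-normal S-curve calibration. Both parameters are
// assumptions: `median` is the raw value that should map to 50, and
// `sigma` controls how steep the curve is around the median.
function calibrate(raw: number, median = 0.8, sigma = 0.5): number {
  if (raw <= 0) return 0;
  // Log-normal CDF: Phi(ln(raw / median) / sigma), scaled to 0-100.
  const z = Math.log(raw / median) / sigma;
  return Math.round(100 * phi(z));
}

// Standard normal CDF via the Abramowitz-Stegun erf approximation.
function phi(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const erf =
    1 -
    (((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t -
      0.284496736) * t + 0.254829592) * t) * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// calibrate(0.8) => 50: the median raw value lands at the middle of the
// scale; gains below the median come cheap, gains above come slowly.
```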
Speed aggregation
Speed is aggregated using a severity-weighted average rather than a simple mean. Slow interactions pull the score down disproportionately rather than being hidden by many fast interactions. If you have 100 fast file reads and 1 API call that takes 30 seconds, a simple average would bury the problem. Severity weighting ensures that one slow call visibly impacts the score.
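One way to realize severity weighting is to grow each interaction's weight with its distance from a perfect score. This sketch illustrates the idea; it is not AXIS's exact formula, and the severity factor is an assumption:

```ts
// Illustration of severity-weighted averaging (not AXIS's exact formula).
// Each interaction's weight grows with its distance from a perfect score,
// so slow calls count for more than fast ones.
function aggregateSpeed(scores: number[], severity = 4): number {
  if (scores.length === 0) return 1.0;
  const weights = scores.map((s) => 1 + severity * (1 - s));
  const totalWeight = weights.reduce((a, b) => a + b, 0);
  const weightedSum = scores.reduce((sum, s, i) => sum + s * weights[i], 0);
  return weightedSum / totalWeight;
}

// 100 fast reads at 1.0 plus one slow call at 0.1: a simple mean stays
// near 0.99, while this weighting pulls the aggregate roughly three
// points lower on a 0-100 scale, keeping the slow call visible.
```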
Other signals (success, weight, relevance) are weighted by context size, so a failed API call that returns a large error response influences the score more than a trivial file read.
Interpreting Scores
| Range | Interpretation |
|---|---|
| 90 to 100 | Excellent. Agent completed the task efficiently with minimal waste. |
| 75 to 89 | Good. Task completed with minor inefficiencies or missed optimizations. |
| 50 to 74 | Fair. Notable issues in execution quality, speed, or unnecessary operations. |
| Below 50 | Poor. Significant failures, errors, or excessive waste in the execution. |
When a category score falls below 75, the CLI displays score insights that identify the weakest signal, helping you understand where the agent struggled.