Analytics V2: Deeper Insight. Faster Debugging. Cleaner UI.
We listened. You shipped agents; we shipped better analytics. Our new Analytics V2 experience makes it radically easier to see how your agent actually performs, diagnose failures, and chase human-level (or better) efficiency across the open web.
TL;DR – What's New
- Episode Overview dashboard with success, reliability, runtime, step counts, and token usage at a glance
- Dual success signals: agent self-report vs independent LLM judge, with automatic mismatch + alignment callouts
- Reliability across episodes – estimate how robustly your agent solves a task across repeated runs
- Agent success heatmap of outcomes so you can spot fail clusters instantly
- Agent Efficiency vs Human Baseline – time + step deltas; see where you beat or lag humans
- Full token accounting (input + output) per task
- Frame-by-frame agent replay – model reasoning traces, executed actions, and screenshots for every step
- Insights tab – task-level score, detailed summary, error highlights, and fix suggestions
- Website / Use-case / Action-tag rollups to surface systematic weaknesses
- Issue Code tallies across large evals to spot agent failures
- Crash + error logs surfaced inline for faster debugging
1. Episode Overview: Fast Read on Each Eval NEW
When you open an eval run you land on a quick-read header showing:
- Success rates (agent self-reported + LLM evaluator)
- Reliability: how consistently the agent solves each task across 1+ episodes – a robustness signal for your current setup
- Total wall-clock time for the eval, plus average per task
- Average action steps per task
- Total tokens consumed across the run
Use this view to triage: high failure rate? exploding tokens? long runtimes? Jump straight to the problem slice.
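Under the hood, these header numbers are simple aggregates over per-task results. Here's a minimal Python sketch of the idea; the record fields (agent_success, judge_success, runtime_s, steps, tokens) are hypothetical names we made up for illustration, not the Analytics V2 export schema.

```python
# Minimal sketch: derive overview-style aggregates from per-task results.
# Field names and numbers are illustrative, not the platform's real schema.
from statistics import mean

tasks = [
    {"agent_success": True,  "judge_success": True,  "runtime_s": 41.2, "steps": 12, "tokens": 18_400},
    {"agent_success": True,  "judge_success": False, "runtime_s": 77.9, "steps": 23, "tokens": 35_100},
    {"agent_success": False, "judge_success": False, "runtime_s": 60.3, "steps": 19, "tokens": 27_800},
]

overview = {
    "agent_success_rate": mean(t["agent_success"] for t in tasks),
    "judge_success_rate": mean(t["judge_success"] for t in tasks),
    "total_runtime_s":    sum(t["runtime_s"] for t in tasks),
    "avg_runtime_s":      mean(t["runtime_s"] for t in tasks),
    "avg_steps":          mean(t["steps"] for t in tasks),
    "total_tokens":       sum(t["tokens"] for t in tasks),
}
print(overview)
```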

2. Success Analysis: Trust But Verify
Agents misreport. We show you when.
For every task and across the full eval we log:
- Agent-reported success
- Independent LLM evaluator score (continuously improved)
- Alignment ratio – how often the agent's claim matches the judge
- Automatic mismatch surfacing so you can audit inflated success metrics
This helps you avoid shipping an agent that "thinks" it won but actually failed the user.
Success analysis table showing agent self-reported vs LLM judge scoring for browser agent evaluation and AI benchmarking
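If you want to sanity-check the alignment idea against your own logs, here's a rough sketch. The field names (agent_claims_success, judge_says_success) are assumptions for illustration, not our export format.

```python
# Rough sketch: alignment ratio and inflated-success surfacing.
# Field names are illustrative assumptions, not the Analytics V2 schema.

def alignment_report(results):
    """results: list of dicts with boolean 'agent_claims_success'
    and 'judge_says_success' keys (hypothetical names)."""
    mismatches = [r for r in results
                  if r["agent_claims_success"] != r["judge_says_success"]]
    alignment_ratio = 1 - len(mismatches) / len(results)
    # Inflated-success cases: the agent claims a win the judge rejects.
    inflated = [r for r in mismatches if r["agent_claims_success"]]
    return alignment_ratio, inflated

ratio, inflated = alignment_report([
    {"agent_claims_success": True, "judge_says_success": True},
    {"agent_claims_success": True, "judge_says_success": False},  # inflated claim
])
print(f"alignment: {ratio:.0%}, inflated claims: {len(inflated)}")
```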
3. Agent Reliability Metrics Across Episodes NEW
One clean run doesn't mean you're production-ready. Agent reliability metrics aggregate task outcomes across multiple episodes so you can see which tasks your agent handles consistently and which it only solves by luck. Use it to:
- Flag flaky behaviors
- Estimate expected success in production pipelines
- Compare new builds over time
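Conceptually, per-task reliability is just the fraction of episodes the agent solved, and one natural consistency score for a binary outcome (like the variance shown in the next section) is p * (1 - p), which is 0 exactly when every run ends the same way. Here's a minimal sketch assuming you record one boolean outcome per episode per task; the task names and the p * (1 - p) formula are our illustration, not necessarily the exact formula the platform uses.

```python
# Minimal sketch: per-task reliability and a simple flakiness/variance score.
# Assumes one boolean outcome per episode per task; names are illustrative.
episodes = {
    "book_flight":   [True, True, True, True],    # consistently solved
    "compare_plans": [True, False, True, False],  # flaky
}

for task, outcomes in episodes.items():
    p = sum(outcomes) / len(outcomes)   # reliability: fraction of successful episodes
    variance = p * (1 - p)              # 0 only when every run had the same outcome
    print(f"{task}: reliability={p:.2f}, variance={variance:.2f}")
```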
4. Agent Success Heatmap NEW
Scan a colored grid of outcomes (green = success, red = fail, yellow = mismatch) to see every task outcome across every episode at a single glance. It's the quickest way to catch brittle sites and fail clusters.
Reliability variance. Next to each task we display a variance score that shows how steady (or shaky) your agent is. A variance of 0 means the agent produced the same outcome on that task in every run, whether perfect or hopeless; anything above 0 flags flaky behavior that needs investigation.

5. Agent Efficiency vs Human Baseline Analytics
We maintain human baseline analytics for a growing set of tasks. Analytics V2 overlays your agent's performance against those baselines so you can answer:
- Is the agent slower or faster than a human?
- Does it thrash (extra steps) even when it eventually succeeds?
- Where are the low-hanging optimization wins?
Because many teams care less about raw success than cost (time, tokens, steps), this view helps you tune loops, caching, tool calls, and stop conditions.
We also expose agent token usage per task (input + output) so you can track model spend drivers.
Agent efficiency table comparing browser agent performance against human baseline analytics with token usage tracking
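For a back-of-the-envelope version of the comparison, here's a sketch. The baseline numbers and field names below are made up for illustration; Analytics V2 maintains the real human baselines for you.

```python
# Sketch: time/step deltas vs a human baseline, plus per-task token totals.
# Baseline values and field names are invented for illustration only.
human_baseline = {"book_flight": {"seconds": 95, "steps": 14}}
agent_run      = {"book_flight": {"seconds": 142, "steps": 31,
                                  "input_tokens": 21_300, "output_tokens": 2_900}}

for task, human in human_baseline.items():
    agent = agent_run[task]
    time_delta = agent["seconds"] - human["seconds"]   # > 0: slower than a human
    step_delta = agent["steps"] - human["steps"]       # > 0: possible thrashing
    total_tokens = agent["input_tokens"] + agent["output_tokens"]
    print(f"{task}: {time_delta:+}s vs human, {step_delta:+} steps, {total_tokens} tokens")
```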
6. Quick-Read Charts NEW
Prefer pictures over tables? We surface compact charts that summarize efficiency gaps and outliers without diving into detailed logs.

7. Agent Replay: Screenshots + Reasoning Traces
Deep debugging is now built in. For every task you can replay the full episode step by step:
- Model reasoning text
- Executed agent actions
- Screenshot for each step
No more spelunking in raw JSON or hunting log IDs.
Agent replay viewer showing step-by-step browser agent debugging with reasoning traces, actions, and screenshots
8. Website, Use-case & Action-Tag Breakdowns NEW
Slice web agent performance by:
- Top websites in your eval set
- Functional buckets (shopping, search, travel, etc.)
- Action-oriented tags (login, extraction, research, filtering, comparison, etc.)
These cuts reveal whether your agent is strong at browsing but weak at comparison flows, or nails research tasks but stalls on filtering.
Browser automation analytics showing website performance breakdowns, use-case analysis, and action-tag categorization for web agent evaluation
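Conceptually, these rollups are grouped success rates. Here's a rough sketch with hypothetical websites, tags, and field names; the platform computes the real cuts for you.

```python
# Rough sketch: per-tag success rollup. Tags, sites, and field names are
# hypothetical; Analytics V2 provides these breakdowns out of the box.
from collections import defaultdict

results = [
    {"website": "amazon.com", "tags": ["shopping", "filtering"],  "judge_success": False},
    {"website": "kayak.com",  "tags": ["travel", "comparison"],   "judge_success": True},
    {"website": "amazon.com", "tags": ["shopping", "extraction"], "judge_success": True},
]

by_tag = defaultdict(lambda: [0, 0])   # tag -> [successes, total]
for r in results:
    for tag in r["tags"]:
        by_tag[tag][0] += r["judge_success"]
        by_tag[tag][1] += 1

for tag, (wins, total) in sorted(by_tag.items()):
    print(f"{tag}: {wins}/{total} solved")
```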
9. Task Summaries NEW
Each task now comes with:
- Final agent response
- LLM-generated score
- Detailed reasoning summary that explains the score and what went wrong
- Targeted improvement tips tied to observed issue codes
Great for fast triage across hundreds or thousands of tasks.
LLM debugging tools showing task summaries with detailed reasoning, scores, and improvement suggestions for AI agent evaluation
10. Issue Codes NEW
We auto-assign proprietary issue codes at the task level (navigation error, bad extraction, wrong form value, timeout, and more). The rollup view shows which problems dominate large evals so teams can prioritize fixes that move the metric.
Issue code analysis showing categorized agent failures and error patterns for systematic debugging and improvement prioritization
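The rollup itself is a simple tally. Here's a minimal sketch with made-up code strings; the real issue codes are proprietary and assigned automatically by our evaluator.

```python
# Minimal sketch: tally issue codes across an eval to see which failure
# modes dominate. The code strings here are illustrative placeholders.
from collections import Counter

issue_codes = ["NAVIGATION_ERROR", "TIMEOUT", "BAD_EXTRACTION",
               "NAVIGATION_ERROR", "WRONG_FORM_VALUE", "NAVIGATION_ERROR"]

for code, count in Counter(issue_codes).most_common():
    print(f"{code}: {count}")
```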
11. Crash + Error Logs NEW
If the agent crashes, you see it. Error logs surface alongside the affected tasks so engineers can jump straight to the root cause.
Get Early Access & Integrate Your Agent
Want to see your agent (or model + tool stack) in Analytics V2? We support running evaluations on your private agent or any of the open‑source agents already integrated on the AI evaluation platform. Reach out if you'd like to:
- Onboard your agent for private eval
- Benchmark head-to-head against other agents
- Collaborate on new task sets or vertical benchmarks
Looking for early access or integration support?
Email us at info@paradigm-shift.ai or book a consultation.