Analytics V2: Deeper Insight. Faster Debugging. Cleaner UI.
We listened. You shipped agents; we shipped better analytics. Our new Analytics V2 experience makes it radically easier to see how your agent actually performs, diagnose failures, and chase human-level (or better) efficiency across the open web.
TL;DR – What's New
- Episode Overview dashboard with success, reliability, runtime, step counts, and token usage at a glance
- Dual success signals: agent self-report vs independent LLM judge, with automatic mismatch + alignment callouts
- Reliability across episodes – estimate how robustly your agent solves a task across repeated runs
- Agent success heatmap of outcomes so you can spot fail clusters instantly
- Agent Efficiency vs Human Baseline – time + step deltas; see where you beat or lag humans
- Full token accounting (input + output) per task
- Frame-by-frame agent replay – model reasoning traces, executed actions, and screenshots for every step
- Insights tab – task-level score, detailed summary, error highlights, and fix suggestions
- Website / Use-case / Action-tag rollups to surface systematic weaknesses
- Issue Code tallies across large evals to spot agent failures
- Crash + error logs surfaced inline for faster debugging
1. Episode Overview: Fast Read on Each Eval NEW
When you open an eval run you land on a quick-read header showing:
- Success rates (agent self-reported + LLM evaluator)
- Reliability: how consistently the agent solves each task across 1+ episodes – a robustness signal for your current setup
- Total wall-clock time for the eval, plus average per task
- Average action steps per task
- Total tokens consumed across the run
Use this view to triage: high failure rate? exploding tokens? long runtimes? Jump straight to the problem slice.
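Under the hood, these header numbers are simple aggregates over per-task results. Here's a minimal Python sketch of the idea; the record fields (agent_success, judge_success, runtime_s, steps, tokens) are hypothetical names we made up for illustration, not the Analytics V2 export schema.

```python
# Minimal sketch: derive overview-style aggregates from per-task results.
# Field names and numbers are illustrative, not the platform's real schema.
from statistics import mean

tasks = [
    {"agent_success": True,  "judge_success": True,  "runtime_s": 41.2, "steps": 12, "tokens": 18_400},
    {"agent_success": True,  "judge_success": False, "runtime_s": 77.9, "steps": 23, "tokens": 35_100},
    {"agent_success": False, "judge_success": False, "runtime_s": 60.3, "steps": 19, "tokens": 27_800},
]

overview = {
    "agent_success_rate": mean(t["agent_success"] for t in tasks),
    "judge_success_rate": mean(t["judge_success"] for t in tasks),
    "total_runtime_s":    sum(t["runtime_s"] for t in tasks),
    "avg_runtime_s":      mean(t["runtime_s"] for t in tasks),
    "avg_steps":          mean(t["steps"] for t in tasks),
    "total_tokens":       sum(t["tokens"] for t in tasks),
}
print(overview)
```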

2. Success Analysis: Trust But Verify
Agents misreport. We show you when.
For every task and across the full eval we log:
- Agent-reported success
- Independent LLM evaluator score (continuously improved)
- Alignment ratio – how often the agent's claim matches the judge
- Automatic mismatch surfacing so you can audit inflated success metrics
This helps you avoid shipping an agent that "thinks" it won but actually failed the user.
Success analysis table showing agent self-reported vs LLM judge scoring for browser agent evaluation and AI benchmarking
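If you want to sanity-check the alignment idea against your own logs, here's a rough sketch. The field names (agent_claims_success, judge_says_success) are assumptions for illustration, not our export format.

```python
# Rough sketch: alignment ratio and inflated-success surfacing.
# Field names are illustrative assumptions, not the Analytics V2 schema.

def alignment_report(results):
    """results: list of dicts with boolean 'agent_claims_success'
    and 'judge_says_success' keys (hypothetical names)."""
    mismatches = [r for r in results
                  if r["agent_claims_success"] != r["judge_says_success"]]
    alignment_ratio = 1 - len(mismatches) / len(results)
    # Inflated-success cases: the agent claims a win the judge rejects.
    inflated = [r for r in mismatches if r["agent_claims_success"]]
    return alignment_ratio, inflated

ratio, inflated = alignment_report([
    {"agent_claims_success": True, "judge_says_success": True},
    {"agent_claims_success": True, "judge_says_success": False},  # inflated claim
])
print(f"alignment: {ratio:.0%}, inflated claims: {len(inflated)}")
```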
3. Agent Reliability Metrics Across Episodes NEW
One clean run doesn't mean you're production-ready. Agent reliability metrics aggregate task outcomes across multiple episodes so you can see which tasks your agent handles consistently and which it only solves by luck. Use it to:
- Flag flaky behaviors
- Estimate expected success in production pipelines
- Compare new builds over time
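Conceptually, per-task reliability is just the fraction of episodes the agent solved, and one natural consistency score for a binary outcome (like the variance shown in the next section) is p * (1 - p), which is 0 exactly when every run ends the same way. Here's a minimal sketch assuming you record one boolean outcome per episode per task; the task names and the p * (1 - p) formula are our illustration, not necessarily the exact formula the platform uses.

```python
# Minimal sketch: per-task reliability and a simple flakiness/variance score.
# Assumes one boolean outcome per episode per task; names are illustrative.
episodes = {
    "book_flight":   [True, True, True, True],    # consistently solved
    "compare_plans": [True, False, True, False],  # flaky
}

for task, outcomes in episodes.items():
    p = sum(outcomes) / len(outcomes)   # reliability: fraction of successful episodes
    variance = p * (1 - p)              # 0 only when every run had the same outcome
    print(f"{task}: reliability={p:.2f}, variance={variance:.2f}")
```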
4. Agent Success Heatmap NEW
Scan a colored grid of outcomes (green = success, red = fail, yellow = mismatch) to see every task outcome across every episode at a single glance. It's the quickest way to catch brittle sites and fail clusters.
Reliability variance. Next to each task we display a variance score that shows how steady (or shaky) your agent is. A variance of 0 means the agent produced the same outcome on that task in every run, whether perfect or hopeless; anything above 0 flags flaky behavior that needs investigation.

5. Agent Efficiency vs Human Baseline Analytics
We maintain human baseline analytics for a growing set of tasks. Analytics V2 overlays your agent's performance against those baselines so you can answer:
- Is the agent slower or faster than a human?
- Does it thrash (extra steps) even when it eventually succeeds?
- Where are the low-hanging optimization wins?
Because many teams care less about raw success than cost (time, tokens, steps), this view helps you tune loops, caching, tool calls, and stop conditions.
We also expose agent token usage per task (input + output) so you can track model spend drivers.
Agent efficiency table comparing browser agent performance against human baseline analytics with token usage tracking
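For a back-of-the-envelope version of the comparison, here's a sketch. The baseline numbers and field names below are made up for illustration; Analytics V2 maintains the real human baselines for you.

```python
# Sketch: time/step deltas vs a human baseline, plus per-task token totals.
# Baseline values and field names are invented for illustration only.
human_baseline = {"book_flight": {"seconds": 95, "steps": 14}}
agent_run      = {"book_flight": {"seconds": 142, "steps": 31,
                                  "input_tokens": 21_300, "output_tokens": 2_900}}

for task, human in human_baseline.items():
    agent = agent_run[task]
    time_delta = agent["seconds"] - human["seconds"]   # > 0: slower than a human
    step_delta = agent["steps"] - human["steps"]       # > 0: possible thrashing
    total_tokens = agent["input_tokens"] + agent["output_tokens"]
    print(f"{task}: {time_delta:+}s vs human, {step_delta:+} steps, {total_tokens} tokens")
```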
6. Quick-Read Charts NEW
Prefer pictures over tables? We surface compact charts that summarize efficiency gaps and outliers without diving into detailed logs.

7. Agent Replay: Screenshots + Reasoning Traces
Deep debugging is now built in. For every task you can replay the full episode step by step:
- Model reasoning text
- Executed agent actions
- Screenshot for each step
No more spelunking in raw JSON or hunting log IDs.
Agent replay viewer showing step-by-step browser agent debugging with reasoning traces, actions, and screenshots
8. Website, Use-case & Action-Tag Breakdowns NEW
Slice web agent performance by:
- Top websites in your eval set
- Functional buckets (shopping, search, travel, etc.)
- Action-oriented tags (login, extraction, research, filtering, comparison, etc.)
These cuts reveal whether your agent is strong at browsing but weak at comparison flows, or nails research tasks but stalls on filtering.
Browser automation analytics showing website performance breakdowns, use-case analysis, and action-tag categorization for web agent evaluation
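Conceptually, these rollups are grouped success rates. Here's a rough sketch with hypothetical websites, tags, and field names; the platform computes the real cuts for you.

```python
# Rough sketch: per-tag success rollup. Tags, sites, and field names are
# hypothetical; Analytics V2 provides these breakdowns out of the box.
from collections import defaultdict

results = [
    {"website": "amazon.com", "tags": ["shopping", "filtering"],  "judge_success": False},
    {"website": "kayak.com",  "tags": ["travel", "comparison"],   "judge_success": True},
    {"website": "amazon.com", "tags": ["shopping", "extraction"], "judge_success": True},
]

by_tag = defaultdict(lambda: [0, 0])   # tag -> [successes, total]
for r in results:
    for tag in r["tags"]:
        by_tag[tag][0] += r["judge_success"]
        by_tag[tag][1] += 1

for tag, (wins, total) in sorted(by_tag.items()):
    print(f"{tag}: {wins}/{total} solved")
```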
9. Task Summaries NEW
Each task now comes with:
- Final agent response
- LLM-generated score
- Detailed reasoning summary that explains the score and what went wrong
- Targeted improvement tips tied to observed issue codes
Great for fast triage across hundreds or thousands of tasks.
LLM debugging tools showing task summaries with detailed reasoning, scores, and improvement suggestions for AI agent evaluation
10. Issue Codes NEW
We auto-assign proprietary issue codes at the task level (navigation error, bad extraction, wrong form value, timeout, and more). The rollup view shows which problems dominate large evals so teams can prioritize fixes that move the metric.
Issue code analysis showing categorized agent failures and error patterns for systematic debugging and improvement prioritization
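The rollup itself is a simple tally. Here's a minimal sketch with made-up code strings; the real issue codes are proprietary and assigned automatically by our evaluator.

```python
# Minimal sketch: tally issue codes across an eval to see which failure
# modes dominate. The code strings here are illustrative placeholders.
from collections import Counter

issue_codes = ["NAVIGATION_ERROR", "TIMEOUT", "BAD_EXTRACTION",
               "NAVIGATION_ERROR", "WRONG_FORM_VALUE", "NAVIGATION_ERROR"]

for code, count in Counter(issue_codes).most_common():
    print(f"{code}: {count}")
```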
11. Crash + Error Logs NEW
If the agent crashes, you see it. Error logs surface alongside the affected tasks so engineers can jump straight to the root cause.
Get Early Access & Integrate Your Agent
Want to see your agent (or model + tool stack) in Analytics V2? We support running evaluations on your private agent or any of the open‑source agents already integrated on the AI evaluation platform. Reach out if you'd like to:
- Onboard your agent for private eval
- Benchmark head-to-head against other agents
- Collaborate on new task sets or vertical benchmarks
Looking for early access or integration support?
Email us at info@paradigm-shift.ai or book a consultation.