TERM-Bench
Trustworthy Evaluation of Robotic Manipulation: A New Benchmark and AutoEval Methods

Motivation

Motivation: Binary vs Comprehensive Evaluation

Figure 1. The Crisis of Evaluation Credibility and the Proposed Trustworthy Evaluation Solution. (Top) The Crisis: We identify two critical sources of ambiguity obstructing trustworthy evaluation: Gap 1 (Ambiguity in Execution Quality), where binary metrics mask shaky or unsafe execution (visualized as "Jerky Success" vs. "Smooth Success"), and Gap 2 (Ambiguity in Source Authenticity), where the provenance of "successful" demonstrations is unverifiable. (Bottom) Proposed Solution: Our Trustworthy Evaluation Framework bridges these gaps. Powered by the Eval-Actions Benchmark and the AutoEval Architecture (depicted as the green robot optimized via Supervised Fine-Tuning (SFT)), the system achieves precise Fine-Grained Action Quality assessment (SRCC 0.84) and robust Source Authenticity verification (99.6% accuracy), as shown in the green box, significantly outperforming standard Vision-Language Models (VLMs) without SFT (red box) and thereby ensuring evaluation credibility.

Abstract

Driven by the rapid evolution of Vision-Action (VA) and Vision-Language-Action (VLA) models, imitation learning has significantly advanced robotic manipulation capabilities. However, evaluation methodologies have lagged behind, hindering the establishment of Trustworthy Evaluation for these behaviors. Current paradigms rely predominantly on binary success rates, failing to address the critical dimensions of trust: Source Authenticity (i.e., distinguishing genuine policy behaviors from human teleoperation) and Execution Quality (e.g., smoothness and safety). To bridge these gaps, we propose a comprehensive solution that combines the Eval-Actions benchmark and the AutoEval architecture. First, we construct the Eval-Actions benchmark to support trustworthiness analysis. Distinct from existing datasets restricted to successful human demonstrations, Eval-Actions innovatively integrates VA and VLA policy execution trajectories alongside human teleoperation data, explicitly including failure scenarios. This dataset is structured around three core supervision signals: Expert Grading (EG), Rank-Guided preferences (RG), and Chain-of-Thought (CoT). Building on this, we propose the AutoEval architecture: AutoEval Small (AutoEval-S) leverages Spatio-Temporal Aggregation for semantic assessment, augmented by an auxiliary Kinematic Calibration Signal to refine motion smoothness; AutoEval Plus (AutoEval-P) incorporates the Group Relative Policy Optimization (GRPO) paradigm to enhance logical reasoning capabilities. Experimental results demonstrate that AutoEval exhibits exceptional evaluation precision, achieving Spearman’s Rank Correlation Coefficients (SRCC) of 0.81 and 0.84 under the EG and RG protocols, respectively. Crucially, the framework possesses robust source discrimination capabilities, distinguishing between policy-generated and teleoperated videos with 99.6% accuracy, thereby establishing a rigorous standard for trustworthy robotic evaluation.

Benchmark Overview

Data Modalities

Figure 2. Overview of the Eval-Actions Benchmark. The figure visualizes the dataset structure: (Left) Task Diversity: Representative snapshots from our 150+ scenarios, covering both single-arm interactions (e.g., "Throw away trash") and complex bimanual coordination (e.g., "Tidy medicine box"). (Middle) Detailed Case Study: A specific instantiation of the "Throw away trash" task shown on the left. Crucially, each task encompasses diverse demonstration data ranging from high-quality successes to failure scenarios, exemplified here by contrasting smooth teleoperation with jerky policy behaviors. (Right) Data Composition: The stack enumerates the dense multimodal signals encapsulated within each episode. This includes raw sensory data (RGB, Depth), precise kinematic records (7/14-DoF Joint Trajectories), and the Fine-Grained Quality Radar Chart, which explicitly quantifies the four core dimensions (Success, Smoothness, Safety, Efficiency) to enable diagnostic assessment.
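To make the per-episode data composition above concrete, here is a minimal sketch of how one Eval-Actions episode could be represented in code. The class and field names (EvalActionsEpisode, rgb, depth, joints, quality) are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class EvalActionsEpisode:
    """Illustrative container for one Eval-Actions episode (field names are assumptions)."""
    task: str            # e.g., "Throw away trash"
    rgb: np.ndarray      # (T, H, W, 3) RGB frames
    depth: np.ndarray    # (T, H, W) depth maps
    joints: np.ndarray   # (T, 7) or (T, 14) joint trajectory (single- or dual-arm)
    source: str          # "human" (teleoperation) or "policy"
    success: bool        # binary task outcome
    quality: dict = field(default_factory=dict)  # radar-chart dimensions

# Hypothetical episode with the four fine-grained quality dimensions from Figure 2
episode = EvalActionsEpisode(
    task="Throw away trash",
    rgb=np.zeros((32, 224, 224, 3), dtype=np.uint8),
    depth=np.zeros((32, 224, 224), dtype=np.float32),
    joints=np.zeros((32, 7), dtype=np.float32),
    source="policy",
    success=True,
    quality={"success": 8.0, "smoothness": 6.5, "safety": 9.0, "efficiency": 7.0},
)
```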

The AutoEval Framework

AutoEval Framework

Figure 3. Overview of the proposed AutoEval framework. The system processes a robot manipulation video sequence (e.g., 32 frames) alongside kinematic prompts. Top (AutoEval-S). Designed for Expert Grading and Rank-Guided tasks, this branch employs a Spatio-Temporal Aggregation Strategy to compress high-frequency motion details into composite visual tokens. It generates structured text predictions; following format decomposition, the model is optimized via Supervised Fine-Tuning (SFT) using Cross-Entropy Loss. Bottom (AutoEval-P). Tailored for Chain-of-Thought (CoT) reasoning, this branch adopts the Group Relative Policy Optimization (GRPO) paradigm. The policy model generates multiple reasoning paths (containing <think> tokens), optimized against a hybrid reward function comprising content accuracy (\(r_{\mathrm{Content}}\)) and format constraints (\(r_{\mathrm{Format}}\)) to enhance physical reasoning capabilities.
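Below is a minimal sketch of the hybrid reward used in the AutoEval-P branch, assuming a simple regex check for the <think>/Final Score format and a linear score-distance term for content accuracy. The parsing rules, tolerance, and weighting are assumptions for illustration, not the paper's implementation.

```python
import re
import statistics

def format_reward(completion: str) -> float:
    """r_Format: 1.0 if the output contains a <think>...</think> block and a final score line."""
    has_think = bool(re.search(r"<think>.*?</think>", completion, flags=re.DOTALL))
    has_score = bool(re.search(r"Final Score:\s*\d+(\.\d+)?", completion))
    return 1.0 if (has_think and has_score) else 0.0

def content_reward(completion: str, gt_score: float, tol: float = 2.0) -> float:
    """r_Content: reward predictions whose final score is close to the expert grade (linear falloff; an assumption)."""
    m = re.search(r"Final Score:\s*(\d+(?:\.\d+)?)", completion)
    if m is None:
        return 0.0
    return max(0.0, 1.0 - abs(float(m.group(1)) - gt_score) / tol)

def hybrid_reward(completion: str, gt_score: float, w_content: float = 0.8) -> float:
    """Combine the content and format terms, as in the GRPO branch of Figure 3 (weights are assumptions)."""
    return w_content * content_reward(completion, gt_score) + (1.0 - w_content) * format_reward(completion)

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style normalization: advantages are rewards standardized within the sampled group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```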

Dataset Distribution

Distribution of Grades

Figure 4. Distribution of Expert Grading scores across the Eval-Actions Small subset.

The Eval-Actions Benchmark

Table 1. Comparison of Robotic Manipulation Datasets. Unlike training-centric datasets that maximize raw trajectory counts, Eval-Actions (Ours) maximizes annotation density, uniquely providing failure scenarios, hybrid trajectory sources, and Fine-Grained Quality Scoring for diagnostic assessment.

Quantitative Results

Table 2. Comparative Performance Analysis on the Eval-Actions Benchmark. We report results across three protocols: Expert Grading (EG), Rank-Guided (RG), and Chain-of-Thought (CoT). To quantify the domain gap, we include the zero-shot performance of representative VLMs without Supervised Fine-Tuning (w/o SFT). The near-zero correlations (e.g., SRCC ≈ 0.02) in these baselines highlight the necessity of our fine-tuning pipeline. Best results are highlighted in bold.
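Spearman's Rank Correlation Coefficient (SRCC), the metric reported above, can be computed directly with scipy; the expert grades and predicted scores below are hypothetical.

```python
from scipy.stats import spearmanr

# Hypothetical expert grades and model-predicted scores for a handful of episodes
expert_grades    = [2.0, 5.0, 9.0, 4.0, 7.5]
predicted_scores = [2.5, 4.5, 8.5, 4.0, 7.0]

srcc, p_value = spearmanr(expert_grades, predicted_scores)
print(f"SRCC = {srcc:.2f} (p = {p_value:.3f})")
```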

Qualitative Scoring Analysis

Examples of AutoEval scoring with Chain-of-Thought (CoT) reasoning.

Low Quality (Score: 2.0)

[CoT Analysis]
- Motion: Low average velocity and high joint velocity variance. Movements are hesitant and non-fluid.
- Result: Task failed. Only one bowl placed; third bowl untouched.

Final Score: 2.0
Success: False
Source: Policy

Medium Quality (Score: 5.0)

[CoT Analysis]
- Motion: Deliberate and smooth. Low velocity variance indicates controlled, non-jerky movements.
- Result: Task completed. Both bowls stacked without instability.

Final Score: 5.0
Success: True
Source: Policy

High Quality (Score: 9.0)

[CoT Analysis]
- Motion: Moderate velocity with optimal control. No unnecessary pauses or jitters.
- Result: Flawless execution. Objects placed safely with high efficiency.

Final Score: 9.0
Success: True
Source: Policy
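The structured verdicts in these examples (Final Score, Success, Source) lend themselves to straightforward post-hoc parsing. The sketch below uses the field labels shown above; the regex patterns and function name are illustrative assumptions.

```python
import re

def parse_verdict(text: str) -> dict:
    """Extract the structured fields emitted after the [CoT Analysis] block."""
    score = re.search(r"Final Score:\s*(\d+(?:\.\d+)?)", text)
    success = re.search(r"Success:\s*(True|False)", text)
    source = re.search(r"Source:\s*(\w+)", text)
    return {
        "score": float(score.group(1)) if score else None,
        "success": success.group(1) == "True" if success else None,
        "source": source.group(1) if source else None,
    }

example = "Final Score: 5.0\nSuccess: True\nSource: Policy"
print(parse_verdict(example))  # {'score': 5.0, 'success': True, 'source': 'Policy'}
```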

Authenticity Verification Analysis

Distinguishing between Human Teleoperation and Policy Execution based on kinematic patterns.

Human Teleoperation

[CoT Analysis]
- Kinematics: Motion exhibits natural variability and micro-adjustments typical of human biological control.
- Smoothness: High stability but with characteristic human reaction delays.

Predicted: Human Teleoperation
Ground Truth: Human

Policy Execution

[CoT Analysis]
- Kinematics: Motion is mathematically consistent with computed trajectories. Low-variance joint movements.
- Pattern: Deliberate, non-reactive control signatures detected.

Predicted: Learned Policy
Ground Truth: Policy
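The rationales above hinge on simple kinematic statistics such as velocity variance and micro-adjustments. The following sketch shows how such features could be computed from a (T, DoF) joint trajectory; the specific statistics and the reversal-counting heuristic are illustrative assumptions, not the AutoEval feature set.

```python
import numpy as np

def kinematic_features(joints: np.ndarray, dt: float = 1.0 / 30.0) -> dict:
    """Compute simple statistics over a (T, DoF) joint trajectory."""
    vel = np.diff(joints, axis=0) / dt   # joint velocities
    acc = np.diff(vel, axis=0) / dt      # joint accelerations
    jerk = np.diff(acc, axis=0) / dt     # jerk: a common smoothness proxy
    return {
        "mean_speed": float(np.mean(np.abs(vel))),
        "velocity_variance": float(np.var(vel)),
        "mean_abs_jerk": float(np.mean(np.abs(jerk))),
        # Micro-adjustments: count small direction reversals, typical of human teleoperation
        "direction_reversals": int(np.sum(np.diff(np.sign(vel), axis=0) != 0)),
    }

# Hypothetical usage on a 7-DoF trajectory of 32 timesteps
traj = np.cumsum(np.random.randn(32, 7) * 0.01, axis=0)
print(kinematic_features(traj))
```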

Policy Performance Ranking

Policy       Average Score    Rank
\(\pi_0\)    4.4702           1
ACT          3.6959           2
RDT          2.4304           3
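The ranking above follows from averaging per-episode scores for each policy; a minimal sketch with hypothetical per-episode values.

```python
# Hypothetical per-episode AutoEval scores grouped by the policy that produced them
episode_scores = {
    "pi_0": [5.0, 4.5, 3.9],
    "ACT":  [4.0, 3.5, 3.6],
    "RDT":  [2.5, 2.4, 2.4],
}

averages = {policy: sum(s) / len(s) for policy, s in episode_scores.items()}
ranking = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
for rank, (policy, avg) in enumerate(ranking, start=1):
    print(f"{rank}. {policy}: {avg:.4f}")
```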