Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation

Abstract

Robot manipulation learning from human demonstrations offers a rapid means to acquire skills but often lacks generalization across diverse scenes and object placements. This limitation hinders real-world applications, particularly in complex tasks requiring dexterous manipulation. The Vision-Language-Action (VLA) paradigm leverages large-scale data to enhance generalization; however, due to data scarcity, its performance remains limited. In this work, we introduce Object-Focus Actor (OFA), a novel, data-efficient approach for generalized dexterous manipulation. OFA exploits the consistent end trajectories observed in dexterous manipulation tasks, allowing for efficient policy training. Our method employs a hierarchical pipeline: object perception and pose estimation, pre-manipulation pose arrival, and OFA policy execution. This process keeps the manipulation focused and efficient, even across varied backgrounds and positional layouts. Comprehensive real-world experiments across seven tasks demonstrate that OFA significantly outperforms baseline methods in both positional and background generalization tests. Notably, OFA achieves robust performance with only 10 demonstrations, highlighting its data efficiency.

Method


Figure 1: The overall structure of the proposed OFA consists of three main modules: 1) manipulated-object perception and pose estimation, 2) pre-manipulation pose arrival, and 3) object-focus policy learning.
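To make the three-stage pipeline concrete, the following is a minimal Python sketch of how the modules could compose at inference time. All names here (Pose, estimate_object_pose, compute_pre_manipulation_pose, run_object_focus_policy) are hypothetical placeholders rather than the authors' released interfaces, and the perception, motion-planning, and policy internals are stubbed out.

# Minimal sketch of the three-stage OFA pipeline described in Figure 1.
# All class and function names are illustrative placeholders, not the
# authors' actual implementation.

from dataclasses import dataclass
import numpy as np


@dataclass
class Pose:
    """6-DoF pose: position (xyz) and orientation (quaternion, wxyz)."""
    position: np.ndarray
    orientation: np.ndarray


def estimate_object_pose(rgb: np.ndarray, depth: np.ndarray) -> Pose:
    """Stage 1 (hypothetical): segment the manipulated object and estimate its pose.
    A real system would use a detector/segmenter plus a pose estimator;
    here a fixed placeholder pose is returned."""
    return Pose(position=np.array([0.4, 0.0, 0.1]),
                orientation=np.array([1.0, 0.0, 0.0, 0.0]))


def compute_pre_manipulation_pose(object_pose: Pose, offset: np.ndarray) -> Pose:
    """Stage 2 (hypothetical): derive a pre-manipulation end-effector pose from
    the object pose via a fixed relative offset; a motion planner (not shown)
    would then drive the arm to this pose."""
    return Pose(position=object_pose.position + offset,
                orientation=object_pose.orientation)


def run_object_focus_policy(obs: dict, horizon: int = 50) -> list:
    """Stage 3 (hypothetical): roll out the learned object-focus policy from the
    pre-manipulation pose. A real policy would consume object-centric (e.g.
    cropped or masked) observations; here zero actions are emitted as a stub."""
    return [np.zeros(22) for _ in range(horizon)]  # e.g. arm + dexterous-hand DoFs


if __name__ == "__main__":
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)     # placeholder camera frames
    depth = np.zeros((480, 640), dtype=np.float32)

    obj_pose = estimate_object_pose(rgb, depth)                                   # 1) perception + pose estimation
    pre_pose = compute_pre_manipulation_pose(obj_pose,
                                             offset=np.array([0.0, 0.0, 0.15]))  # 2) pre-manipulation pose arrival
    actions = run_object_focus_policy({"rgb": rgb, "object_pose": obj_pose})      # 3) object-focus policy execution
    print(f"Pre-manipulation position: {pre_pose.position}, rollout length: {len(actions)}")

In this sketch only stage 3 is learned; stages 1 and 2 localize the object and bring the hand to a consistent pre-manipulation pose, which is what allows the policy to focus on the object-relative end trajectory regardless of background or placement.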

Experiments

Task              ACT   OFA w/o rel-of   OFA w/o rel   OFA w/o of   OFA (object-mask)   OFA (hand-focus)
Grasp Cup          20        40               30            90              50                  90
Take Mug           10        30               20            40              10                  60
Hold Scanner       30        50               30            90              80                  90
Catch Loopy        40        40               70            90              90                  80
Pinch Toy          20        40               10            30              10                  40
Grasp Sanitizer    30        70               50            80             100                 100
Lift Tray          10        90               60            60              50                 100

Table 1: Success rate (%) of the comparison methods, each trained with 30 human demonstrations. Each result is computed over 10 evaluation trials.

Grasp Cup

Task description: The robot needs to use a dexterous hand to grasp the cup on the table.

Take Mug

Task description: The robot needs to use a dexterous hand to take the mug on the table.

Hold Scanner

Task description: The robot needs to use a dexterous hand to pick up a barcode scanner from a flat surface and hold it, preparing it for use.

Catch Loopy

Task description: The robot needs to use a dexterous hand to catch Loopy from the environment.

Pinch Toy

Task description: The robot needs to use a dexterous hand to pinch a small toy, maintaining a gentle but secure grip.

Grasp Sanitizer

Task description: The robot needs to use a dexterous hand to grasp a sanitizer and hold it in the air.

Lift Tray

Task description: The robot needs to use both dexterous hands to lift a tray, ensuring a stable hold while maintaining balance.

Position Generalization Experiments

Catch Loopy (Ours)

Task description: Testing OFA's positional generalization at 3 out-of-distribution (OOD) positions in the Catch Loopy task.

Catch Loopy (ACT)

Task description: Testing ACT's positional generalization at 3 OOD positions in the Catch Loopy task.

Hold Scanner (Ours)

Task description: Testing OFA's positional generalization at 3 OOD positions in the Hold Scanner task.

Hold Scanner (ACT)

Task description: Testing ACT's positional generalization at 3 OOD positions in the Hold Scanner task.

Background Generalization Experiments

Catch Loopy (Ours)

Task description: Testing OFA's background generalization with 3 different background levels in the Catch Loopy task.

Catch Loopy (ACT)

Task description: Testing ACT's background generalization with 3 different background levels in the Catch Loopy task.

Hold Scanner (Ours)

Task description: Testing OFA's background generalization with 3 different background levels in the Hold Scanner task.

Hold Scanner (ACT)

Task description: Testing ACT's background generalization with 3 different background levels in the Hold Scanner task.

BibTeX

@misc{li2025objectfocus,
    title={Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation},
    author={Yihang Li and Tianle Zhang and Xuelong Wei and Jiayi Li and Lin Zhao and Dongchi Huang and Zhirui Fang and Minhua Zheng and Wenjun Dai and Xiaodong He},
    year={2025},
    eprint={2505.15098},
    archivePrefix={arXiv},
    primaryClass={cs.RO}
}