REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly
📖 Introduction
Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.
✨ Key Features
Multimodality: REASSEMBLE contains data from robot proprioception, RGB cameras, force-torque sensors, microphones, and event cameras.
Multitask labels: REASSEMBLE provides labels that enable research in Temporal Action Segmentation, Motion Policy Learning, Anomaly Detection, and Task Inversion.
Long horizon: Demonstrations in the REASSEMBLE dataset cover long-horizon tasks whose actions usually span multiple steps.
Hierarchical labels: REASSEMBLE contains action segmentation labels at two hierarchical levels.
🔴 Dataset Collection
Each demonstration starts by randomizing the board and object poses, after which an operator teleoperates the robot to assemble and disassemble the board while narrating their actions and marking task segment boundaries with key presses. The narrated descriptions are transcribed using Whisper [1], and the board and camera poses are measured at the beginning using a motion capture system, though continuous tracking is avoided due to interference with the event camera. Sensory data is recorded with rosbag and later post-processed into HDF5 files without downsampling or synchronization, preserving raw data and timestamps for future flexibility. To reduce memory usage, video and audio are stored as encoded MP4 and MP3 files, respectively. Transcription errors are corrected automatically or manually, and a custom visualization tool is used to validate the synchronization and correctness of all data and annotations. Missing or incorrect entries are identified and corrected, ensuring the dataset's completeness. Low-level skill annotations were added manually after data collection, and all labels were carefully reviewed to ensure accuracy.
📁 Dataset Structure
The dataset consists of several HDF5 (.h5) and JSON (.json) files, organized into two directories. The poses directory contains the JSON files, which store the poses of the cameras and the board in the world coordinate frame. The data directory contains the HDF5 files, which store the sensory readings and annotations collected as part of the REASSEMBLE dataset. Each JSON file can be matched with its corresponding HDF5 file based on their filenames, which include the timestamp when the data was recorded. For example, 2025-01-09-13-59-54_poses.json corresponds to 2025-01-09-13-59-54.h5.
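For example, the pairing can be recovered from the shared timestamp stem. Below is a minimal Python sketch, assuming the two directories are simply named `data` and `poses`:

```python
from pathlib import Path

data_dir = Path("data")    # HDF5 recordings
poses_dir = Path("poses")  # camera and board poses

# Pair each recording with its pose file via the shared timestamp stem,
# e.g. data/2025-01-09-13-59-54.h5 <-> poses/2025-01-09-13-59-54_poses.json
pairs = {
    h5_path: poses_dir / f"{h5_path.stem}_poses.json"
    for h5_path in sorted(data_dir.glob("*.h5"))
}

for h5_path, json_path in pairs.items():
    if not json_path.exists():
        print(f"no pose file found for {h5_path.name}")
```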
The structure of the JSON files is as follows:
{"Hama1": [
[x ,y, z],
[qx, qy, qz, qw]
],
"Hama2": [
[x ,y, z],
[qx, qy, qz, qw]
],
"DAVIS346": [
[x ,y, z],
[qx, qy, qz, qw]
],
"NIST_Board1": [
[x ,y, z],
[qx, qy, qz, qw]
]
}
[x, y, z] represent the position of the object, and [qx, qy, qz, qw] represent its orientation as a quaternion.
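The sketch below loads one pose file and converts each entry into a homogeneous 4x4 transform, assuming SciPy is available (note that `Rotation.from_quat` expects the same (qx, qy, qz, qw) order used here):

```python
import json
import numpy as np
from scipy.spatial.transform import Rotation

with open("2025-01-09-13-59-54_poses.json") as f:
    poses = json.load(f)

def to_homogeneous(entry):
    """Convert a [position, quaternion] pair into a 4x4 world-frame transform."""
    (x, y, z), (qx, qy, qz, qw) = entry
    T = np.eye(4)
    T[:3, :3] = Rotation.from_quat([qx, qy, qz, qw]).as_matrix()
    T[:3, 3] = [x, y, z]
    return T

T_world_board = to_homogeneous(poses["NIST_Board1"])
T_world_cam = to_homogeneous(poses["DAVIS346"])

# Example: express the board pose in the event-camera frame.
T_cam_board = np.linalg.inv(T_world_cam) @ T_world_board
```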
The HDF5 (.h5) format organizes data into two main types of structures: datasets, which hold the actual data, and groups, which act like folders that can contain datasets or other groups. In the diagram below, groups are shown as folder icons (📁) and datasets as file icons (📄). The main group of the file directly contains the video, audio, and event data. To save memory, video and audio are stored as encoded byte strings, while event data is stored as arrays. The robot's proprioceptive information is kept in the robot_state group as arrays. Because different sensors record data at different rates, the arrays vary in length (signified by the N_xxx variable in the data shapes). To align the sensory data, each sensor's timestamps are stored separately in the timestamps group. Information about action segments is stored in the segments_info group. Each segment is saved as a subgroup, named according to its order in the demonstration, and includes a start timestamp, end timestamp, a success indicator, and a natural language description of the action. Within each segment, low-level skills are organized under a low_level subgroup, following the same structure as the high-level annotations.

```
📁 .h5
├── 📄 hama1 - mp4 encoded video
├── 📄 hama1_audio - mp3 encoded audio
├── 📄 hama2 - mp4 encoded video
├── 📄 hama2_audio - mp3 encoded audio
├── 📄 hand - mp4 encoded video
├── 📄 hand_audio - mp3 encoded audio
├── 📄 capture_node - mp4 encoded video (event camera)
├── 📄 events - N_events x 3 (x, y, polarity)
├── 📁 robot_state
│   ├── 📄 compensated_base_force - N_bf x 3 (x, y, z)
│   ├── 📄 compenseted_base_torque - N_bt x 3 (x, y, z)
│   ├── 📄 gripper_positions - N_grip x 2 (left, right)
│   ├── 📄 joint_efforts - N_je x 7 (one for each joint)
│   ├── 📄 joint_positions - N_jp x 7 (one for each joint)
│   ├── 📄 joint_velocities - N_jv x 7 (one for each joint)
│   ├── 📄 measured_force - N_mf x 3 (x, y, z)
│   ├── 📄 measured_torque - N_mt x 3 (x, y, z)
│   ├── 📄 pose - N_poses x 7 (x, y, z, qw, qx, qy, qz)
│   └── 📄 velocity - N_vels x 6 (x, y, z, φ, γ, θ)
├── 📁 timestamps
│   ├── 📄 hama1 - N_hama1 x 1
│   ├── 📄 hama2 - N_hama2 x 1
│   ├── 📄 hand - N_hand x 1
│   ├── 📄 capture_node - N_capture x 1
│   ├── 📄 events - N_events x 1
│   ├── 📄 compensated_base_force - N_bf x 1
│   ├── 📄 compenseted_base_torque - N_bt x 1
│   ├── 📄 gripper_positions - N_grip x 1
│   ├── 📄 joint_efforts - N_je x 1
│   ├── 📄 joint_positions - N_jp x 1
│   ├── 📄 joint_velocities - N_jv x 1
│   ├── 📄 measured_force - N_mf x 1
│   ├── 📄 measured_torque - N_mt x 1
│   ├── 📄 pose - N_poses x 1
│   └── 📄 velocity - N_vels x 1
└── 📁 segments_info
    ├── 📁 0
    │   ├── 📄 start - scalar
    │   ├── 📄 end - scalar
    │   ├── 📄 success - Boolean
    │   ├── 📄 text - scalar
    │   └── 📁 low_level
    │       ├── 📁 0
    │       │   ├── 📄 start - scalar
    │       │   ├── 📄 end - scalar
    │       │   ├── 📄 success - Boolean
    │       │   └── 📄 text - scalar
    │       ├── 📁 1
    │       │   ⋮
    ├── 📁 1
    │   ⋮
```
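Below is a minimal `h5py` sketch of reading one recording, assuming the group and dataset names shown in the tree above; the encoded video is simply written back to disk, where any standard MP4 decoder can open it:

```python
import h5py
import numpy as np

with h5py.File("2025-01-09-13-59-54.h5", "r") as f:
    # Proprioceptive arrays and their per-sensor timestamps (lengths differ).
    force = f["robot_state/measured_force"][:]    # N_mf x 3
    t_force = f["timestamps/measured_force"][:]   # N_mf x 1
    pose = f["robot_state/pose"][:]               # N_poses x 7
    t_pose = f["timestamps/pose"][:]              # N_poses x 1

    # Videos are stored as encoded MP4 byte strings.
    with open("hama1.mp4", "wb") as vid:
        vid.write(f["hama1"][()])

    # Crude alignment: for each pose timestamp, take the closest preceding
    # force sample (interpolation may be preferable for learning pipelines).
    idx = np.searchsorted(t_force.squeeze(), t_pose.squeeze(), side="right") - 1
    force_at_pose = force[idx.clip(0, len(force) - 1)]

    # Walk the high-level segments and their low-level skills.
    for name, seg in f["segments_info"].items():
        print(name, seg["start"][()], seg["end"][()],
              bool(seg["success"][()]), seg["text"][()])
        for skill_name, skill in seg["low_level"].items():
            print("  ", skill_name, skill["text"][()])
```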
The splits folder contains two text files listing the .h5 files used for the training and validation splits.
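A sketch of loading the splits (the file names here are illustrative; use whatever the two text files in the splits folder are actually called):

```python
from pathlib import Path

# Each line of a split file names one .h5 recording.
train_files = Path("splits/train.txt").read_text().splitlines()
val_files = Path("splits/val.txt").read_text().splitlines()
print(f"{len(train_files)} training / {len(val_files)} validation recordings")
```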
🔗 Important Resources
The project website contains more details about the REASSEMBLE dataset. The code for loading and visualizing the data is available in our GitHub repository.
🌐 Project website: https://tuwien-asl.github.io/REASSEMBLE_page/
💻 Code: https://github.com/TUWIEN-ASL/REASSEMBLE
⚠️ File comments
The table below lists the recordings that have known issues. Issues typically correspond to missing data from one of the sensors.
| Recording | Issue |
|---|---|
| 2025-01-10-15-28-50.h5 | hand cam missing at beginning |
| 2025-01-10-16-17-40.h5 | missing hand cam |
| 2025-01-10-17-10-38.h5 | hand cam missing at beginning |
| 2025-01-10-17-54-09.h5 | no empty action at beginning |
| 2025-01-11-14-22-09.h5 | no empty action at beginning |
| 2025-01-11-14-45-48.h5 | F/T not valid for last action |
| 2025-01-11-15-27-19.h5 | F/T not valid for last action |
| 2025-01-11-15-35-08.h5 | F/T not valid for last action |
| 2025-01-13-11-16-17.h5 | gripper broke for last action |
| 2025-01-13-11-18-57.h5 | pose not available for last action |
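If an experiment requires complete sensor streams, the flagged recordings above can simply be excluded when building a file list; a minimal sketch:

```python
from pathlib import Path

# Recordings with known sensor issues (from the table above).
FLAGGED = {
    "2025-01-10-15-28-50.h5",
    "2025-01-10-16-17-40.h5",
    "2025-01-10-17-10-38.h5",
    "2025-01-10-17-54-09.h5",
    "2025-01-11-14-22-09.h5",
    "2025-01-11-14-45-48.h5",
    "2025-01-11-15-27-19.h5",
    "2025-01-11-15-35-08.h5",
    "2025-01-13-11-16-17.h5",
    "2025-01-13-11-18-57.h5",
}

clean = [p for p in sorted(Path("data").glob("*.h5")) if p.name not in FLAGGED]
```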