REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly

DOI: https://doi.org/10.48436/0ewrv-8cb44

📋 Introduction

Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.

✨ Key Features

Multimodality: REASSEMBLE contains data from robot proprioception, RGB cameras, force-torque sensors, microphones, and event cameras.

Multitask labels: REASSEMBLE contains labels that enable research in temporal action segmentation, motion policy learning, anomaly detection, and task inversion.

Long horizon: Demonstrations in REASSEMBLE cover long-horizon tasks and actions that usually span multiple steps.

Hierarchical labels: REASSEMBLE contains action segmentation labels at two hierarchical levels.

🔴 Dataset Collection

Each demonstration starts by randomizing the board and object poses, after which an operator teleoperates the robot to assemble and disassemble the board while narrating their actions and marking task segment boundaries with key presses. The narrated descriptions are transcribed using Whisper [1]. The board and camera poses are measured at the beginning of each demonstration using a motion capture system; continuous tracking is avoided because it interferes with the event camera. Sensory data is recorded with rosbag and later post-processed into HDF5 files without downsampling or synchronization, preserving the raw data and timestamps for future flexibility. To reduce memory usage, video and audio are stored as encoded MP4 and MP3 files, respectively. Transcription errors are corrected automatically or manually, and a custom visualization tool is used to validate the synchronization and correctness of all data and annotations. Missing or incorrect entries are identified and corrected, ensuring the dataset's completeness. Low-level skill annotations were added manually after data collection, and all labels were carefully reviewed to ensure accuracy.

📑 Dataset Structure

The dataset consists of several HDF5 (.h5) and JSON (.json) files, organized into two directories. The poses directory contains the JSON files, which store the poses of the cameras and the board in the world coordinate frame. The data directory contains the HDF5 files, which store the sensory readings and annotations collected as part of the REASSEMBLE dataset. Each JSON file can be matched with its corresponding HDF5 file based on their filenames, which include the timestamp when the data was recorded. For example, 2025-01-09-13-59-54_poses.json corresponds to 2025-01-09-13-59-54.h5.

The structure of the JSON files is as follows:

{"Hama1": [ [x ,y, z], [qx, qy, qz, qw] ], "Hama2": [ [x ,y, z], [qx, qy, qz, qw] ], "DAVIS346": [ [x ,y, z], [qx, qy, qz, qw] ], "NIST_Board1": [ [x ,y, z], [qx, qy, qz, qw] ] }

[x, y, z] represent the position of the object, and [qx, qy, qz, qw] represent its orientation as a quaternion.
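As a minimal sketch (directory names and JSON keys as described above; Python standard library only), a pose file can be loaded and matched to its recording like this:

```python
# Sketch: load a pose file and locate the matching HDF5 recording via the
# shared timestamp in the filename. Directory names follow the description above.
import json
from pathlib import Path

poses_dir = Path("poses")
data_dir = Path("data")

pose_file = poses_dir / "2025-01-09-13-59-54_poses.json"
with open(pose_file) as f:
    poses = json.load(f)

# Each entry is [[x, y, z], [qx, qy, qz, qw]] in the world coordinate frame.
board_position, board_orientation = poses["NIST_Board1"]
print("Board position:", board_position)
print("Board orientation (quaternion):", board_orientation)

# The matching recording shares the timestamp prefix of the filename.
h5_file = data_dir / (pose_file.stem.replace("_poses", "") + ".h5")
print("Corresponding recording:", h5_file)
```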

The HDF5 (.h5) format organizes data into two main types of structures: datasets, which hold the actual data, and groups, which act like folders that can contain datasets or other groups. In the diagram below, groups are shown as folder icons and datasets as file icons. The main group of the file directly contains the video, audio, and event data. To save memory, video and audio are stored as encoded byte strings, while event data is stored as arrays. The robot's proprioceptive information is kept in the robot_state group as arrays. Because different sensors record data at different rates, the arrays vary in length (indicated by the N_xxx variables in the data shapes). To align the sensory data, each sensor's timestamps are stored separately in the timestamps group. Information about action segments is stored in the segments_info group. Each segment is saved as a subgroup, named according to its order in the demonstration, and includes a start timestamp, an end timestamp, a success indicator, and a natural language description of the action. Within each segment, low-level skills are organized under a low_level subgroup, following the same structure as the high-level annotations.

```
📁 .h5
├── 📄 hama1 - mp4 encoded video
├── 📄 hama1_audio - mp3 encoded audio
├── 📄 hama2 - mp4 encoded video
├── 📄 hama2_audio - mp3 encoded audio
├── 📄 hand - mp4 encoded video
├── 📄 hand_audio - mp3 encoded audio
├── 📄 capture_node - mp4 encoded video (event camera)
├── 📄 events - N_events x 3 (x, y, polarity)
├── 📁 robot_state
│   ├── 📄 compensated_base_force - N_bf x 3 (x, y, z)
│   ├── 📄 compenseted_base_torque - N_bt x 3 (x, y, z)
│   ├── 📄 gripper_positions - N_grip x 2 (left, right)
│   ├── 📄 joint_efforts - N_je x 7 (one for each joint)
│   ├── 📄 joint_positions - N_jp x 7 (one for each joint)
│   ├── 📄 joint_velocities - N_jv x 7 (one for each joint)
│   ├── 📄 measured_force - N_mf x 3 (x, y, z)
│   ├── 📄 measured_torque - N_mt x 3 (x, y, z)
│   ├── 📄 pose - N_poses x 7 (x, y, z, qw, qx, qy, qz)
│   └── 📄 velocity - N_vels x 6 (x, y, z, ω, γ, θ)
├── 📁 timestamps
│   ├── 📄 hama1 - N_hama1 x 1
│   ├── 📄 hama2 - N_hama2 x 1
│   ├── 📄 hand - N_hand x 1
│   ├── 📄 capture_node - N_capture x 1
│   ├── 📄 events - N_events x 1
│   ├── 📄 compensated_base_force - N_bf x 1
│   ├── 📄 compenseted_base_torque - N_bt x 1
│   ├── 📄 gripper_positions - N_grip x 1
│   ├── 📄 joint_efforts - N_je x 1
│   ├── 📄 joint_positions - N_jp x 1
│   ├── 📄 joint_velocities - N_jv x 1
│   ├── 📄 measured_force - N_mf x 1
│   ├── 📄 measured_torque - N_mt x 1
│   ├── 📄 pose - N_poses x 1
│   └── 📄 velocity - N_vels x 1
└── 📁 segments_info
    ├── 📁 0
    │   ├── 📄 start - scalar
    │   ├── 📄 end - scalar
    │   ├── 📄 success - Boolean
    │   ├── 📄 text - string
    │   └── 📁 low_level
    │       ├── 📁 0
    │       │   ├── 📄 start - scalar
    │       │   ├── 📄 end - scalar
    │       │   ├── 📄 success - Boolean
    │       │   └── 📄 text - string
    │       └── 📁 1
    │           ⋮
    └── 📁 1
        ⋮
```
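For illustration, the layout above can be traversed with h5py as in the sketch below (group and dataset names follow the diagram; exact dtypes, and the capitalization of the low-level subgroup, should be verified against the files or the loader in the GitHub repository):

```python
# Sketch only: reads one recording with h5py and NumPy. Names follow the
# layout above; exact dtypes may differ, so treat this as a starting point.
import h5py
import numpy as np

with h5py.File("data/2025-01-09-13-59-54.h5", "r") as f:
    # Proprioceptive signals and their per-sensor timestamps.
    pose = f["robot_state/pose"][:]              # N_poses x 7 (x, y, z, qw, qx, qy, qz)
    pose_t = f["timestamps/pose"][:].ravel()
    force = f["robot_state/measured_force"][:]   # N_mf x 3 (x, y, z)
    force_t = f["timestamps/measured_force"][:].ravel()

    # The streams are stored unsynchronized; for each pose timestamp, take the
    # first force sample recorded at or after it (clipped at the last sample).
    idx = np.clip(np.searchsorted(force_t, pose_t), 0, len(force) - 1)
    force_at_pose_t = force[idx]

    # Encoded media are stored as byte strings; dump one stream to disk so any
    # standard video tool can decode it.
    with open("hama1.mp4", "wb") as out:
        out.write(bytes(f["hama1"][()]))

    # High-level segments and their nested low-level skills.
    segments = f["segments_info"]
    for name in sorted(segments.keys(), key=int):
        seg = segments[name]
        text = seg["text"][()]
        text = text.decode() if isinstance(text, bytes) else text
        print(name, seg["start"][()], seg["end"][()], bool(seg["success"][()]), text)
        # The low-level subgroup may be spelled "low_level" or "Low_level".
        low_key = "low_level" if "low_level" in seg else ("Low_level" if "Low_level" in seg else None)
        if low_key is not None:
            for skill in sorted(seg[low_key].keys(), key=int):
                skill_text = seg[low_key][skill]["text"][()]
                print("  ", skill, skill_text.decode() if isinstance(skill_text, bytes) else skill_text)
```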

The splits folder contains two text files that list the .h5 files used for the training and validation splits.
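A small helper along the following lines (a sketch; "splits/train.txt" is a placeholder name, use the actual files shipped in the splits folder) turns a split file into a list of recording paths:

```python
# Sketch: each split file lists one .h5 filename per line.
# "splits/train.txt" is a placeholder; use the actual split file names.
from pathlib import Path

def read_split(split_file: str, data_dir: str = "data") -> list[Path]:
    with open(split_file) as f:
        return [Path(data_dir) / line.strip() for line in f if line.strip()]

train_files = read_split("splits/train.txt")
print(len(train_files), "training recordings")
```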

📌 Important Resources

The project website contains more details about the REASSEMBLE dataset. The code for loading and visualizing the data is available in our GitHub repository.

📄 Project website: https://tuwien-asl.github.io/REASSEMBLE_page/
💻 Code: https://github.com/TUWIEN-ASL/REASSEMBLE

⚠️ File comments

The table below lists recordings with known issues. Issues typically correspond to missing data from one of the sensors.

| Recording | Issue |
| --- | --- |
| 2025-01-10-15-28-50.h5 | hand cam missing at beginning |
| 2025-01-10-16-17-40.h5 | missing hand cam |
| 2025-01-10-17-10-38.h5 | hand cam missing at beginning |
| 2025-01-10-17-54-09.h5 | no empty action at beginning |
| 2025-01-11-14-22-09.h5 | no empty action at beginning |
| 2025-01-11-14-45-48.h5 | F/T not valid for last action |
| 2025-01-11-15-27-19.h5 | F/T not valid for last action |
| 2025-01-11-15-35-08.h5 | F/T not valid for last action |
| 2025-01-13-11-16-17.h5 | gripper broke for last action |
| 2025-01-13-11-18-57.h5 | pose not available for last action |
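If a strictly clean subset is needed (for example, when valid force-torque readings are required for every action), the recordings listed above can simply be filtered out; a minimal sketch:

```python
# Sketch: exclude the recordings with known issues listed in the table above.
from pathlib import Path

KNOWN_ISSUES = {
    "2025-01-10-15-28-50.h5", "2025-01-10-16-17-40.h5", "2025-01-10-17-10-38.h5",
    "2025-01-10-17-54-09.h5", "2025-01-11-14-22-09.h5", "2025-01-11-14-45-48.h5",
    "2025-01-11-15-27-19.h5", "2025-01-11-15-35-08.h5", "2025-01-13-11-16-17.h5",
    "2025-01-13-11-18-57.h5",
}

all_recordings = sorted(Path("data").glob("*.h5"))
clean_recordings = [p for p in all_recordings if p.name not in KNOWN_ISSUES]
print(len(clean_recordings), "recordings without known issues")
```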

Identifier
DOI: https://doi.org/10.48436/0ewrv-8cb44
Related Identifier (Cites): https://doi.org/10.15607/RSS.2024.XX.120
Related Identifier (Cites): https://doi.org/10.1109/ICRA57147.2024.10611615
Related Identifier (Cites): https://doi.org/10.1177/02783649241304789
Related Identifier (Cites): https://doi.org/10.1109/LRA.2024.3520916
Related Identifier (IsVersionOf): https://doi.org/10.48436/sn234-58p90
Metadata Access: https://researchdata.tuwien.ac.at/oai2d?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:researchdata.tuwien.ac.at:0ewrv-8cb44

Provenance
Creator: Sliwowski, Daniel Jan; Jadav, Shail; Stanovcic, Sergej; Orbik, Jędrzej; Heidersberger, Johannes; Lee, Dongheui
Publisher: TU Wien
Publication Year: 2025
Funding Reference: European Union (ROR: 019w4f821), award 101136067, INteractive robots that intuitiVely lEarn to inVErt tasks by ReaSoning about their Execution (INVERSE); Ministry of Trade, Industry and Energy (MOTIE) (ROR: 008nkqk13), award 00416440, Robot Industry Core Technology Development Program
Rights: Creative Commons Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0/legalcode)
Open Access: true
Contact: tudata(at)tuwien.ac.at

Representation
Language: English
Resource Type: Dataset
Version: 1.0.0
Discipline: Other