The CystoFold dataset is the first hand-labeled dataset for tissue fold formation
prediction in urinary bladder endoscopy. It provides sparse pixel-wise polygon and rectangle annotations capturing
the transition from smooth to folded bladder tissue across filling and emptying cycles, designed to
support deformation-aware reconstruction pipelines and SLAM-based surgical navigation systems.
A total of 873 annotated frames were selected from cystoscopic video recordings of
three patients undergoing routine cystoscopy. Annotations were produced using a custom
backward-tracking methodology: fold regions are first identified in the emptied
(folded) bladder state, then tracked back to their appearance in the distended state, with a pinned
reference view guiding the annotator. All annotations were created using a custom polygon-based GUI tool.
Important: Sparse Annotation Strategy
To ensure maximum ground-truth reliability, annotations cover only a small, localized area
of the visible tissue. Annotators were instructed to label only regions where the tissue state (Stable,
Pre-Fold, or Folded) could be identified with absolute certainty. This "safe-labeling" approach prevents
ambiguous tissue from being misclassified.
Semantic Classes
Stable: Tissue that remains flat and visible throughout the emptying cycle
Pre-Fold: Tissue appearing flat in the current frame that will fold before the cycle ends
Folded: Tissue currently forming a visible fold
Cycle Markers
Each video includes explicit markers for the bladder filling cycle. The marker filled
indicates the frame at which the bladder reached a fully filled state; the marker emptied
indicates the frame at which it was fully emptied. These markers are provided in the
_cycles.csv files and allow temporal alignment of annotations with the physiological
cycle phase. The per-frame filling context (Filling or Emptying) is additionally
stored in the context field of each annotation record in the JSON and CSV label files.
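As a rough illustration of how the markers relate to the cycle phase, the sketch below derives a phase label for an arbitrary frame from a list of markers. It assumes the markers have already been read from a _cycles.csv file into (frame_index, marker_name) pairs. The column layout of that file is not described here, so the pair format and the function name phase_at are hypothetical. It also assumes the bladder starts emptying right after a filled marker and starts filling right after an emptied marker. Where the stored context field is available, it should be preferred over this derivation.

```python
from bisect import bisect_right

def phase_at(frame, markers):
    """Return "Filling" or "Emptying" for a frame index.

    markers: list of (frame_index, name) pairs sorted by frame_index,
             where name is "filled" or "emptied" (assumed input format).
    """
    marker_frames = [f for f, _ in markers]
    i = bisect_right(marker_frames, frame)
    if i == 0:
        # Before the first marker the bladder is heading toward that state.
        return "Filling" if markers[0][1] == "filled" else "Emptying"
    # After a "filled" marker the bladder empties; after "emptied" it fills.
    return "Emptying" if markers[i - 1][1] == "filled" else "Filling"

# Made-up marker positions for illustration only.
example_markers = [(100, "filled"), (400, "emptied"), (700, "filled")]
print(phase_at(50, example_markers))   # Filling
print(phase_at(250, example_markers))  # Emptying
print(phase_at(500, example_markers))  # Filling
```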
Dataset Structure
patient_1/ (2 filling cycles, 586 usable frames)
  <video>.mp4 (Original unlabeled video)
  <video>.csv (Per-polygon annotation records)
  <video>_cycles.csv (Cycle phase marker timestamps)
  <video>.json (Full labeler session, restorable)
patient_2/ (1 filling cycle, 103 usable frames)
  ...
patient_3/ (1 filling cycle, 184 usable frames)
  ...
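Given this layout, the files belonging to one recording can be grouped by their shared <video> stem. The following is a minimal sketch that assumes the dataset has been downloaded to a local directory; the helper name find_recordings and the dictionary keys are illustrative and not part of the dataset.

```python
from pathlib import Path

def find_recordings(root):
    """Collect the sibling files of every video under patient_*/ folders."""
    recordings = []
    for video in sorted(Path(root).glob("patient_*/*.mp4")):
        stem = video.stem  # file name without the .mp4 suffix
        recordings.append({
            "patient": video.parent.name,
            "video": video,
            "labels_csv": video.with_name(stem + ".csv"),
            "cycles_csv": video.with_name(stem + "_cycles.csv"),
            "session_json": video.with_name(stem + ".json"),
        })
    return recordings
```

The function only builds the expected paths; it does not check that the CSV and JSON files exist, so callers may want to test each path with exists() before opening it.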
Usage
Each annotation record in the .csv file corresponds to a single labeled region in one frame.
The relevant columns are:
frame — video frame index (0-based)
label — class name: Stable (1), Pre-Fold (2), Folded (3); Background is implicitly 0
shape — annotation geometry: polygon (precise boundary, polygon_points field contains a JSON array of [x, y] pairs) or rect (bounding box, use x1, y1, x2, y2 columns)
use_frame — False for frames that should be excluded; keep only rows whose use_frame is not False before training
frame_quality — frame quality flag; exclude frames marked motion_blur
context — Filling or Emptying, indicating the bladder phase at that frame
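The exclusion rules for use_frame and frame_quality can be applied while reading the file. The sketch below uses only the standard csv module. It assumes that use_frame is serialized as the text False and that comparing case-insensitively is acceptable; the helper name load_usable_rows and the sample values are made up for illustration.

```python
import csv
import io

def load_usable_rows(csv_file):
    """Return annotation rows as dicts, skipping excluded and blurred frames."""
    rows = []
    for row in csv.DictReader(csv_file):
        # Drop rows explicitly flagged as unusable.
        if (row.get("use_frame") or "").strip().lower() == "false":
            continue
        # Drop rows from frames marked as motion-blurred.
        if (row.get("frame_quality") or "").strip().lower() == "motion_blur":
            continue
        rows.append(row)
    return rows

# Tiny in-memory stand-in for a <video>.csv file (values are made up).
sample = io.StringIO(
    "frame,label,shape,use_frame,frame_quality,context\n"
    "10,Stable,rect,True,,Emptying\n"
    "11,Folded,rect,False,,Emptying\n"
    "12,Pre-Fold,rect,True,motion_blur,Emptying\n"
)
rows = load_usable_rows(sample)
print([r["frame"] for r in rows])  # ['10']
```

For a real file, pass an open file handle instead of the StringIO object, for example open(path, newline="").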
To render a segmentation mask for a given frame, iterate over all rows matching that frame index,
draw each polygon or rectangle onto a blank (all-zero) canvas using the class index as the fill value, and use
the resulting canvas as the per-pixel label map. Pixels not covered by any annotation keep the background value 0.
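The rendering procedure above can be sketched without external dependencies as follows. This is an illustrative, unoptimized version: it tests pixel centers against each shape with the even-odd rule, lets later rows overwrite earlier ones where shapes overlap, and expects rows as dictionaries keyed by the column names listed above. These choices, and the names render_mask and point_in_polygon, are assumptions rather than the dataset authors' implementation. For full-resolution frames a library rasterizer would be considerably faster.

```python
import json

CLASS_INDEX = {"Stable": 1, "Pre-Fold": 2, "Folded": 3}  # Background stays 0

def point_in_polygon(px, py, points):
    """Even-odd rule test of a point against a list of [x, y] vertices."""
    inside = False
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge crosses the horizontal line y = py
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def render_mask(rows, frame, width, height):
    """Return a height x width list-of-lists label map for one frame."""
    mask = [[0] * width for _ in range(height)]
    for row in rows:
        if int(row["frame"]) != frame:
            continue
        value = CLASS_INDEX[row["label"]]
        if row["shape"] == "rect":
            xa, xb = sorted((float(row["x1"]), float(row["x2"])))
            ya, yb = sorted((float(row["y1"]), float(row["y2"])))
            points = [[xa, ya], [xb, ya], [xb, yb], [xa, yb]]
        else:
            points = json.loads(row["polygon_points"])
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        # Only visit pixels inside the shape's bounding box, clipped to the canvas.
        x_lo, x_hi = max(0, int(min(xs))), min(width - 1, int(max(xs)))
        y_lo, y_hi = max(0, int(min(ys))), min(height - 1, int(max(ys)))
        for y in range(y_lo, y_hi + 1):
            for x in range(x_lo, x_hi + 1):
                if point_in_polygon(x + 0.5, y + 0.5, points):
                    mask[y][x] = value
    return mask

# Made-up annotation rows on an 8 x 8 canvas, for illustration only.
demo_rows = [
    {"frame": "10", "label": "Folded", "shape": "rect",
     "x1": "2", "y1": "2", "x2": "5", "y2": "5"},
    {"frame": "10", "label": "Stable", "shape": "polygon",
     "polygon_points": "[[0, 0], [4, 0], [0, 4]]"},
    {"frame": "11", "label": "Pre-Fold", "shape": "rect",
     "x1": "0", "y1": "0", "x2": "8", "y2": "8"},
]
demo_mask = render_mask(demo_rows, frame=10, width=8, height=8)
```

In this example the pixel at row 3, column 3 receives the Folded index, the pixel at row 0, column 0 receives the Stable index, and the row for frame 11 is ignored.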