Dataset overview
This dataset contains source code and annotation guidelines used in the PhD thesis:
“On-Premise Medical Information Extraction from German Doctor’s Letters under Clinical Constraints”
Repository structure
The dataset is split into five repositories:
Source code for Chapter 2.6 De-identification of German doctor’s letters
Source code for Chapter 5 Clinical Section Classification using Pretrained Language Models and Prompting
Source code for Chapter 6 Medication Information Extraction using Local Large Language Models
Source code for Chapter 7 Clinical Application: Medication Trends and Polypharmacy
Annotation guidelines for Chapters 2.6, 4, 5, and 7
CARDIO:DE
The main dataset used for experiments in Chapters 5, 6, and 7:
CARDIO:DE - https://doi.org/10.11588/DATA/AFYQDY
Additional datasets (not included here)
Other datasets used include:
n2c2 2018 Track 2 (used in Chapter 6) - https://doi.org/10.1093/jamia/ocz166
Notes on additional data and model availability
With the exception of CARDIO:DE, the cardiology doctor’s letters used in Chapters 2, 5, 6, and 7 cannot be distributed due to data protection regulations. The same applies to all further-pretrained and fine-tuned models.