Data Sets

Some data sets require a separate Data Use Agreement in addition to the Membership Agreement. For the 2017 Membership Year, these data sets are ShARe (requires a Data Use Agreement with the MIMIC/PhysioNet initiative) and THYME (requires a Data Use Agreement with Mayo Clinic). A Data Use Agreement is required to obtain the text files; the standalone gold annotations can be obtained without one. The Center staff will guide each member candidate through the Data Use Agreement process but cannot guarantee the outcome.

When using a data set, please cite the associated papers and acknowledge the hNLP Center.


2017 Data set                 Annotation formats  Documentation
CCHMC ICD-9 radiology corpus  Text                A Shared Task Involving Multi-label Classification of Clinical Free Text
ShARe disorders corpus        Knowtator           SemEval-2015 Task 14: Analysis of Clinical Text
THYME corpus                  Anafora             (1) Temporal Annotation in the Clinical Domain
                                                  (2) SemEval-2016 Task 12: Clinical TempEval