Welcome to the ShARe Project
Welcome to the Shared Annotated Resources (ShARe) project.
Much of the clinical information required for accurate clinical research, active decision support, and broad-coverage surveillance is locked in text files in an electronic medical record (EMR). The only feasible way to leverage this information for translational science is to extract and encode the information using natural language processing. Over the last two decades, several research groups have developed NLP tools for clinical notes, but a major bottleneck preventing progress in clinical NLP is the lack of standard, annotated data sets for training and evaluating NLP applications. Without these standards, individual NLP applications abound without the ability to train different algorithms on standard annotations, share and integrate NLP modules, or compare performance. We propose to develop standards and infrastructure that can enable technology to extract scientific information from textual medical records, and we propose the research as a collaborative effort involving NLP experts across the U.S.
To accomplish this goal, we will address three specific aims, each with a set of sub-aims:
Aim 1: Extend existing standards and develop a new consensus annotation schema for annotating clinical text in a way that is interoperable, extensible and usable
- Develop annotation schemas for the linguistic and clinical annotations
- Determine the reliance on clinical terminologies and ontological knowledge
- Develop annotation guidelines for the linguistic and clinical annotations
Aim 2: Develop and evaluate a manual annotation methodology that is efficient and accurate, then apply the methodology to annotate a set of publicly available clinical texts
- Establish an infrastructure for collecting annotations of clinical text
- Develop an efficient methodology for acquiring accurate annotations
- Annotate and evaluate the final annotation set
Aim 3: Develop a publicly available toolkit for automatically annotating clinical text and conduct a shared evaluation of the toolkit, using evaluation metrics that are multidimensional and flexible
- Incorporate modules in Apache cTAKES using the Mayo NLP System
- Design evaluation metrics for comparing automated annotations against the annotated corpus. Apply standard evaluation methods and develop new metrics that address the complexities of evaluating textual judgments, including the absence of a true gold standard and the comparison of frame-based annotations
- Organize a multi-track shared evaluation of clinical NLP systems
- Dissemination plan
The project described is supported by Grant Number R01GM090187 from the National Institute of General Medical Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences. The project period is September 2010 to June 2014.
Who We Are
Boston Children's Hospital/Harvard Medical School
University of Utah
University of Colorado
University of California San Diego
In collaboration with
Publications and Presentations Crediting ShARe
- Meystre, Stephane; Boonsirisumpun, Narong; Elhadad, Noemie; Savova, Guergana; Chapman, Wendy. 2014. Poster: Standards-based data model for clinical documents and information in the Shared Annotated Resources (ShARe) project. AMIA Summit on Clinical Research Informatics, San Francisco, CA.
- Mowery, Danielle L; Franc, Daniel; Ashfaq, Shazia; Zamora, Tania; Cheng, Eric; Chapman, Wendy W; Chapman, Brian E. (2014). Developing a Knowledge Base for Detecting Carotid Stenosis with pyConText. AMIA Symp Proc.
- Pradhan, Sameer; Elhadad, Noemie; South, Brett; Martinez, David; Christensen, Lee; Vogel, Amy; Suominen, Hanna; Chapman, Wendy; Savova, Guergana. (2014). Evaluating the state of the art in disorder recognition and normalization of the clinical narrative. J Am Med Inform Assoc. 
- Savova, Guergana; Pradhan, Sameer; Palmer, Martha; Styler, Will; Chapman, Wendy; Elhadad, Noemie. (in press). Annotating the clinical text - MiPACQ, ShARe, SHARPn and THYME corpora. In Handbook of Linguistic Annotations. Ed. James Pustejovsky and Nancy Ide. Springer.
- South, Brett R; Mowery, Danielle L; Suo, Ying; Ferrández, Oscar; Meystre, Stephane M; Chapman, Wendy W. (2014). Evaluating the effects of machine pre-annotation and an interactive annotation interface on manual de-identification of clinical text. J Biomed Inform. Special Issue: Medical Privacy.
- South, Brett R; Mowery, Danielle L; Leng, Jianwei; Meystre, Stephane M; Chapman, Wendy W. (2014). A System Usability Study Assessing a Machine-Assisted Interactive Interface to Support Annotation of Protected Health Information in Clinical Texts. AMIA Symp Proc.
- Velupillai, Sumithra; Mowery, Danielle L; Christensen, Lee; Elhadad, Noemie; Pradhan, Sameer; Savova, Guergana; Chapman, Wendy W. Disease/Disorder Semantic Template Filling – Information Extraction Challenge in the ShARe/CLEF eHealth Evaluation Lab 2014. AMIA Symp Proc. 2014
- Chapman, Wendy; Denny, Joshua; Haug, Peter; Meystre, Stephane; Patrick, Jon; Savova, Guergana; Solti, Imre; Uzuner, Ozlem; Xu, Hua. 2013. Panel: Natural language processing working group pre-symposium. AMIA Fall Annual Symposium.
- Dligach, Dmitriy; Bethard, Steven; Becker, Lee; Miller, Timothy; Savova, Guergana. 2013. Discovering body site and severity modifiers in clinical texts. Journal of the American Medical Informatics Association. Doi:10.1136/amiajnl-2013-001766. 
- Dligach, Dmitriy; Bethard, Steven; Becker, Lee; Miller, Timothy; Savova, Guergana. 2013. Discovering body site and severity modifiers in clinical texts. AMIA Fall Annual Symposium.
- Mowery, Danielle L; South, Brett R; Murtola, Laura-Maria; Salanterä, Sanna; Martinez, David; Suominen, Hanna; Elhadad, Noemie; Pradhan, Sameer; Savova, Guergana; Chapman, Wendy W. Task 2: ShARe/CLEF eHealth evaluation lab 2013. CLEF Proc. Valencia, Spain. 2013.
- Mowery, Danielle L; South, Brett R; Leng, Jianwei; Murtola, Laura-Maria; Danielsson-Ojala, Rita; Salanterä, Sanna; Chapman, Wendy W. Creating a reference standard of acronym and abbreviation annotations for the ShARe/CLEF eHealth challenge 2013. AMIA Symp Proc. Washington, DC. 2013.
- Pradhan, Sameer; Elhadad, Noemie; South, Brett; Martinez, David; Christensen, Lee; Vogel, Amy; Suominen, Hanna; Chapman, Wendy; Savova, Guergana. 2013. Task 1: ShARe/CLEF eHealth Evaluation Lab 2013. Proceedings of the ShARe/CLEF Evaluation Lab 2013.
- Savova, Guergana; Chapman, Wendy; Elhadad, Noemie; Palmer, Martha. 2013. Panel: Shared resources, shared code, and shared activities in clinical natural language processing. AMIA Fall Annual Symposium.
- Zhang, Shaodian; Elhadad, Noemie. 2013. Unsupervised Biomedical Named Entity Recognition: Experiments with Clinical and Biological Texts. Journal of Biomedical Informatics. 46(6): 1088-1098.
- Suominen, Hanna; Salanterä, Sanna; Velupillai, Sumithra; Chapman, Wendy W; Savova, Guergana; Elhadad, Noemie; Pradhan, Sameer; South, Brett R; Mowery, Danielle L; Leveling, Johannes; Kelly, Liadh; Goeuriot, Lorraine; Martinez, David; Zuccon, Guido. Overview of the ShARe/CLEF eHealth evaluation lab 2013. Springer LNCS.
- Savova, Guergana. 2012. Shared Annotated Resources for the Clinical Domain. Invited presentation at the Natural Language Processing Working Group Pre-Symposium – doctoral consortium and a data workshop. AMIA Fall Symposium, Nov. 2012, Chicago IL
- Savova, Guergana; Chapman, Wendy; Elhadad, Noemie. 2012. Shared Annotated Resources for the Clinical Domain. Invited presentation at the Natural Language Processing (NLP) Annotation workshop collocated with the 2nd annual IEEE International Conference on Healthcare Informatics, Imaging and Systems Biology, Sept. 2012. San Diego, CA
- CLEF/ShARe 2013: http://sites.google.com/site/shareclefehealth/
- CLEF/ShARe 2014 (in collaboration with the THYME project): http://clefehealth2014.dcu.ie/task-2
- SemEval 2014 Analysis of Clinical Text Task 7 (in collaboration with the THYME project): http://alt.qcri.org/semeval2014/task7/
- SemEval 2015 Analysis of Clinical Text Task 14 (in collaboration with the THYME project): http://alt.qcri.org/semeval2015/task14/
Getting Access to the ShARe Corpus and Gold Standard Annotations
The ShARe corpus consists of deidentified clinical free-text notes from the MIMIC II database, version 2.5 (mimic.physionet.org). Notes were authored in the ICU setting and note types include discharge summaries, ECG reports, echo reports, and radiology reports (for more information about the MIMIC II database, please see the MIMIC User Guide: http://mimic.physionet.org/UserGuide/UserGuide.pdf).
- Obtain a human subjects training certificate. If you do not have a certificate, you can take the CITI training course (http://www.citiprogram.org/Default.asp) or the NIH training course (http://phrp.nihtraining.com/users/login.php).
- Go to the PhysioNet site: http://physionet.org/mimic2/mimic2_access.shtml
- Click on the link for “creating a PhysioNetWorks account” (near the middle of the page) (http://physionet.org/pnw/login) and follow the instructions.
- Go to this site and accept the terms of the DUA: http://physionet.org/works/MIMICIIClinicalDatabase/access.shtml
- You will receive an email telling you to fill in your information on the DUA and email it back with your human subjects training certificate.