Overview

Language in its digital form is the most ubiquitous human product today. The amount of health-related text, such as the clinical narrative in electronic medical records, the text from online health communities and media, and the biomedical scholarly literature, has been growing exponentially. Coupled with advances in computational methods and hardware, this firehose of text presents the tech community with a unique opportunity to be a major player in biomedical discoveries and healthcare personalization by distilling the unwieldy stream into informational nuggets.

The Health Natural Language Processing (hNLP) Center targets a key challenge to current hNLP research and health-related human language technology development: the lack of health-related language data. Without shared data, researchers cannot build on one another’s scientific progress as they do in other disciplines where massive amounts of data are available. Even worse, the stakeholders and consumers of hNLP technology in health discovery and care have little access to robust hNLP technology and are left to implement all methods from scratch on their own data.

The Center builds on the rich experience of its founders – Prof. Guergana Savova (Harvard), Prof. Martha Palmer (University of Colorado) and Prof. Noemie Elhadad (Columbia University) – in the area of natural language processing. The Center follows the tradition of other successful data dissemination centers for general language resources, such as the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA). The hNLP Center goes beyond these initiatives by addressing the critical need for clinical text availability to advance health IT in general.

The Center’s mission is to support health language-related education, research and technology development by creating and sharing curated linguistic textual resources, based on the principle that broad access to data drives innovation. The Center is organized as a not-for-profit consortium of members. Its fee-based membership is similar to the LDC’s and will ensure its sustainability. The clinical narrative, which remains highly sensitive even though it is completely de-identified, is distributed through a meticulously thought-out process. The Center is housed within the Computational Health Informatics Program (www.chip.org) at Boston Children’s Hospital, a Harvard University affiliated hospital, to build on existing security infrastructure. Industry members can obtain a commercial license that allows them to embed models built from the Center’s data into products.

The Center’s primary activities are to:

Provide a repository and a point of curation, distribution and management for health-related language resources
Support sponsored research programs and health-related language-based technology evaluations
Engage in collaborations with US and foreign researchers, institutions and data centers
Host and participate in various workshops

The Center addresses the emphasis on methodological robustness and reproducibility that the National Institutes of Health and the National Science Foundation now require. It also aligns with bold and ambitious national initiatives such as the Cancer Moonshot and the call for predictive models for personalized and precision medicine.

The hNLP Center is supported with seed funding from the National Institute of General Medical Sciences at the United States National Institutes of Health (grant R01GM114355).