Revision as of 16:29, 4 August 2016

Cancer is a genomic disease, with enormous heterogeneity in its behavior. In the past, our methods for categorization, prediction of outcome, and treatment selection have relied largely on a morphologic classification of cancer. But new technologies are fundamentally reframing our views of cancer initiation, progression, metastasis, and response to treatment, moving us towards a molecular classification of cancer. This transformation depends not only on our ability to deeply investigate the cancer genome, but also on our ability to link specific molecular changes to specific tumor behaviors. As sequencing costs continue to decline at a supra-Moore’s-law rate, a torrent of cancer genomic data is looming. However, our ability to deeply investigate the cancer genome is outpacing our ability to correlate these changes with the phenotypes that they produce. Translational investigators seeking to associate specific genetic, epigenetic, and systems changes with particular tumor behaviors lack access to detailed observable traits about the cancer (the so-called ‘deep phenotype’), which has become a major barrier to research.

We propose the advanced development and extension of a software platform for performing deep phenotype extraction directly from medical records of patients with cancer, with the goal of enabling translational cancer research and precision medicine. The work builds on previous informatics research and software development efforts from Boston Children’s Hospital and University of Pittsburgh groups, both individually and together. Multiple software projects developed by our groups (some initially funded by NCI) that have already passed the initial prototyping and pilot development phase (eMERGE, THYME, TIES, ODIE, Apache cTAKES) will be combined and extended to produce an advanced software platform for accelerating cancer research. Previous work in a number of NIH-funded translational science initiatives has already demonstrated the benefits of these methodologies (e.g. Electronic Medical Record and Genomics (eMERGE), PharmacoGenomics Research Network (PGRN), SHARPn, i2b2). However, to date these initiatives have focused exclusively on select non-cancer phenotypes and have had the goal of dichotomizing patients for a particular phenotype of interest (for example, Type II Diabetes, Rheumatoid Arthritis, or Multiple Sclerosis). In contrast, our proposed work focuses on extracting and representing multiple phenotype features for individual patients, to build a cancer phenotype model, relating observable traits over time for individual patients.

Goals

Our first four specific aims focus on development, significantly extending the capability of our current software to address challenging problems in biomedical information extraction. These aims support the development and evaluation of novel methods for cancer deep phenotype extraction:

Specific Aim 1: Develop methods for extracting phenotypic profiles. Extract patients’ deep phenotypes and their attributes, such as general modifiers (negation, uncertainty, subject) and cancer-specific characteristics (e.g. grade, invasion, lymph node involvement, metastasis, size, stage).
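To make the shape of such a phenotypic profile concrete, the sketch below shows one way a single extracted finding with its modifiers and cancer-specific attributes could be represented. The class and field names are illustrative assumptions, not the DeepPhe schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one extracted phenotype mention; names are
# illustrative and do not reflect the actual DeepPhe data model.
@dataclass
class PhenotypeMention:
    concept: str                  # the extracted finding, e.g. an ontology term
    negated: bool = False         # general modifier: negation
    uncertain: bool = False      # general modifier: uncertainty
    subject: str = "patient"     # general modifier: whom the finding applies to
    attributes: dict = field(default_factory=dict)  # cancer-specific: grade, stage, size...

# A sentence like "No evidence of lymph node metastasis" might yield:
mention = PhenotypeMention(concept="Lymph Node Metastasis", negated=True)
mention.attributes["stage"] = "T2N0M0"
```

A patient's deep phenotype would then be a collection of such mentions gathered across the record.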

Specific Aim 2: Extract gene/protein mentions and their variants from the clinical narrative.

Specific Aim 3: Create a longitudinal representation of the disease process and its resolution. Link phenotypes, treatments, and outcomes in temporal associations to create a longitudinal abstraction of the disease.
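The core of such a longitudinal abstraction is ordering phenotype, treatment, and outcome events on a shared timeline. A minimal sketch, with assumed (not DeepPhe-specific) names and example data:

```python
from dataclasses import dataclass
from datetime import date

# Illustrative only: a dated clinical event of one of three kinds.
@dataclass
class Event:
    when: date
    kind: str         # "phenotype" | "treatment" | "outcome"
    description: str

def timeline(events):
    """Order events in time, the simplest longitudinal abstraction."""
    return sorted(events, key=lambda e: e.when)

# Hypothetical example history, deliberately given out of order:
history = timeline([
    Event(date(2015, 6, 1), "treatment", "lumpectomy"),
    Event(date(2015, 3, 10), "phenotype", "invasive ductal carcinoma, grade 2"),
    Event(date(2016, 3, 15), "outcome", "no evidence of disease"),
])
```

Real temporal linking from narrative is much harder (relative and underspecified times, containment relations), but the target representation is of this general form.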

Specific Aim 4: Extract discourses containing explanations, speculations, and hypotheses, to support explorations of causality.

Our last two specific aims focus on implementation: designing the software to support the cancer research community and ensuring its usability and utility. These aims support the design, dissemination, and sharing of the products of this work to maximize impact on cancer research:

Specific Aim 5: Design and implement a computational platform for deep phenotype discovery and analytics for translational investigators, including integrative visual analytics.

Specific Aim 6: Advance translational research in driving cancer biology research projects in breast cancer, ovarian cancer, and melanoma. Include the research community throughout the design and evaluation of the platform. Disseminate freely available software.

Impact: The proposed work will produce novel methods for extracting detailed phenotype information directly from the EMR, the major source of such data for patients with cancer. Extracted phenotypes will be used in three ongoing translational studies with a precision medicine focus. Dissemination of the software will enhance the ability of cancer researchers to abstract meaningful clinical data for translational research. If successful, systematic capture and representation of these phenotypes from EMR data could later be used to drive clinical genomic decision support.

Who We Are

Boston Children's Hospital/Harvard Medical School

  • Guergana Savova (MPI)
  • Dmitriy Dligach
  • Timothy Miller
  • Sean Finan
  • David Harris
  • Chen Lin
  • Ethan Hartzell


University of Pittsburgh

  • Rebecca Crowley Jacobson (MPI)
  • Harry Hochheiser
  • Roger Day
  • Adrian Lee
  • Robert Edwards
  • John Kirkwood
  • Kevin Mitchell
  • Eugene Tseytlin
  • Girish Chavan
  • Liz Legowski (through Jan 2015)
  • Melissa Castine

Publications and Presentations

The following are publications and presentations crediting the Cancer Deep Phenotype Extraction (DeepPhe) project:

  • Hochheiser H, Jacobson R, Washington N, Denny J, Savova G. 2015. Natural language processing for phenotype extraction: challenges and representation. AMIA Annual Symposium, Nov 2015, San Francisco, CA.
  • Dligach D, Miller T, Savova G. 2015. Semi-supervised Learning for Phenotyping Tasks. AMIA Annual Symposium, Nov 2015, San Francisco, CA.
  • Lin C, Dligach D, Miller T, Bethard S, Savova G. 2015. Layered temporal modeling for the clinical domain. Journal of the American Medical Informatics Association. http://jamia.oxfordjournals.org/content/early/2015/10/31/jamia.ocv113


Funding

The project described is supported by Grant Number 1U24CA184407-01 from the National Cancer Institute at the US National Institutes of Health. This work is part of the NCI's Informatics Technology for Cancer Research (ITCR) Initiative. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

The project period is May 2014 – April 2019.

Stakeholder Information

User Personae

Contextual Interview Protocol

During the interview, we will ask the informant to conduct relevant tasks, explaining what is being done, why it must be done, and how it is done. As the informant continues through the task, the interviewer(s) will ask for clarification, pose questions about contextual factors, and generally attempt to build an understanding of the work. Therefore, the interview will be unstructured and much of the discussion will be determined by the events that occur as the informant completes their tasks.

Below is some broad structure for the interview. However, no ordering of these questions is implied; rather, they can be asked when convenient during the course of the discussion.

Recording

All sessions should be audio-recorded with the participant’s permission, either via web conference software or portable digital audio recorder. Written notes will be taken as well. If interviews are conducted via web-conference, screen captures will be recorded as well. We will use GoToMeeting for web conferencing. Artifacts such as data files would also be useful when available.

Steps will be taken to avoid protected health information (PHI). We will try to keep PHI entirely out of consideration, and will only observe what participants do if it does not involve any PHI. If any displays or artifacts involving PHI are inadvertently introduced into the discussion, any screen capture or recording modes that might capture the PHI will be disabled immediately.

Interview Introduction

The introductory component of the discussion is designed to provide basic context and to start conversation:

Introductory script:

"Hello. My name is XXXX. As you know, you’re here today to participate in a contextual inquiry study to understand the requirements for the visual analytics tools we will be developing. These tools will help researchers explore and interpret phenotype data that will be extracted with our software. We would like to understand users’ information and workflow needs to maximize the usefulness of the software.

During the session, you will be questioned and observed. We will ask some general questions about your work, and more detailed questions about specific tasks. Data will be collected in the form of notes and artifacts. With your permission, the session will be audio-recorded and captured via screen capture software. The session will last approximately one hour.

In this session, we would like to understand your broad research and clinical goals, and then explore in detail how you currently identify, store, and manipulate specific phenotype data and link this to genotype data in your work. We’d like to understand how this work is done, any challenges you face, and what you would hope for in a new tool to make work more efficient and easier to accomplish. If possible, we’d like you to demonstrate the tools that you currently use on some meaningful data. We’d like you to identify a representative dataset and task that you are currently working on or have worked on previously to demonstrate how you currently do things.

Do you have any questions?

Before we begin, do we have your permission to record this session? (IF YES, BEGIN RECORDING)

To start off, we’d like to get a sense of your background and research."

Respondent Background
  1. What is your training? What degrees have you obtained, and in what areas?
  2. What is your current position? How long have you been in this position?
  3. How much of your time do you spend on cancer data analytics?
  4. If applicable: How much of your time is spent on clinical work?
Problem Description
  1. What questions are you currently trying to answer related to cancer analytics?
  2. What sort of data are you dealing with?
  3. Is this a research or clinical effort?
  4. Who supports this effort?
Data
  1. Which tools do you currently work with for your data?
  2. What are the major data challenges that you struggle with?
Major Non-data Challenges
  1. What are the major challenges that you struggle with that are not data related?

Task Identification

Now we’d like you to identify a specific dataset and tasks that fit within the scope of the work. This should be representative of your work and illustrative of difficulties you encounter. Please choose something that does not involve sensitive data.

  • Have you selected a task or tasks? Please explain the task.
  • Will demonstrating this task involve any protected health information? (IF YES, WE WILL NOT OBSERVE BUT JUST HAVE THE PARTICIPANT EXPLAIN HOW THEY DO THEIR WORK)
  • Note: If participants identify multiple important tasks, they should be prioritized so that the session will start with the highest priority and continue from there as time allows.

Information modeling interview protocol

As we develop draft and final models, we will be validating them with the domain scientists who are collaborators on our grant. The purpose of this validation is four-fold:

  1. Assure face validity and scientific relevance of the models themselves
  2. Prioritize information extraction targets by identifying "most wanted" elements
  3. Identify the value proposition by determining elements most difficult to obtain from structured data
  4. Align the semantic model with the scientific "mental model", making downstream aspects of the software (including the visualization) more organic and presumably easier to accept

Our goal is thus to ask each of our three domain experts, based on their specific scientific needs, to:

  1. Review our existing model attributes and constraints/value sets, and prioritize them
  2. Identify missing model attributes and constraints/value sets which will be added to the models
  3. Identify the attributes that are hardest to obtain from existing (mainly structured) data sources
  4. Determine which relationships are most important to represent
  5. Enumerate values within value sets

We propose to use a card sort task to accomplish this.

Materials

  • Index cards with printed labels. Each card will have the attribute and a set of values (in lieu of a definition)
  • Colored markers to write categories on cards after the sorts
  • Extra blank index cards
  • Script
  • Audio recorder

Competency questions

Validation of the DeepPhe models requires consideration of the types of questions that they might (and might not) be able to answer. Although such efforts are almost by definition limited in their scope and biased in their content, we attempt to list some questions that might be asked by cancer researchers. These questions might be based on our discussions with collaborators, review of the literature, or other sources.

More details available here.

Contextual Design Information

See presentation here.

Licensing

Goals

  • Encourage the broad use of DeepPhe tools and models, while maintaining openness and attribution.
  • Comply with the resource sharing plan as submitted to NIH. See, for example, the ITCR policies (http://itcr.nci.nih.gov/about-itcr)

Distinction

We are reading "software" as referring to executable components, while "models" refers to the OWL data models developed for representation of cancer phenotypes.

Licensing of Software

  • cTAKES is licensed under the Apache License 2.0
  • DeepPhe cTAKES components will be similarly licensed
  • Other DeepPhe components will be licensed under an open-source license, potentially (but not necessarily) the Apache License

Licensing of Models

DeepPhe Ontology Models

  • DeepPhe Ontology models will be licensed using CC-BY 4.0
  • This is consistent with plans to use CC-BY 4.0 to license the schema ontology upon which the DeepPhe model is based.

Contact

If you need assistance or have further questions about the project, feel free to e-mail Guergana Savova at Guergana.Savova@childrens.harvard.edu or Rebecca Crowley Jacobson at rebeccaj@pitt.edu.