This document provides instructions and guidelines for an annotation task prepared specifically by the National Centre for Text Mining (NaCTeM) for BioCreative V’s User Interactive Task. The task involves the curation of phenotypes relevant to chronic obstructive pulmonary disease (COPD).
This tutorial includes everything that is necessary to complete this or similar tasks in Argo.
Background
Chronic obstructive pulmonary disease is a category of medical conditions characterised by blockage of the lung airways and breathing difficulties [1]. In 2011, it was the third leading cause of death in the United States, and it has been predicted to become the third leading cause worldwide by 2030 [2].
Phenotypes are an organism’s observable traits and help in uncovering the underlying mechanisms of a patient’s medical condition. In the case of COPD, disease and clinical manifestations are heterogeneous and vary widely from one patient to another. Methods for identifying phenotypes (i.e., COPD phenotyping) have thus been adopted to allow for the well-defined categorisation of COPD patients according to their prognostic and therapeutic characteristics.
The task of identifying phenotypes within narratives and documents, i.e., phenotype curation, is a widely adopted practice. A phenotypic concept can be expressed within text in various ways. The phenotype pertaining to blockage of lung airways, for example, can take the form of any of the following variants and more: airways are blocked, blocked airways, blockage of airways, airways obstruction, obstructed lung airways, obstruction of airways.
The tasks
To facilitate the curation of COPD phenotypes, we are providing text mining-based support for three tasks: (1) the recognition of phenotypic mentions, (2) the normalisation of phenotypic mentions to relevant ontologies, and (3) the detection of relations between mentions.
Recognition of phenotypic mentions
Expressions denoting COPD phenotypes will be demarcated and assigned semantic categories in this task, similar to named entity recognition (NER). Phenotypes of interest fall under any of the following categories:
- medical condition
- sign or symptom
- protein
- drug
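To make the output of this task concrete, the following Python sketch shows one way a recognised mention could be represented: a character span together with one of the four categories above. The class name, field names and offset convention are illustrative assumptions, not Argo's actual annotation schema.

```python
from dataclasses import dataclass

# The four semantic categories listed above.
CATEGORIES = {"medical condition", "sign or symptom", "protein", "drug"}


@dataclass(frozen=True)
class Mention:
    """A demarcated text span with a semantic category (hypothetical schema)."""
    begin: int      # offset of the first character, inclusive
    end: int        # offset just past the last character, exclusive
    category: str   # one of CATEGORIES

    def __post_init__(self):
        # Reject categories outside the task definition and empty spans.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if not 0 <= self.begin < self.end:
            raise ValueError("span must satisfy 0 <= begin < end")

    def text(self, document: str) -> str:
        """Return the covered text of this mention within the document."""
        return document[self.begin:self.end]


# Illustrative sentence, not taken from the task corpus.
sentence = "Patients with COPD often present with airways obstruction."
start = sentence.index("COPD")
mention = Mention(start, start + len("COPD"), "medical condition")
```

Storing offsets rather than the string itself keeps the annotation tied to one exact position in the document, so repeated occurrences of the same expression remain distinguishable.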
Normalisation of phenotypic mentions to relevant ontologies
Since a phenotype can appear in text in various forms, the normalisation of such surface forms to corresponding entries in controlled vocabularies or ontologies has become a crucial step in phenotype curation. Argo’s text mining tools will link the mentions recognised in the previous task to corresponding concepts in COPD-relevant controlled vocabularies and ontologies, such as the following:
- Unified Medical Language System (medical conditions and signs or symptoms)
- UniProt (proteins)
- Chemical Entities of Biological Interest or ChEBI (drugs)
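As a rough illustration of what normalisation produces, the sketch below maps several surface variants of the same phenotype to a single concept identifier through a plain dictionary lookup. This is a deliberately simple stand-in and not the method Argo uses; the lexicon, the function name `normalise`, and the identifier `EXAMPLE:0001` are hypothetical placeholders rather than real UMLS, UniProt or ChEBI entries.

```python
from typing import Optional

# Hypothetical lexicon: surface form -> concept identifier.
# "EXAMPLE:0001" is a placeholder, NOT a real ontology identifier.
LEXICON = {
    "airways are blocked": "EXAMPLE:0001",
    "blocked airways": "EXAMPLE:0001",
    "blockage of airways": "EXAMPLE:0001",
    "airways obstruction": "EXAMPLE:0001",
    "obstructed lung airways": "EXAMPLE:0001",
    "obstruction of airways": "EXAMPLE:0001",
}


def normalise(surface: str) -> Optional[str]:
    """Return the concept identifier for a surface form, or None if unknown."""
    # Lowercase and collapse runs of whitespace before the lookup.
    key = " ".join(surface.lower().split())
    return LEXICON.get(key)


# All listed variants resolve to the same (placeholder) concept.
concept_ids = {normalise(form) for form in LEXICON}
```

An exact-match lexicon only covers variants that have been enumerated in advance, which is one reason automated normalisation support is useful for curation.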
Detection of relations between mentions
As the final task, our tools will detect binary relations between COPD and any other mention falling under our semantic categories of interest. In this way, the curated relations will be able to answer the following questions:
- Which other medical conditions (e.g., comorbidities) are associated with COPD?
- Which signs or symptoms are indicative of COPD?
- Which proteins underlie the mechanisms of COPD?
- Which drugs affect COPD?
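The shape of such binary relations can be sketched in Python as follows. The example simply pairs COPD with every other mention found in the same sentence, which is a naive co-occurrence assumption made for illustration and not a description of how Argo's tools detect relations. The function name `candidate_relations` and the tuple layout are hypothetical.

```python
from typing import List, Tuple

# A mention is represented here as (covered text, semantic category).
Mention = Tuple[str, str]
# A relation is (anchor, other mention text, other mention category).
Relation = Tuple[str, str, str]


def candidate_relations(mentions: List[Mention]) -> List[Relation]:
    """Pair COPD with every other mention from the same sentence.

    Returns an empty list when the sentence contains no COPD mention.
    """
    def is_copd(text: str) -> bool:
        return text.upper() == "COPD"

    if not any(is_copd(text) for text, _ in mentions):
        return []
    return [
        ("COPD", text, category)
        for text, category in mentions
        if not is_copd(text)
    ]


# Illustrative mentions; the category assignment is an assumption.
mentions = [
    ("COPD", "medical condition"),
    ("airways obstruction", "sign or symptom"),
]
relations = candidate_relations(mentions)
```

Because the category of the second argument is kept in each relation, the results can be grouped directly by the four questions above.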
Instructions
References
[1] http://www.cdc.gov/copd/index.html
[2] http://www.who.int/respiratory/copd/en