Curation of COPD Phenotypes

This document provides instructions and guidelines for an annotation task prepared by the National Centre for Text Mining (NaCTeM) for BioCreative V’s User Interactive Task. The task involves the curation of phenotypes relevant to chronic obstructive pulmonary disease (COPD).

This tutorial includes everything that is necessary to complete this or similar tasks in Argo.


Chronic obstructive pulmonary disease is a category of medical conditions characterised by blockage of the lung airways and breathing difficulties [1]. In 2011, it was the third leading cause of death in the United States, and it has been predicted to become the third leading cause worldwide by 2030 [2].

Phenotypes are an organism’s observable traits, and they help to uncover the mechanisms underlying a patient’s medical condition. In the case of COPD, disease and clinical manifestations are heterogeneous and vary widely from one patient to another. Methods for identifying phenotypes (i.e., COPD phenotyping) have thus been adopted to allow COPD patients to be categorised in a well-defined way according to their prognostic and therapeutic characteristics.

The task of identifying phenotypes within narratives and documents, i.e., phenotype curation, is a widely adopted practice. A phenotypic concept can be expressed within text in various ways. The phenotype pertaining to blockage of lung airways, for example, can take the form of any of the following variants and more: airways are blocked, blocked airways, blockage of airways, airways obstruction, obstructed lung airways, obstruction of airways.

The tasks

To facilitate the curation of COPD phenotypes, we are providing text mining-based support for three tasks: (1) the recognition of phenotypic mentions, (2) the normalisation of phenotypic mentions to relevant ontologies, and (3) the detection of relations between mentions.

Recognition of phenotypic mentions

In this task, expressions denoting COPD phenotypes are demarcated and assigned semantic categories, much as in named entity recognition (NER). Phenotypes of interest fall under any of the following categories:

  • medical condition
  • sign or symptom
  • protein
  • drug
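
As an illustration only, a recognised mention can be thought of as a span of text paired with one of the four categories above. The sketch below is not Argo's internal representation; the names Category, Mention and covered_text, as well as the example sentence, are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The four semantic categories of interest listed above."""
    MEDICAL_CONDITION = "medical condition"
    SIGN_OR_SYMPTOM = "sign or symptom"
    PROTEIN = "protein"
    DRUG = "drug"


@dataclass(frozen=True)
class Mention:
    """Hypothetical record for one demarcated expression."""
    begin: int          # character offset of the first character
    end: int            # character offset one past the last character
    category: Category

    def covered_text(self, document: str) -> str:
        # Return the stretch of the document that this mention marks.
        return document[self.begin:self.end]


# Invented example sentence; offsets are computed rather than hard-coded.
text = "Patients with COPD often present with airways obstruction."
phrase = "COPD"
begin = text.index(phrase)
mention = Mention(begin=begin, end=begin + len(phrase),
                  category=Category.MEDICAL_CONDITION)

print(mention.covered_text(text), "->", mention.category.value)
```

Storing character offsets rather than the text itself keeps each mention tied to an exact position in the document, so the same expression occurring twice yields two distinct mentions.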

Normalisation of phenotypic mentions to relevant ontologies

Since a phenotype can appear in text in various forms, the normalisation of such surface forms to corresponding entries in controlled vocabularies or ontologies has become a crucial step in phenotype curation. Argo’s text mining tools will link the mentions recognised in the previous task to corresponding concepts in COPD-relevant controlled vocabularies and ontologies, such as the following:

  • Unified Medical Language System (medical conditions and signs or symptoms)
  • UniProt (proteins)
  • Chemical Entities of Biological Interest or ChEBI (drugs)
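
To make the idea of normalisation concrete, the sketch below maps the airway-blockage variants listed earlier in this document to a single identifier through a simple dictionary lookup. It is a deliberately naive illustration, not the method used by Argo's tools: the identifier EXAMPLE:0001 is a placeholder rather than a real UMLS, UniProt or ChEBI entry, and the names SURFACE_FORMS and normalise are assumptions made for this example.

```python
from typing import Optional

# Placeholder only; NOT a real identifier from any ontology.
PLACEHOLDER_ID = "EXAMPLE:0001"

# Surface forms taken from the variants listed earlier in this document,
# all pointing at the same (hypothetical) concept identifier.
SURFACE_FORMS = {
    "airways are blocked": PLACEHOLDER_ID,
    "blocked airways": PLACEHOLDER_ID,
    "blockage of airways": PLACEHOLDER_ID,
    "airways obstruction": PLACEHOLDER_ID,
    "obstructed lung airways": PLACEHOLDER_ID,
    "obstruction of airways": PLACEHOLDER_ID,
}


def normalise(mention_text: str) -> Optional[str]:
    """Return the identifier for a mention, or None if it is unknown."""
    # Lower-case and collapse whitespace so trivial differences do not matter.
    key = " ".join(mention_text.lower().split())
    return SURFACE_FORMS.get(key)


print(normalise("Blocked  airways"))
print(normalise("cough"))
```

An exact-match lookup such as this one misses any variant that is not listed, which is one reason why normalisation in practice relies on more flexible matching and on human curators to confirm or correct the suggested links.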

Detection of relations between mentions

As the final task, our tools will detect binary relations between COPD and any other mention falling under our semantic categories of interest. In this way, the curated relations will be able to answer the following questions: