Curation of COPD Phenotypes: Annotation Guidelines

This document is part of Curation of COPD Phenotypes.

Annotation of COPD Phenotypic Concepts

For the first phase of the annotation task, mentions pertaining to COPD phenotypic concepts will be annotated. This involves marking them up within text (by means of highlighting and assigning the appropriate annotation label) and linking them to external databases. We are interested in the following four types:

  • MedicalCondition: a disease or medical condition, including COPD comorbidities; to be linked to concepts in the Unified Medical Language System (UMLS)
  • SignOrSymptom: an observable irregularity manifested by a COPD patient; to be linked to concepts in the Unified Medical Language System (UMLS)
  • Drug: a drug name; to be linked to concepts in the Chemical Entities of Biological Interest (ChEBI) database
  • Protein: a protein name; to be linked to concepts in the UniProt Knowledge Base
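
The pairing of concept types with external resources listed above can be written as a small lookup table. This is only an illustrative sketch: the names `LINK_TARGETS` and `link_target` are hypothetical and are not part of Argo or of the annotation schema.

```python
# Hypothetical lookup table: concept type -> external resource used for linking.
LINK_TARGETS = {
    "MedicalCondition": "UMLS",
    "SignOrSymptom": "UMLS",
    "Drug": "ChEBI",
    "Protein": "UniProt",
}


def link_target(concept_type: str) -> str:
    """Return the resource a mention of the given type should be linked to."""
    try:
        return LINK_TARGETS[concept_type]
    except KeyError:
        # Anything outside the four types of interest is not annotated.
        raise ValueError(f"not a concept type of interest: {concept_type!r}")
```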

In the non-text mining-assisted mode of annotation, you will have to create annotations from scratch. For the text mining-assisted mode of annotation, the documents you will receive already contain annotations automatically generated by text mining. These annotations, however, are by no means perfect, and thus your task will involve operations such as: (a) deleting unwanted annotations, (b) correcting wrongly assigned concept labels, (c) adjusting the spans or boundaries of annotations, (d) adding missed mentions, and (e) correcting or providing unique IDs from external vocabularies/ontologies. Instructions on how to do these in Argo can be found here.

Scope

In this section, we describe the scope (i.e., what should be considered for marking up) of the concept annotation. In the examples provided, mentions which should be annotated are shown inside square brackets, e.g., [salbutamol].

✔Include:

  1. Concepts which fall under any of the four types of interest described above, regardless of how relevant they are to COPD.
    • ABG analysis revealed [respiratory failure] in 40 patients.
  2. Abbreviations. The full name and abbreviation should be annotated separately.
    • [Chronic obstructive pulmonary disease] ([COPD]) is a leading cause of death.
  3. Mentions subsumed by other names, but only if the subsuming name is of a different concept type and the subsumed name can be considered an independent token. Note how in the first example below, “COPD” (a medical condition) is annotated separately from “AE-COPD” (a sign or symptom). In the second example, however, “COPD” is not marked up, since it does not appear as an independent token and instead forms part of the word “AECOPD”. In the third example, “TB” (a medical condition) is not annotated separately, since it is subsumed by another medical condition, “pulmonary TB”.
    • Accessibility to health care contributes to morbidity due to [AE-[COPD]].
    • Most of the patients suffered from [AECOPD].
    • Further research into [pulmonary TB] is required.
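
The rule for subsumed mentions (item 3 above) can be sketched as a check on character offsets. The `Span` class and both function names are hypothetical, and the definition of "independent token" used here (no letter or digit directly adjacent to the name, so a hyphen counts as a separator) is an assumption made for illustration rather than a definition given by these guidelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    begin: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # concept type, e.g. "MedicalCondition"


def is_independent_token(text: str, begin: int, end: int) -> bool:
    # Assumption: a name is independent when it is not glued to a letter
    # or digit on either side.
    before_ok = begin == 0 or not text[begin - 1].isalnum()
    after_ok = end == len(text) or not text[end].isalnum()
    return before_ok and after_ok


def may_annotate_nested(text: str, outer: Span, inner: Span) -> bool:
    """True if `inner` may be annotated separately inside `outer`."""
    contained = outer.begin <= inner.begin and inner.end <= outer.end
    return (
        contained
        and inner.label != outer.label  # must be a different concept type
        and is_independent_token(text, inner.begin, inner.end)
    )
```

Under these assumptions, “COPD” inside “AE-COPD” passes the check, while “COPD” inside “AECOPD” (not an independent token) and “TB” inside “pulmonary TB” (same concept type) do not.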

✖Exclude:

  • Concepts which are too general.
    • Little has been reported on these [adhesion molecules].

Span

In this section, we describe the span (i.e., what exactly should be marked up in text). As above, we use square brackets to indicate annotation boundaries, e.g., [chronic obstructive pulmonary disease]. Those which should not be annotated are additionally crossed through, e.g., [compounds].

In general, only the minimal span of text containing the phenotypic mention should be annotated and nothing else.

✔Include:

  1. Modifiers, but only if they are part of the name, e.g., “interstitial” in the example below.
    • Patients with [interstitial lung disease] were excluded from the study.
  2. Head words, but only if they are part of the name, e.g., “disease” in the example below.
    • Some of the patients were suffering from [GOLD stage IV disease].

✖Exclude:

  1. Modifiers which are not part of the name.
    • [Treated] [pulmonary TB] is a cause of [COPD]. (Here, “Treated” is not annotated.)
  2. Head words which are not part of the name.
    • This result has been previously associated with the presence of the [alpha1 antitrypsin] [protein]. (Here, “protein” is not annotated.)
  3. Characters that appear in the same token as the name but are not part of it.
    • One of the consequences is [theophylline][-induced] [[adenosine] antagonism]. (Here, “-induced” is not annotated.)

Annotation of Relations Pertaining to COPD

For the second phase of the annotation task, relationships involving COPD will be annotated. At this point, we already have concept annotations (from the previous phase) at hand, and thus the focus now is on forming links (i.e., relations) between mentions of COPD and related concepts. A relation annotation has two elements, which appear as slots called mention1 and mention2 in Argo. In our case, one of these two elements should point to a COPD mention, e.g., mention1 = “COPD” and mention2 = “theophylline”. In the documents that you will work on, all instances of “COPD” and “chronic obstructive pulmonary disease” have already been assigned a COPD label, which should make the task a bit easier.
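
The two-slot structure of a relation annotation can be pictured with the minimal sketch below. The class and function names are hypothetical, the short label "COPD" stands in for the full annotation type used in Argo, and requiring exactly one COPD slot is an assumption made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    label: str  # simplified label, e.g. "COPD", "Drug", "Protein"


@dataclass(frozen=True)
class Relation:
    mention1: Mention
    mention2: Mention


def has_one_copd_slot(rel: Relation) -> bool:
    # Assumption: exactly one of the two slots points to a COPD mention.
    labels = (rel.mention1.label, rel.mention2.label)
    return labels.count("COPD") == 1
```

For instance, `Relation(Mention("COPD", "COPD"), Mention("theophylline", "Drug"))` satisfies the check, whereas a relation between two Drug mentions does not.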

In the non-text mining-assisted mode of relation annotation, you will have to create relation annotations yourself. In the text mining-assisted mode, relation annotations produced by text mining are provided to you. Your task thus involves: (a) removing unwanted relation annotations; (b) creating new ones to account for those missed by text mining; and (c) changing the value assigned to either of the mention1 or mention2 slots of an existing relation annotation. Detailed instructions on how to carry these out using Argo are here.

Relation types

Below are the types of relations that we are interested in for this task. In annotating any of these relation types, please bear in mind that you should link COPD with another concept only if the containing statement actually describes a relation between them. If, for example, based on your domain expertise, you know that COPD and another mention have been proven to be associated with each other, but the sentence was merely enumerating them, or they were mentioned in the same sentence just by coincidence, you should not create a relation annotation to link them up.

  • COPD-MedicalCondition. Link the COPD mention with another MedicalCondition if the MedicalCondition is implied to be a comorbidity of COPD, a complication of COPD, or a condition that complicates COPD.
  • COPD-SignOrSymptom. Link the COPD mention with a SignOrSymptom if the SignOrSymptom can be considered an indication or outcome of COPD, according to the statement.
  • COPD-Drug. Link the COPD mention with a Drug if the Drug is described as having an effect on COPD.
  • COPD-Protein. Link the COPD mention with a Protein if the Protein is described as playing a role in the underlying mechanisms of COPD.
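
As a rough illustration, the relation type that a pair of mentions could receive follows from the label of the non-COPD mention. The names below are hypothetical, and the function only proposes a candidate type: whether the sentence really describes the relation remains the annotator's judgement and cannot be derived from labels alone.

```python
from typing import Optional

# Hypothetical mapping: label of the non-COPD mention -> relation type.
RELATION_TYPES = {
    "MedicalCondition": "COPD-MedicalCondition",
    "SignOrSymptom": "COPD-SignOrSymptom",
    "Drug": "COPD-Drug",
    "Protein": "COPD-Protein",
}


def candidate_relation_type(label1: str, label2: str) -> Optional[str]:
    """Return the candidate relation type for two labels, or None."""
    if label1 == "COPD" and label2 != "COPD":
        other = label2
    elif label2 == "COPD" and label1 != "COPD":
        other = label1
    else:
        # Neither or both mentions are COPD: no relation of interest.
        return None
    return RELATION_TYPES.get(other)
```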

Frequently Asked Questions

  • I noticed that all COPD mentions are marked up as both COPD and MedicalCondition. Which of the two annotations should I include in the relation annotation I’m creating?
    • When creating relation annotations, please make use of the COPD annotation rather than the MedicalCondition one.
  • Does the order of mentions (that I include in a relation annotation) matter? For example, does mention1 always have to be the COPD mention?
    • No, the order does not matter. You can assign the COPD mention to either of the mention1 or mention2 slots.
  • I found a mention of COPD but it was not annotated as such. There is a related concept in the same sentence that it appears in. What should I do?
    • Please annotate the mention you found as a COPD mention. The process is the same as in Phase 1, only this time, you should assign uk.ac.nactem.uima.phenotypes.COPD as the annotation type/label. Once you’ve done this, you can proceed to the creation of a relation annotation.
  • Why does my data set include documents that do not contain any COPD mentions, if our focus is on annotating relations that involve COPD?
    • These were included in your data set anyway to account for the possibility that our automatic, string matching-based tool for marking up COPD mentions (“COPD”, “chronic obstructive pulmonary disease”) might have missed mentions that you think can be considered as COPD. Let’s say, based on your domain expertise, you consider “chronic bronchitis” as a COPD mention. Since this wouldn’t have been marked up by our tool as a COPD mention, you might want to annotate this as one (please refer to the previous FAQ item). Any documents that do not contain any COPD mentions at all can be simply skipped over.
  • I found a relation between COPD and another concept, but the other concept is in a different sentence from COPD. Should I annotate this relation?
    • No. For now, we will not include inter-sentential relations. To deal with those properly, we will need to be able to capture anaphora (e.g., the use of pronouns), but this is outside of our current scope.
  • I found a concept, e.g., a drug, which I know for a fact is related to COPD, but COPD itself is not mentioned in the text. Should I somehow annotate it?
    • No. We can annotate only relations linking text-bound concepts, i.e., concepts which are explicitly mentioned in the text. Also, even if both COPD and the drug of interest, for example, are mentioned in the text, you should check that the sentence containing them does describe a relationship between the two.
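
The restriction to relations within a single sentence, mentioned in the FAQ above, can be sketched as an offset check. This assumes that sentence boundaries are already available as (begin, end) character offsets, for example from a sentence splitter; the function names are hypothetical and not part of Argo.

```python
from typing import Optional, Sequence, Tuple

Offsets = Tuple[int, int]  # (begin inclusive, end exclusive)


def sentence_index(sentences: Sequence[Offsets], mention: Offsets) -> Optional[int]:
    """Return the index of the sentence fully containing the mention, if any."""
    begin, end = mention
    for i, (s_begin, s_end) in enumerate(sentences):
        if s_begin <= begin and end <= s_end:
            return i
    return None


def in_same_sentence(sentences: Sequence[Offsets], m1: Offsets, m2: Offsets) -> bool:
    # Inter-sentential pairs are out of scope, so both mentions must fall
    # inside the same sentence span.
    i = sentence_index(sentences, m1)
    j = sentence_index(sentences, m2)
    return i is not None and i == j
```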