BioC Format

Argo supports the BioC format through the BioC type system and two processing components, BioC Reader and BioC Writer. The Reader deserialises BioC collections into the BioC type system, and the Writer serialises them back into BioC files.

About BioC

The BioC format is encoded in XML and consists of a collection of documents, each split into passages and optionally sentences. These elements may contain stand-off annotations with optional text-bound locations as well as n-ary relations between annotations and other relations. Virtually all elements may declare a list of key-value pairs for storing arbitrary data.
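To make the structure described above concrete, here is a minimal sketch that reads a small BioC collection with Python's standard library. The sample collection is hand-written for illustration and does not come from any of the corpora on this page. The element and attribute names (collection, document, passage, annotation, location, relation, node, infon) follow the BioC DTD as we understand it. The helper names and the "type" infon key are assumptions made for this example; BioC itself does not prescribe infon keys.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written BioC collection (illustrative content only).
SAMPLE = """<collection>
  <source>example</source>
  <date>2013-01-01</date>
  <key>example.key</key>
  <document>
    <id>doc1</id>
    <passage>
      <infon key="type">abstract</infon>
      <offset>0</offset>
      <text>p53 binds MDM2.</text>
      <annotation id="T1">
        <infon key="type">Protein</infon>
        <location offset="0" length="3"/>
        <text>p53</text>
      </annotation>
      <annotation id="T2">
        <infon key="type">Protein</infon>
        <location offset="10" length="4"/>
        <text>MDM2</text>
      </annotation>
      <relation id="R1">
        <infon key="type">Binding</infon>
        <node refid="T1" role="Theme"/>
        <node refid="T2" role="Theme"/>
      </relation>
    </passage>
  </document>
</collection>"""


def infons(element):
    """Return the key-value pairs (infons) declared directly on an element."""
    return {i.get("key"): (i.text or "") for i in element.findall("infon")}


def read_annotations(xml_text):
    """List (document id, annotation id, type, covered text) for passage-level annotations."""
    root = ET.fromstring(xml_text)
    found = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.findall("passage"):
            passage_offset = int(passage.findtext("offset", default="0"))
            passage_text = passage.findtext("text", default="")
            for annotation in passage.findall("annotation"):
                for location in annotation.findall("location"):
                    # Location offsets are document-level, so shift by the passage offset.
                    start = int(location.get("offset")) - passage_offset
                    end = start + int(location.get("length"))
                    found.append((doc_id, annotation.get("id"),
                                  infons(annotation).get("type"),
                                  passage_text[start:end]))
    return found


def read_relations(xml_text):
    """List (relation id, type, [(refid, role), ...]) for every relation in the collection."""
    root = ET.fromstring(xml_text)
    return [(relation.get("id"), infons(relation).get("type"),
             [(node.get("refid"), node.get("role")) for node in relation.findall("node")])
            for relation in root.iter("relation")]


if __name__ == "__main__":
    for row in read_annotations(SAMPLE):
        print(row)
    for row in read_relations(SAMPLE):
        print(row)
```

The sketch handles only annotations attached directly to passages. Annotations nested inside sentences would need the same treatment one level down.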

The format is actively promoted by the BioCreative Interoperability Initiative whose aim is to enhance the reusability of tools and resources.

Resources

The following files are BioC-encoded corpora used in the BioNLP Shared Task series.

Corpus  Training set  Development set  Entities  Events  Equivalent entities  Modifications  Coreferences
GE’11   908           259              Yes       Yes     Yes                  Yes            No
EPI     600           200              Yes       Yes     Yes                  Yes            No
ID      152           46               Yes       Yes     Yes                  Yes            No
GE’13   222           249              Yes       Yes     Yes                  Yes            Yes
CG      300           100              Yes       Yes     Yes                  Yes            No
PC      260           90               Yes       Yes     Yes                  Yes            No

We also provide the BioC-encoded version of NaCTeM’s Metabolites corpus.

Workflows

Two of our BioC-compliant modules are realised as workflows in Argo. They are named BioC Event Extraction and BioC Metabolic processes. Each of them includes the BioC Reader and Writer components that allow users to upload their BioC files for processing as well as retrieve the results in the same format.

Before running the workflows, please consult the tutorials page on how to perform the following in Argo:

  • set up component parameters,
  • upload and download documents,
  • run workflows, and
  • track the progress of processing workflows.

Follow the steps below for running either of the workflows.

Step 1. Sign in

If you have not done so yet, create an account in Argo and sign in to it. Although it is possible to use Argo without registering, any user-created data (workflows and documents) will be automatically deleted at the end of a visit.

Tip: Registering an account in Argo requires a valid email address. A verification email is sent immediately after the account is created. If you do not receive it within a few minutes, please check your spam/junk folders.

Step 2. Upload BioC documents

Upload the BioC XML files that you wish to be processed by the workflow.
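Before uploading, it can save time to confirm locally that a file is well-formed XML and has the basic shape of a BioC collection. The sketch below is one possible check using only Python's standard library. The function names are hypothetical and the checks are our own minimal choice; this is not the validation that Argo or the BioC Reader performs, and it does not validate against the BioC DTD.

```python
import xml.etree.ElementTree as ET


def check_bioc_text(xml_data):
    """Return a list of problems; an empty list means the basic checks passed.

    Accepts the XML as str or bytes. These are illustrative sanity checks,
    not a full DTD validation.
    """
    try:
        root = ET.fromstring(xml_data)
    except ET.ParseError as error:
        return [f"not well-formed XML: {error}"]

    problems = []
    if root.tag != "collection":
        problems.append(f"root element is <{root.tag}>, expected <collection>")
    documents = root.findall("document")
    if not documents:
        problems.append("no <document> elements found")
    for index, document in enumerate(documents):
        if document.find("id") is None:
            problems.append(f"document #{index} has no <id>")
        if not document.findall("passage"):
            problems.append(f"document #{index} has no <passage>")
    return problems


def check_bioc_file(path):
    """Run the same checks on a file, reading bytes so the XML declaration decides the encoding."""
    with open(path, "rb") as handle:
        return check_bioc_text(handle.read())
```

For example, check_bioc_file("my_corpus.xml") (a hypothetical file name) returns an empty list when the file passes these checks, and a list of human-readable messages otherwise.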

Step 3. Create copies of the public workflow

Each of the workflows is publicly available as read-only. To change a workflow's settings (for example, to specify an input BioC file), you have to make your own copy of the public workflow: open the workflow for editing and save it as a copy. It is advisable to give the copy a distinctive name.

Step 4. Change the workflow’s settings

Configure the BioC Reader and BioC Writer components of the workflow by specifying the input and output files, respectively. If you are running the BioC Event Extraction workflow, also configure the EventMine component by choosing the appropriate task-specific model (e.g., Cancer Genetics 2013).

Step 5. Run the workflow

Run the workflow. Processing a large BioC collection might take a while. Once the process's status is shown as Finished, the output BioC file should be ready for download.
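Once the output file has been downloaded, a quick way to inspect it is to count the annotations and relations it contains by type. The following is a minimal sketch, assuming that the type of each annotation or relation is stored in an infon with the key "type". That key is an assumption made for illustration; the keys actually written by the workflows may differ, so pass a different type_key if needed. The function name is hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_types(xml_data, type_key="type"):
    """Count annotations and relations by (element name, type label).

    The label is read from the infon whose key equals type_key (assumed to
    be "type" here); elements without such an infon are counted as "(untyped)".
    """
    root = ET.fromstring(xml_data)
    counts = Counter()
    for element in root.iter():
        if element.tag not in ("annotation", "relation"):
            continue
        label = "(untyped)"
        for infon in element.findall("infon"):
            if infon.get("key") == type_key and infon.text:
                label = infon.text
                break
        counts[(element.tag, label)] += 1
    return counts


if __name__ == "__main__":
    # Hand-written example input, not real workflow output.
    example = """<collection><document><id>d1</id><passage><offset>0</offset>
      <annotation id="T1"><infon key="type">Protein</infon></annotation>
      <annotation id="T2"><infon key="type">Protein</infon></annotation>
      <annotation id="T3"/>
      <relation id="R1"><infon key="type">Binding</infon></relation>
    </passage></document></collection>"""
    for (kind, label), number in sorted(count_types(example).items()):
        print(kind, label, number)
```

Because the function walks the whole tree, it also counts annotations and relations nested inside sentences or attached at document level.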

Web services

We also developed BioC-compliant web services for recognising concepts in the Comparative Toxicogenomics Database (CTD). They are accessible at the following locations:

They can be tested using the facility provided by organisers of the BioCreative IV CTD track.