README.bioc

## A BioC XML to A BioC XML implementation of MetaMapLite

The class gov.nih.nlm.nls.metamap.lite.BioCProcess allows MetaMapLite
to process BioC XML input and write the results in BioC XML.

### Using from Java

Below is an example using BioC Processing with MetaMapLite

    // Initialize MetaMapLite
    Properties defaultConfiguration = getDefaultConfiguration();
	String configPropertiesFilename =
		System.getProperty("metamaplite.propertyfile",
		                   "config/metamaplite.properties");
    Properties configProperties = new Properties();
    // set any in-line properties here.
    FileReader fr = new FileReader(configPropertiesFilename);
    configProperties.load(fr);
    fr.close();

    configProperties.setProperty("metamaplite.semanticgroup",
    "acab,anab,bact,cgab,dsyn,emod,inpo,mobd,neop,patf,sosy");
    Properties properties =
        Configuration.mergeConfiguration(configProperties,
					   defaultConfiguration);
    BioCProcess process = new BioCProcess(properties);

    // read BioC XML collection
    Reader inputReader = new FileReader(inputFile);
    BioCFactory bioCFactory = BioCFactory.newFactory("STANDARD");
    BioCCollectionReader collectionReader =
    bioCFactory.createBioCCollectionReader(inputReader);
    BioCCollection collection = collectionReader.readCollection();

    // Run named entity recognition on collection
    BioCCollection newCollection = process.processCollection(collection);

    // write out the annotated collection
    File outputFile = new File(outputFilename);
    Writer outputWriter = new PrintWriter(outputFile, "UTF-8");
    BioCCollectionWriter collectionWriter = bioCFactory.createBioCCollectionWriter(outputWriter);
    collectionWriter.writeCollection(newCollection);
    outputWriter.close();

This process attempts to preserve any annotation present in the
original BioC XML input.    Some annotations applied to BioC
structures before entity lookup, including `tokenization and
part-of-speech-tagging, are discarded before the writing out the final
annotated collection.   This occurs in the entity lookup class
BioCEntityLookup (java package: gov.nih.nlm.nls.metamap.lite).

### Omissions in BioC version

The current version of the BioCProcess class does not call the
abbreviation detector or the negation detector.   This should be
relatively simple to add (particularly the abbreviation detector) and
will probably be added in the next release.