OpenNLP Custom Model Training

An easy-to-follow tutorial on building a custom named entity recognition (NER) model with Apache OpenNLP.

By Fahad Usman

You can read this to get started with OpenNLP, but here is a short introduction to what you need to train custom models:

OpenNLP's name finder learns from the surrounding context (tokens) within a sentence; it is not a simple lookup mechanism. OpenNLP uses maximum entropy, which is a form of multinomial logistic regression, to build its model. The aim is to reduce “word sense ambiguity” and find entities in a given context.

For instance, if a person is named April or May, even a human could be confused about whether the word is a person’s name or a month when no context is provided. Additionally, “may” can also be a verb.
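To make the idea concrete, here is a minimal sketch of how a maximum entropy model could turn context features into probabilities for the word "May". The feature names and weights are made up for illustration (a real model learns them from annotated sentences), and this is not OpenNLP's internal code.

```java
import java.util.List;
import java.util.Map;

class MaxentSketch {

    static final String[] OUTCOMES = {"person", "month", "other"};

    // Hypothetical weights: one value per outcome, in the order of OUTCOMES.
    static final Map<String, double[]> WEIGHTS = Map.of(
            "prev=Mrs.", new double[] {2.0, -1.0, -0.5},
            "prev=in", new double[] {-0.5, 2.0, 0.0},
            "word=may", new double[] {0.5, 0.5, 0.5},
            "capitalised", new double[] {0.8, 0.8, -0.8});

    // Sum the weights of the active features per outcome, then normalise with softmax.
    static double[] probabilities(List<String> activeFeatures) {
        double[] scores = new double[OUTCOMES.length];
        for (String feature : activeFeatures) {
            double[] w = WEIGHTS.get(feature);
            if (w == null) continue; // unknown features contribute nothing
            for (int i = 0; i < scores.length; i++) scores[i] += w[i];
        }
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) max = Math.max(max, s);
        double sum = 0;
        double[] probs = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            probs[i] = Math.exp(scores[i] - max); // subtract max for numerical stability
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }

    static void print(double[] p) {
        for (int i = 0; i < p.length; i++) System.out.printf("%s=%.3f ", OUTCOMES[i], p[i]);
        System.out.println();
    }

    public static void main(String[] args) {
        print(probabilities(List.of("prev=Mrs.", "word=may", "capitalised"))); // "Mrs. May ..."
        print(probabilities(List.of("prev=in", "word=may", "capitalised")));   // "... in May ..."
    }
}
```

The same token gets a different most likely label depending on the previous word, which is exactly what a plain name list cannot do.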

Now, you could try to create a model from a list of well-known human names alone, but if you don’t provide context the model will train poorly or not at all. You could use an OpenNLP add-on called the “modelbuilder addon”, which is designed for this: you give it a file of names, and it uses the names and some of your data (sentences) to train a model. However, if you are looking for particular names of generally unambiguous entities, you may be better off using a list and something like a regex to find the names rather than NER.
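For the list-and-regex route, a minimal sketch could look like the following. The class name, method names and the sample names are illustrative assumptions, and the pattern only does exact, case-sensitive matching on word boundaries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class NameListMatcher {

    // Build one alternation from the list; \b stops "Ali" from matching inside "Alison".
    static Pattern compile(List<String> names) {
        String alternation = names.stream()
                .map(Pattern::quote) // treat each name as literal text
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    // Return every listed name found in the text, in order of appearance.
    static List<String> find(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) hits.add(m.group());
        return hits;
    }

    public static void main(String[] args) {
        Pattern p = compile(List.of("Ali", "Noor", "Ching")); // illustrative list
        System.out.println(find(p, "Alison met Ali and Noor."));
    }
}
```

This works only because the names are assumed to be unambiguous; it would happily tag "May" in "May I come in?" if May were on the list.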

“Training a named entity model without context will perform very poorly”

In order to create a custom model, you need:

  1. A training file containing annotated spans for the entity type, one sentence per line, in this format:
     <START:person> Ali <END>
  2. The application must open a sample data stream
  3. Call the NameFinderME.train method
  4. Save the TokenNameFinderModel to a file for future use
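As a rough illustration of step 1, the sketch below builds one training line by wrapping a name in the `<START:type>` and `<END>` markers that OpenNLP's name finder training format uses, with tokens separated by single spaces. The helper name `annotate` and the sample sentence are assumptions made for this example.

```java
class TrainingLineBuilder {

    // Wrap tokens[start, end) in <START:type> ... <END>; end is exclusive.
    static String annotate(String[] tokens, int start, int end, String type) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) sb.append(' ');
            if (i == start) sb.append("<START:").append(type).append("> ");
            sb.append(tokens[i]);
            if (i == end - 1) sb.append(" <END>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Mr.", "Noor", "is", "my", "friend", "."};
        // Marks the single token "Noor" as a person
        System.out.println(annotate(tokens, 1, 2, "person"));
    }
}
```

Each annotated sentence goes on its own line of the training file, so the surrounding words give the model the context it needs.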

So here is the code to train your own model:

package openNLP;

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class CustomClassifierTrainer {
//Where to save the model once trained	
  static String onlpModelPath = "/Users/user_name/eclipse-workspace/openNLP/OpenNLP_models/en-ner-asiannames.bin";
    // location of the training data set
    static String trainingDataFilePath = "/Users/user_name/eclipse-workspace/openNLP/trainingData/asiannames.txt";
 
	public static void main(String[] args) throws IOException {
		
	    Charset charset = Charset.forName("UTF-8");
	    		
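	    // Read the training file line by line; the lambda acts as the InputStreamFactory
	    ObjectStream<String> lineStream =
	            new PlainTextByLineStream(() -> new FileInputStream(trainingDataFilePath), charset);
	    
	    // Parse each annotated line into a NameSample
	    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);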

	    TokenNameFinderModel model;
	    TokenNameFinderFactory nameFinderFactory = new TokenNameFinderFactory();

	    try {
	      model = NameFinderME.train("en", 
                  "asian.person", 
                  sampleStream, 
                  TrainingParameters.defaultParams(),
                  nameFinderFactory);
	    }
	    finally {
	      sampleStream.close();
	    }
//Saving the model
	    BufferedOutputStream modelOut = null;
		try {
	      modelOut = new BufferedOutputStream(new FileOutputStream(onlpModelPath));
	      model.serialize(modelOut);
	    } finally {
	      if (modelOut != null) 
	         modelOut.close();      
	    }
	}
}

You need to be mindful of 2 things in the above code:

  1. You need to pass an InputStreamFactory as the first argument of the PlainTextByLineStream constructor. Look at the javadoc of this interface, and you’ll see that it has a single method, createInputStream(), which takes no arguments and is supposed to create and return an InputStream. A valid value would thus be
    () -> new FileInputStream(trainingDataFilePath)

    i.e. a lambda which takes no input and creates a new input stream, and can thus be inferred to be an InputStreamFactory.

  2. You’re not supposed to specify the types of the arguments when calling a method, only when defining one. So
    NameFinderME.train("en", 
                       "asian.person", 
                       sampleStream, 
                       TrainingParameters.defaultParams(),
                       TokenNameFinderFactory nameFinderFactory);

    should be

    NameFinderME.train("en", 
                       "asian.person", 
                       sampleStream, 
                       TrainingParameters.defaultParams(),
                       nameFinderFactory);

Therefore, you need to declare and instantiate nameFinderFactory before calling the NameFinderME.train() method.

Now it’s time to test your custom model. Here is how you could do that:


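import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

// to test the builtin name finder model
public class nameFinder {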
	
	 public void findName(String paragraph) throws IOException {
	        TokenNameFinderModel model;
	        // try-with-resources closes the stream; a failed load propagates instead of leaving model null
	        try (InputStream inputStream = new FileInputStream("/Users/Fahad/eclipse-workspace/openNLP/OpenNLP_models/en-ner-person.bin")) {
	            model = new TokenNameFinderModel(inputStream);
	        }
	        NameFinderME nameFinder = new NameFinderME(model);
	        String[] tokens = tokenise(paragraph);

	        Span nameSpans[] = nameFinder.find(tokens);
	        for(Span s: nameSpans) {
	            System.out.println(tokens[s.getStart()]);
	            System.out.println(s.getType()+" : "+tokens[s.getStart()]+"\t [probability="+s.getProb()+"]");
	        }
	    }
// to test our newly built custom model	 
	 public void asianFindName(String paragraph) throws IOException {
	        TokenNameFinderModel model;
	        // try-with-resources closes the stream; a failed load propagates instead of leaving model null
	        try (InputStream inputStream = new FileInputStream("/Users/Fahad/eclipse-workspace/openNLP/OpenNLP_models/en-ner-asiannames.bin")) {
	            model = new TokenNameFinderModel(inputStream);
	        }
	        NameFinderME nameFinder = new NameFinderME(model);
	        String[] tokens = tokenise(paragraph);

	        Span nameSpans[] = nameFinder.find(tokens);
	        for(Span s: nameSpans) {
	            System.out.println(tokens[s.getStart()]);
	            System.out.println(s.getType()+" : "+tokens[s.getStart()]+"\t [probability="+s.getProb()+"]");
	        }
	    }
// both methods above need to tokenise the sentence first before extracting NER
	    public String[] tokenise(String sentence) throws IOException{
	        // try-with-resources closes the tokenizer model stream once the model is loaded
	        try (InputStream inputStreamTokenizer = new FileInputStream("/Users/Fahad/eclipse-workspace/openNLP/OpenNLP_models/en-token.bin")) {
	            TokenizerModel tokenModel = new TokenizerModel(inputStreamTokenizer);
	            TokenizerME tokenizer = new TokenizerME(tokenModel);
	            return tokenizer.tokenize(sentence);
	        }
	    }

public static void main(String[] args) throws Exception {
	
	nameFinder namefinder = new nameFinder();
	namefinder.findName("Where is Charlie and Mike.");
	namefinder.findName("Fraser is my son.");
	namefinder.findName("I love Dominoes.");
	namefinder.findName("I love Seb.");
	
	namefinder.asianFindName("Salah is not my relative.");
	namefinder.asianFindName("Salah is not my relative, will be going school soon.");
	namefinder.asianFindName("I love Dominoes.");
	namefinder.asianFindName("I love Mr. Noor.");
	namefinder.asianFindName("Mr. Ching is my son.");
	namefinder.asianFindName("Where is Charlie and Mike.");
	}

}
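One thing to note about the loops above: they print only tokens[s.getStart()], i.e. the first token of each span, so a two-word name would be cut short. Assuming the span end index is exclusive (as it is in OpenNLP's Span), a small helper along these lines could recover the full text. The class and method names are made up for this sketch, and it takes plain int indexes so it runs without the OpenNLP jar.

```java
import java.util.Arrays;

class SpanText {

    // Join tokens[start, end) with single spaces; end is exclusive.
    static String cover(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] tokens = {"I", "love", "Mr.", "Noor", "."};
        // A span covering tokens 2 and 3
        System.out.println(cover(tokens, 2, 4));
    }
}
```

In the test class you would call it as cover(tokens, s.getStart(), s.getEnd()). OpenNLP's Span class also appears to offer a static spansToStrings helper for the same purpose; check the javadoc of your version before relying on it.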

The default TrainingParameters are defined as:

public static TrainingParameters defaultParams() {
    TrainingParameters mlParams = new TrainingParameters();
    mlParams.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
    mlParams.put(TrainingParameters.TRAINER_TYPE_PARAM, EventTrainer.EVENT_VALUE);
    mlParams.put(TrainingParameters.ITERATIONS_PARAM, 100);
    mlParams.put(TrainingParameters.CUTOFF_PARAM, 5); 

    return mlParams;
  }

You can also override the defaults by building your own TrainingParameters, for example:

TrainingParameters parameters = new TrainingParameters();
parameters.put(TrainingParameters.ITERATIONS_PARAM, 100);
parameters.put(TrainingParameters.CUTOFF_PARAM, 1);
parameters.put(TrainingParameters.ALGORITHM_PARAM, "NAIVEBAYES");
TokenNameFinderFactory tnff = new TokenNameFinderFactory();
// modelName is the entity type, e.g. "asian.person" in the trainer above
model = NameFinderME.train("en", modelName, sampleStream, parameters, tnff);

With this setup, the custom model extracts Noor as an asian.person but skips the other names. This is because the training data set is very small. You should aim for at least 15,000 sentences so the model can learn the names and their context.
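If you want a quick sanity check before training, a sketch like this counts the sentences in a training file, assuming one sentence per non-blank line. The class name and the command-line usage are assumptions for illustration; the 15,000 figure is the one quoted above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

class TrainingDataCheck {

    static final long RECOMMENDED_SENTENCES = 15_000; // figure quoted in the text

    // Count non-blank lines, treating each one as a training sentence.
    static long countSentences(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.filter(line -> !line.trim().isEmpty()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java TrainingDataCheck <training-file>");
            return;
        }
        long n = countSentences(Paths.get(args[0]));
        System.out.println(n + " sentences"
                + (n < RECOMMENDED_SENTENCES ? " (below the recommended 15000)" : ""));
    }
}
```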

You can also clone the eclipse project here.

Hope this helps!
