Custom Model training using CRFSuite in R

By Fahad Usman

Introduction:

CRFsuite is an implementation of the Conditional Random Fields (CRF) algorithm.

This R package wraps the CRFsuite C/C++ library (https://github.com/chokkan/crfsuite), allowing the following:

  • Fit a Conditional Random Field model (1st-order linear-chain Markov)
  • Use the model to obtain predictions on new data
  • The focus of the implementation is in the area of Natural Language Processing, where this R package allows you to easily build and apply models for named entity recognition, text chunking, part-of-speech tagging, intent recognition, or classification of any category you have in mind.

For users unfamiliar with Conditional Random Field (CRF) models, you can read this excellent tutorial http://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf

“a small web application is included in the package to allow you to easily construct training data”

The system setup:

  1. You need to set up rJava.
  2. Within RStudio, install the developer version by: devtools::install_github("bnosac/crfsuite", build_vignettes = TRUE)
  3. Load the package by: library(crfsuite)

Now you need some text data. I built an R package called textSummary, which comes with some verbatim text data, so I am using that for annotation, POS tagging and then training the model. You can install and load the package from GitHub.
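If you want to follow along, a minimal install sketch looks like this (the GitHub path below is a placeholder, not the actual repository name; substitute the real location of textSummary):

# Hypothetical repository path - replace with the real location of textSummary
devtools::install_github("<github-user>/textSummary")
library(textSummary)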

The process:

  1. You need to have a doc_id column within your dataset.
  2. Save the dataset in rds format, because the Shiny app expects this format.
  3. Load the rds file within the Shiny app to annotate.
  4. Once you have the annotated file, do POS tagging, either using UDPipe or some other package like RDRPOSTagger, which you can install from CRAN.
  5. Merge the annotated and tokenised datasets.
  6. Bind the CRF attributes to the merged dataset.
  7. Train the model.
  8. Test the model using the predict method.

Let’s get started:

Step 1: Getting the data

library(textSummary)
library(tidyverse)

verbatims <- verbatim


As mentioned above, you need to have doc_id in the dataset which you can add by:

verbatims <- verbatims %>% mutate(doc_id = row_number()) %>% select(doc_id,text)

#save it as RDS to be loaded on to the shiny app for annotation
saveRDS(verbatims,"verbatim_to_annotate.rds")

#Now start the app by:
rmarkdown::run(system.file(package = "crfsuite", "app", "annotation.Rmd"))

The app interface will look like this:

(Screenshot: the annotation app interface.)

Here I uploaded the text data to annotate. Once the data is uploaded you will get a confirmation, as seen above. You can add your own categories using the Manage tagging categories tab. Then click on the Annotate tab to begin annotating. Once you are done annotating a particular verbatim/sentence/text part, click the Annotate another document button to move to the next document.

The good news is that you don’t have to annotate the whole dataset. If you close the app, it will save the file with the annotations you have already completed. You can see a preview of the annotations you have completed by clicking on the Show results tab:

Once you close the app, it will create a new file within your working directory called crfsuite_annotation_verbatim_to_annotate.rds. You can load this file by simply clicking on it within RStudio or by executing: crfsuite_annotation_verbatim_to_annotate <- readRDS("~/Documents/R Projects/crfsuite/crfsuite_annotation_verbatim_to_annotate.rds")
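Before moving on, it is worth checking what the annotated object contains, in particular that it has the start and end columns which the merge step later relies on:

# Inspect the structure of the annotated chunks produced by the app
str(crfsuite_annotation_verbatim_to_annotate)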

Now that you have your annotated file, we need to move to the next phase, where we need a tokenised version of our text file with POS (part-of-speech) tags.

I tried two methods: one using the RDRPOSTagger R package and the other using the UDPipe package. I had better success with the UDPipe package for tokenisation and POS tagging than with RDRPOSTagger. RDRPOSTagger doesn’t append the start and end columns, which are essential for the merge and for binding the CRF attributes, and therefore introduces duplicates and some weird errors and behaviours which I will discuss later. So let’s start with the UDPipe method to tokenise.

Why UDPipe?

This package comes with pre-trained models for POS tagging and sentence tokenisation. To build CRF models you need a model, either one trained by yourself or a pre-trained one supplied by the UDPipe community, that can do POS tagging, sentence breaking, etc. I am using an English model because the data is in English. You can download these models for other languages too.

Getting the pre-trained models:

Here is how you could download and load the model:

library(udpipe)
udmodel <- udpipe_download_model("english-partut")
udmodel <- udpipe_load_model(udmodel$file_model)

If you’ve already downloaded a model, then load it by specifying the model file name:

udmodel <- udpipe_load_model("english-partut-ud-2.0-170801.udpipe")

I tested three English models, named english, english-partut and english-lines. I liked using english-partut because it also provides lemmas when performing POS tagging and tokenisation.
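If you want to see the difference yourself, here is a hedged sketch comparing two of the pre-trained models on a single made-up sentence:

# Download and load two models, then compare their token/lemma/upos output
m_partut <- udpipe_load_model(udpipe_download_model("english-partut")$file_model)
m_lines  <- udpipe_load_model(udpipe_download_model("english-lines")$file_model)

as.data.frame(udpipe_annotate(m_partut, x = "The staff were very helpful"))[, c("token", "lemma", "upos")]
as.data.frame(udpipe_annotate(m_lines,  x = "The staff were very helpful"))[, c("token", "lemma", "upos")]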

However, I ended up using the following call on my verbatims dataset (remember, your verbatim/text dataset needs to have doc_id and text columns), because it tokenises with a model whose licence is suitable for commercial use, downloaded from https://github.com/bnosac/udpipe.models.ud:

verbatim_tokens <- udpipe(verbatims, "english", udpipe_model_repo = "bnosac/udpipe.models.ud")

The above line will automatically do the sentence breaking, POS tagging and lemmatisation. However, you can do the same using your own downloaded model:

verbatims <- unique(crfsuite_annotation_verbatim_to_annotate[, c("doc_id", "text")])

#you can use the above downloaded model to annotate i.e. POS, sentence breaking etc.
verbatim_tokens <- udpipe_annotate(udmodel,
                                   x = verbatims$text,
                                   doc_id = verbatims$doc_id)
class(verbatim_tokens)

verbatim_tokens <- as.data.frame(verbatim_tokens, detailed = TRUE)
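At this point it is worth eyeballing the tokenised output; a quick check of the columns we rely on later:

# Peek at the tokenised, POS-tagged data
head(verbatim_tokens[, c("doc_id", "sentence_id", "token", "lemma", "upos")])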

Now we need to do two more steps before we are ready to train our model.

  1. Merge the annotated and tokenised datasets
  2. Attach crf attributes

This is how you could achieve these two steps:

x <- merge(crfsuite_annotation_verbatim_to_annotate, verbatim_tokens)

x <- crf_cbind_attributes(x, terms = c("upos", "lemma"), by = "doc_id")
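crf_cbind_attributes adds the upos/lemma values of neighbouring tokens as extra feature columns; a quick way to see which attribute columns were generated:

# List the feature columns created by crf_cbind_attributes
grep("upos|lemma", colnames(x), value = TRUE)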

Now you are ready to build the model. By default, the CRF model is trained using L-BFGS with L1/L2 regularization, but other training methods are also available, namely SGD with L2 regularization, Averaged Perceptron, Passive Aggressive, and Adaptive Regularization of Weights (AROW).

In the below example we use the default parameters and decrease the iterations a bit to have a model ready quickly.

Provide the labels with the categories (y) and the attributes of the observations (x), and indicate what the sequence group is (in this case we take doc_id).

The model will be saved to your working directory as tagger.crfsuite


model <- crf(y = x$chunk_entity,
             x = x[, grep("upos|lemma", colnames(x), value = TRUE)],
             group = x$doc_id,
             method = "lbfgs",
             file = "tagger.crfsuite",
             options = list(max_iterations = 15))

stats <- summary(model)

Check the loss evolution by:

plot(stats$iterations$loss, pch = 20, type = "b", 
     main = "Loss evolution", xlab = "Iteration", ylab = "Loss")

Use the model

Now it’s time to test the model and get predictions for the named entities / chunks / categories you have trained.

This should be done on holdout data. Provide the model, your data with the same attribute columns that were used during training, and indicate the group the attributes belong to.
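The post does not show how the crf_test holdout set is built; here is a minimal sketch, assuming you hold out a share of the annotated documents in x (ideally you would do this split before training the model above):

# Hypothetical holdout split by document
set.seed(123)
test_ids <- sample(unique(x$doc_id), size = ceiling(0.2 * length(unique(x$doc_id))))
crf_test <- subset(x, doc_id %in% test_ids)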

scores <- predict(model, 
                  newdata = crf_test[, grep("upos|lemma", colnames(crf_test), value = TRUE)], 
                  group = crf_test$doc_id)

#get the labels by the model
crf_test$entity <- scores$label
#Build the confusion matrix:
table(crf_test$entity, crf_test$chunk_entity)
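If you want a single headline number out of that comparison, token-level accuracy is simply the share of tokens where the predicted and annotated labels agree (a quick sketch, not from the original post):

# Overall token-level accuracy on the holdout data
mean(crf_test$entity == crf_test$chunk_entity)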

That’s it, you are done! You should play with the other tuning arguments when training the model to see if they improve its accuracy.
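For example, the crf_options() helper in the package lists the tunable hyperparameters of each training method; below is a hedged sketch of a re-fit with more iterations and explicit L1/L2 penalties (the parameter values are illustrative only):

# List the hyperparameters available for the L-BFGS trainer
crf_options("lbfgs")

# Illustrative re-fit with more iterations and explicit regularisation strengths
model2 <- crf(y = x$chunk_entity,
              x = x[, grep("upos|lemma", colnames(x), value = TRUE)],
              group = x$doc_id,
              method = "lbfgs",
              options = list(max_iterations = 50, c1 = 0, c2 = 1))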

Using the RDRPOSTagger R package to tokenise and POS tag:

The purpose of this method was to test whether crfsuite can handle POS output from other R packages. I ran into all sorts of errors and weird outputs, which pointed to this line:

x <- merge(crfsuite_annotation_verbatim_to_annotate, verbatim_tokens) assumes that you have the fields start and end in verbatim_tokens. You need to use verbatim_tokens <- as.data.frame(verbatim_tokens, detailed = TRUE) to get these, or just use udpipe(verbatims, udmodel).

Install and load the package using:

devtools::install_github("bnosac/RDRPOSTagger", build_vignettes = TRUE)

library(RDRPOSTagger)

Find out available languages and models:

rdr_available_models()

Define the language and the type of tagger:

tagger <- RDRPOSTagger::rdr_model(language = "English",annotation = "UniversalPOS")

POS tag and tokenise the text by:

rdr_tagging <- RDRPOSTagger::rdr_pos(object = tagger, x = verbatims$text)

Note: I am using the same unique doc_id with text as above.

The output prefixes the doc_id with “d”, so we need to strip it:

rdr_tagging <- rdr_tagging %>% mutate(doc_id = as.integer(readr::parse_number(doc_id))) %>% rename(upos = pos) 

Here is the output:

Here is the comparison between UDPipe and RDRPOSTagger:

(Screenshot: comparison of the UDPipe and RDRPOSTagger output.)

I liked the RDRPOSTagger package more than UDPipe because it detected numbers better, i.e. UDPipe missed spelled-out numbers such as “one” in the number category, whereas this package was able to extract them. Here is where the trouble started: when I tried to merge the two datasets, i.e. the annotated one we got using the Shiny app and the POS dataset we just created, I got the following error:
Error in merge.chunkrange(crfsuite_annotation_verbatim_to_annotate, rdr_tagging) : 
  all(c(by.y, "start", "end") %in% colnames(y)) is not TRUE
But when I look at the verbatim_tokens variable that was created using the UDPipe model, it doesn’t have those columns either. Therefore I just flipped the arguments, as the annotated rds dataset does have those columns. The error disappears with the positions swapped:
x <- merge(rdr_tagging, crfsuite_annotation_verbatim_to_annotate)
Remember we don’t have lemmas here, so the CRF attribute binding had to be adjusted slightly:
x <- crf_cbind_attributes(x, terms = c("upos"), by = "doc_id")
I noticed that when creating the x variable for UDPipe using the crf_cbind_attributes method, it creates a total of 68 columns, whereas here it created just 36. The merge also introduces duplicates, as shown:
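A quick way to spot the duplicated rows (a small check, assuming doc_id and token_id together should be unique):

# Show token rows that appear more than once after the merge
x %>% count(doc_id, token_id) %>% filter(n > 1)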

I de-duplicate using:

x <- x %>% distinct(doc_id,token_id,.keep_all = TRUE)


Now build the model:

model <- crf(y = x$chunk_entity,
             x = x[, grep("upos|lemma", colnames(x), value = TRUE)],
             group = x$doc_id,
             method = "lbfgs",
             options = list(max_iterations = 15))

stats <- summary(model)

This is the comparison between the two models:

The biggest difference I saw between the two methods was in the chunk_entity column: the first method automatically prefixes the annotated categories with B, I and O (the usual BIO chunk encoding). For example, when I created the x variable for the first method, it created chunk_entity values like this:
But doing the same for the RDRPOSTagger POS output, it created chunk_entity values like this:

So for now I will stick with the first method, using the UDPipe models for tokenisation and POS tagging prior to merging and binding the CRF attributes.

Hope this helps!
