

By Fahad Usman
Introduction:
CRFsuite is an implementation of the Conditional Random Fields (CRF) algorithm.
This R package wraps the CRFsuite C/C++ library (https://github.com/chokkan/crfsuite), allowing the following:
- Fit a Conditional Random Field model (1st-order linear-chain Markov)
- Use the model to get predictions on new data
- The focus of the implementation is on Natural Language Processing, where this R package allows you to easily build and apply models for named entity recognition, text chunking, part-of-speech tagging, intent recognition, or classification of any category you have in mind.
For users unfamiliar with Conditional Random Field (CRF) models, this excellent tutorial is a good starting point: http://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf
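In the notation of that tutorial, a first-order linear-chain CRF models the probability of a label sequence y given an observation sequence x as:

p(y | x) = \frac{1}{Z(x)} \exp\left( \sum_{t} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right)

where the f_k are feature functions over neighbouring labels and the observations, the \lambda_k are the weights learned during training, and Z(x) normalises over all possible label sequences.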
A small web application is included in the package to allow you to easily construct training data.
The system setup:
- You need to set up rJava (a quick sanity check is sketched after this list).
- Within RStudio, install the developer version by:
devtools::install_github("bnosac/crfsuite", build_vignettes = TRUE)
- Load the package by:
library(crfsuite)
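A minimal sanity check for the rJava setup (assuming Java itself is already installed on your machine) could look like this:
# install rJava from CRAN if you don't have it yet
install.packages("rJava")
library(rJava)
# initialise the JVM; an error here means Java/rJava still needs configuring
.jinit()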
Now you need some text data. I built an R package called textSummary, which comes with some verbatim text, so I am using that for annotation, POS tagging and then training the model. You can install and load the package from GitHub, for example as sketched below.
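A sketch of that install (the GitHub repository path here is an assumption; adjust it if the package lives elsewhere):
# install the textSummary package from GitHub (repository path assumed)
devtools::install_github("fahadshery/textSummary")
library(textSummary)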
The process:
- You need to have a doc_id column within your dataset.
- Save the dataset in rds format, because the Shiny app expects this format.
- Load the rds file within the Shiny app to annotate.
- Once you have the annotated file, do POS tagging, either using UDPipe or some other package like RDRPOSTagger, which you can install from CRAN.
- Merge the annotated and tokenised datasets.
- Bind the crf attributes to the merged dataset.
- Train the model.
- Test the model using the predict method.
Let’s get started:
Step 1: Getting the data
library(textSummary)
library(tidyverse)
verbatims <- verbatim
As mentioned above, you need to have a doc_id column in the dataset, which you can add by:
verbatims <- verbatims %>% mutate(doc_id = row_number()) %>% select(doc_id,text)
#save it as RDS to be loaded on to the shiny app for annotation
saveRDS(verbatims,"verbatim_to_annotate.rds")
#Now start the app by:
rmarkdown::run(system.file(package = "crfsuite", "app", "annotation.Rmd"))
The app interface will look like this:

Here I uploaded the text data to annotate. Once the data is uploaded you will get a confirmation as seen above. You can add your own categories using the Manage tagging categories tab. Then click on the Annotate tab to begin annotation. Once you are done with annotating a particular verbatim/sentence/text part, click the Annotate another document button to move to the next document.

Annotate the whole dataset this way. If you close the app, it will save the file with the annotations you have already completed. You can see a preview of the annotations you have completed by clicking on the Show results tab:
Once you close the app, it will create a new file within your working directory called: crfsuite_annotation_verbatim_to_annotate.rds
You can load this file by simply clicking on the file within RStudio or by executing the command: crfsuite_annotation_verbatim_to_annotate <- readRDS("~/Documents/R Projects/crfsuite/crfsuite_annotation_verbatim_to_annotate.rds")
Now that you have your annotated file, we need to move to the next phase, where we need a tokenised version of our text file with POS (parts of speech) tags.
I tried two methods: one using the RDRPOSTagger R package and the other using the UDPipe package. I had better success with the UDPipe package for tokenisation and POS tagging than with RDRPOSTagger. RDRPOSTagger doesn't append the start and end columns, which are essential for the merge and for binding the crf attributes, and it therefore introduces duplicates and some weird errors and behaviours which I will discuss later. So let's now start with the UDPipe method to tokenise.
Why UDPipe?
This package comes with pre-trained models for POS tagging and sentence tokenisation. To build crf models you need a model that can do POS tagging, sentence breaking etc., either trained by yourself or one of the pre-trained models supplied by the UDPipe community. I am using an English model because the data is in English. You can download models for other languages too.
Getting the pre-trained models:
Here is how you could download and load the model:
udmodel <- udpipe_download_model("english-partut")
udmodel <- udpipe_load_model(udmodel$file_model)
If you've already downloaded a model, load it by specifying the model file name:
udmodel <- udpipe_load_model("english-partut-ud-2.0-170801.udpipe")
I tested 3 English models, named english, english-partut and english-lines. I liked using english-partut because it also provides lemmas when performing POS tagging and tokenisation.
However, I ended up using the following model on my verbatims dataset (remember, your verbatim/text dataset needs to have doc_id and text columns), because it tokenises with a model whose licence is fine for commercial use, downloaded from https://github.com/bnosac/udpipe.models.ud:
verbatim_tokens <- udpipe(verbatims, "english", udpipe_model_repo = "bnosac/udpipe.models.ud")
The above line will automatically do the sentence breaking, POS tagging and lemmatisation. However, you can do this using your own downloaded model:
verbatims <- unique(crfsuite_annotation_verbatim_to_annotate[, c("doc_id", "text")])
#you can use the above downloaded model to annotate i.e. POS, sentence breaking etc.
verbatim_tokens <- udpipe_annotate(udmodel,
x = verbatims$text,
doc_id = verbatims$doc_id)
class(verbatim_tokens)
verbatim_tokens <- as.data.frame(verbatim_tokens, detailed = TRUE) # detailed = TRUE keeps the start/end offsets needed later by the merge
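Either way, the resulting data.frame contains the token, its upos tag, its lemma and the character offsets start and end, which the merge below relies on. A quick inspection (column names as produced by udpipe):
# peek at the columns we will need downstream; start/end are required by the merge
head(verbatim_tokens[, c("doc_id", "token", "upos", "lemma", "start", "end")])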
Now we need to do two more steps before we are ready to train our model:
- Merge the annotated and tokenised datasets
- Attach the crf attributes
This is how you could achieve these two steps:
x <- merge(crfsuite_annotation_verbatim_to_annotate, verbatim_tokens)
x <- crf_cbind_attributes(x, terms = c("upos", "lemma"), by = "doc_id")
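crf_cbind_attributes expands each term into attribute columns for the token and its neighbours within the sequence; a quick way to see what was added:
# list the attribute columns that crf_cbind_attributes created
grep("upos|lemma", colnames(x), value = TRUE)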
Now you are ready to build the model. By default, the CRF model is trained using L-BFGS with L1/L2 regularization, but other training methods are also available, namely: SGD with L2-regularization, Averaged Perceptron, Passive Aggressive, and Adaptive Regularization of Weights.
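To see which hyperparameters each training method accepts, the package exposes crf_options (a quick sketch; method names as documented in the crfsuite package):
# inspect the tunable options of the default and an alternative training method
crf_options("lbfgs")
crf_options("l2sgd")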
In the below example we use the default parameters and decrease the iterations a bit to have a model ready quickly. Provide the label with the categories (y) and the attributes of the observations (x), and indicate the sequence group (in this case we take doc_id). The model will be saved to your working directory as tagger.crfsuite.
model <- crf(y = x$chunk_entity,
x = x[, grep("upos|lemma", colnames(x), value = TRUE)],
group = x$doc_id,
method = "lbfgs",file = "tagger.crfsuite",
options = list(max_iterations = 15))
stats <- summary(model)
Check the loss evolution by:
plot(stats$iterations$loss, pch = 20, type = "b",
main = "Loss evolution", xlab = "Iteration", ylab = "Loss")
Use the model
Now it’s time to test the model and get our predictions of the named entity / chunks / categories you have trained.
This should ideally be done on holdout data; here, for illustration, we predict back on the training data. Provide the model, your data with the attributes and indicate the group the attributes belong to.
scores <- predict(model,
                  newdata = x[, grep("upos|lemma", colnames(x), value = TRUE)],
                  group = x$doc_id)
#get the labels predicted by the model
x$entity <- scores$label
#Build the confusion matrix:
table(x$entity, x$chunk_entity)
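From the confusion matrix you can derive a quick overall accuracy (a rough sketch; a proper evaluation should use a holdout set):
# overall token-level accuracy of the predicted labels
mean(x$entity == x$chunk_entity, na.rm = TRUE)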
That's it, you are done! You should play with the other tuning arguments when training the model to see if they improve the accuracy of your model, for example as sketched below.
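For instance, a sketch of trying a different optimiser with more iterations (method names as per the crfsuite documentation; whether it helps depends on your data):
# retrain with SGD + L2 regularisation and more iterations
model_sgd <- crf(y = x$chunk_entity,
                 x = x[, grep("upos|lemma", colnames(x), value = TRUE)],
                 group = x$doc_id,
                 method = "l2sgd",
                 options = list(max_iterations = 50))
stats_sgd <- summary(model_sgd)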
Using the RDRPOSTagger R package to tokenise and POS tag:
The purpose of this method was to test whether CRFsuite can handle POS tags produced by a different R package. I ran into all sorts of errors and weird outputs, which pointed to this line:
x <- merge(crfsuite_annotation_verbatim_to_annotate, verbatim_tokens)
This merge assumes that you have the fields start and end in verbatim_tokens. You need to use verbatim_tokens <- as.data.frame(verbatim_tokens, detailed = TRUE) to get these, or just use udpipe(verbatims, udmodel).
Install and load the package using:
devtools::install_github("bnosac/RDRPOSTagger", build_vignettes = TRUE)
library(RDRPOSTagger)
Find out available languages and models:
rdr_available_models()
Define the language and the type of tagger:
tagger <- RDRPOSTagger::rdr_model(language = "English", annotation = "UniversalPOS")
POS and tokenise text by:
rdr_tagging <- RDRPOSTagger::rdr_pos(object = tagger, x = verbatims$text)
Note: I am using the same unique doc_id with text as above.
The output adds "d" to the doc_id so we need to strip it:
rdr_tagging <- rdr_tagging %>% mutate(doc_id = as.integer(readr::parse_number(doc_id))) %>% rename(upos = pos)
Here is the output:

Here is the comparison between UDPipe and RDRPOSTagger:

Here I liked the RDRPOSTagger package more than UDPipe, because it detected numbers better: UDPipe missed the word one as belonging to the number category, whereas this package was able to extract it.
Here is where the trouble started: when I tried to merge the two datasets, i.e. the annotated one which we got using the Shiny app and the POS dataset we just created, I got the following error:
Error in merge.chunkrange(crfsuite_annotation_verbatim_to_annotate, rdr_tagging) :
all(c(by.y, "start", "end") %in% colnames(y)) is not TRUE
But when I look at the verbatim_tokens variable that was created using the UDPipe model (without detailed = TRUE), it doesn't have those columns either. Therefore I just flipped the argument positions, since the annotated rds dataset does have those columns.
Error disappears if changed position:
x <- merge(rdr_tagging,crfsuite_annotation_verbatim_to_annotate)
Remember, we don't have lemmas here, so the crf attributes binding had to be adjusted slightly:
x <- crf_cbind_attributes(x, terms = c("upos"), by = "doc_id")
I noticed that when creating the x variable for UDPipe using crf_cbind_attributes, it created a total of 68 columns, whereas here it just created 36 columns. And this also creates duplicates, as shown:
I deduplicated using:
x <- x %>% distinct(doc_id,token_id,.keep_all = TRUE)
Now build the model:
model <- crf(y = x$chunk_entity,
x = x[, grep("upos|lemma", colnames(x), value = TRUE)],
group = x$doc_id,
method = "lbfgs", options = list(max_iterations = 15))
stats <- summary(model)
This is the comparison between the two models:

The difference in the chunk_entity column between the two methods was that the first method automatically prefixes the annotated categories with I, B, O. For example, when I created the x variable for the first method, it created the chunk_entity like this:
whereas for the RDRPOSTagger POS it created the chunk_entity like this:
So for now I will stick to the first method of POS tagging, utilising the UDPipe models for tokenisation and POS tagging prior to merging and binding the crf attributes.
Hope this helps!