stevenloria.com

Tutorial: State-of-the-art Part-of-Speech Tagging in TextBlob

Sep 16, 2013


Following the tradition of writing a short tutorial with each new TextBlob release (0.6.3, changelog), here's an introduction to TextBlob's first outside code contribution, from Matthew Honnibal, a.k.a. syllog1sm: a part-of-speech tagger based on the Averaged Perceptron algorithm, which is faster and more accurate than NLTK's and pattern's default implementations.

Matthew Honnibal wrote a clear and detailed blog post about the Averaged Perceptron and his implementation here. For this reason, this post will focus on how to get and use the tagger without going into implementation details.

Getting the PerceptronTagger


UPDATE September 25, 2013: TextBlob 0.7.0 and the textblob-aptagger 0.1.0 extension are released. Instead of following the instructions below for getting the PerceptronTagger, just run

$ pip install -U textblob textblob-aptagger

UPDATE September 19, 2013: The installation process for the PerceptronTagger will be simplified in TextBlob 0.7.0 once the extension system is in place (should be released within the next couple of weeks). If you want to try it out early, install the dev version of TextBlob then install the textblob-aptagger extension here. Otherwise, TextBlob 0.6.3 users can use the instructions below.


First, upgrade to the latest version of TextBlob.

$ pip install -U textblob

The PerceptronTagger requires a trontagger.pickle file that is not included in the TextBlob distribution (in order to keep the distribution lightweight).

The file can be downloaded from TextBlob's Releases page on GitHub [1].

After downloading the file, unzip it. On Mac OS X, this can be done by double-clicking the file. Or you can use the shell:

$ gunzip trontagger.pickle.gz
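If gunzip isn't available on your system (on Windows, for example), the same step can be done with Python's standard library. This is just a minimal sketch: the helper name gunzip_file is my own, and the file names in the usage comment assume the archive is in the current directory. Unlike gunzip, it leaves the original .gz file in place.

```python
import gzip
import shutil


def gunzip_file(gz_path, out_path):
    """Decompress gz_path into out_path and return out_path.

    Hypothetical helper, not part of TextBlob. The data is streamed,
    so the whole file is never held in memory at once.
    """
    with gzip.open(gz_path, "rb") as compressed:
        with open(out_path, "wb") as decompressed:
            shutil.copyfileobj(compressed, decompressed)
    return out_path


# Usage (assumes the downloaded archive is in the current directory):
# gunzip_file("trontagger.pickle.gz", "trontagger.pickle")
```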

You should now have trontagger.pickle. You need to put this in your TextBlob installation directory. To find where this is, run

$ python -c "import text; print(text.__path__[0])"

This will output the TextBlob directory. Place trontagger.pickle in this directory.
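If you'd rather script that last step than copy the file by hand, something like the following should work. It is only a sketch: place_model_file is a name I made up, and the usage comment assumes TextBlob 0.6.3 (importable as text) with trontagger.pickle in the current directory.

```python
import os
import shutil


def place_model_file(model_path, package_dir):
    """Copy a downloaded model file into a package directory.

    Hypothetical helper, not part of TextBlob. Both arguments are
    plain filesystem paths; the destination path is returned.
    """
    if not os.path.isfile(model_path):
        raise IOError("Model file not found: %s" % model_path)
    if not os.path.isdir(package_dir):
        raise IOError("Package directory not found: %s" % package_dir)
    destination = os.path.join(package_dir, os.path.basename(model_path))
    shutil.copyfile(model_path, destination)
    return destination


# Usage (assumes TextBlob 0.6.3 is installed and importable as `text`):
# import text
# place_model_file("trontagger.pickle", text.__path__[0])
```

Depending on where TextBlob is installed, you may need elevated permissions to write to that directory.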

You're all set to use the tagger!

A Short Intro to the Blobber Class

Let's start tagging some text. To do this, you pass an instance of the tagger into the TextBlob constructor.

from text.blob import TextBlob as tb
from textblob_aptagger import PerceptronTagger

ap_tagger = PerceptronTagger()
# This is verbose; we'll see a DRYer version later
b1 = tb("Beautiful is better than ugly.", pos_tagger=ap_tagger)
b2 = tb("Simple is better than complex.", pos_tagger=ap_tagger)
print(b1.tags)
# [('Beautiful', u'NNP'), ('is', u'VBZ'), ('better', u'JJR'), ('than', u'IN'), ('ugly', u'RB')]
print(b2.tags)
# [('Simple', u'NN'), ('is', u'VBZ'), ('better', u'JJR'), ('than', u'IN'), ('complex', u'JJ')]

However, passing the tagger can get repetitive when making many TextBlobs. To avoid this, we can use the Blobber class, which is a "factory" that creates TextBlobs that share the same models. Let's rewrite the above code using a Blobber.

from text.blob import Blobber
from textblob_aptagger import PerceptronTagger

tb = Blobber(pos_tagger=PerceptronTagger())
b1 = tb("Beautiful is better than ugly.")
b2 = tb("Simple is better than complex")
print(b1.pos_tagger is b2.pos_tagger)  # True
print(b1.tags)
print(b2.tags)
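To make the "factory" idea concrete, here is a toy version of the pattern. To be clear, this is not TextBlob's actual implementation: ToyTagger, ToyBlob, and ToyBlobber are stand-ins I invented to show how one tagger instance can be shared by every object the factory creates.

```python
class ToyTagger(object):
    """Stand-in tagger that labels every token 'NN' (purely illustrative)."""

    def tag(self, text):
        return [(token, "NN") for token in text.split()]


class ToyBlob(object):
    """Holds some text plus a reference to the tagger it should use."""

    def __init__(self, text, pos_tagger):
        self.text = text
        self.pos_tagger = pos_tagger

    @property
    def tags(self):
        return self.pos_tagger.tag(self.text)


class ToyBlobber(object):
    """Factory that hands the same tagger instance to every blob it creates."""

    def __init__(self, pos_tagger):
        self.pos_tagger = pos_tagger

    def __call__(self, text):
        return ToyBlob(text, pos_tagger=self.pos_tagger)


make_blob = ToyBlobber(pos_tagger=ToyTagger())
first = make_blob("Beautiful is better than ugly")
second = make_blob("Simple is better than complex")
print(first.pos_tagger is second.pos_tagger)  # True
```

The point of the design is that an expensive model is constructed once, when the factory is created, rather than once per blob.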

Evaluating the Taggers

Now let's do a quick-and-dirty accuracy comparison of the Perceptron tagger with NLTK's and pattern's implementations.

The test data will be a small sample of tagged text (83 words, not counting punctuation), stored as a list of three lists of (word, tag) tuples.

test = [[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'), 
            (u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), (u'will', u'MD'), 
            (u'join', u'VB'), (u'the', u'DT'), (u'board', u'NN'), (u'as', u'IN'), 
            (u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'), 
            (u'Nov.', u'NNP'), (u'29', u'CD'), (u'.', u'.')], 
        [(u'Mr.', u'NNP'), (u'Vinken', u'NNP'), (u'is', u'VBZ'), (u'chairman', u'NN'), 
            (u'of', u'IN'), (u'Elsevier', u'NNP'), (u'N.V.', u'NNP'), (u',', u','), 
            (u'the', u'DT'), (u'Dutch', u'NNP'), (u'publishing', u'VBG'), 
            (u'group', u'NN'), (u'.', u'.'), (u'Rudolph', u'NNP'), (u'Agnew', u'NNP'), 
            (u',', u','), (u'55', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'), 
            (u'and', u'CC'), (u'former', u'JJ'), (u'chairman', u'NN'), (u'of', u'IN'), 
            (u'Consolidated', u'NNP'), (u'Gold', u'NNP'), (u'Fields', u'NNP'), 
            (u'PLC', u'NNP'), (u',', u','), (u'was', u'VBD'), (u'named', u'VBN'), 
            (u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'), (u'of', u'IN'), 
            (u'this', u'DT'), (u'British', u'JJ'), (u'industrial', u'JJ'), 
            (u'conglomerate', u'NN'), (u'.', u'.')], 
        [(u'A', u'DT'), (u'form', u'NN'), 
            (u'of', u'IN'), (u'asbestos', u'NN'), (u'once', u'RB'), (u'used', u'VBN'), 
            (u'to', u'TO'), (u'make', u'VB'), (u'Kent', u'NNP'), (u'cigarette', u'NN'), 
            (u'filters', u'NNS'), (u'has', u'VBZ'), (u'caused', u'VBN'), (u'a', u'DT'), 
            (u'high', u'JJ'), (u'percentage', u'NN'), (u'of', u'IN'), 
            (u'cancer', u'NN'), (u'deaths', u'NNS'),
            (u'among', u'IN'), (u'a', u'DT'), (u'group', u'NN'), (u'of', u'IN'), 
            (u'workers', u'NNS'), (u'exposed', u'VBN'), (u'to', u'TO'), (u'it', u'PRP'), 
            (u'more', u'RBR'), (u'than', u'IN'), (u'30', u'CD'), (u'years', u'NNS'), 
            (u'ago', u'IN'), (u',', u','), (u'researchers', u'NNS'), 
            (u'reported', u'VBD'), (u'.', u'.')]]

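We then define an accuracy() function that is passed our test dataset and an instance of a tagger. It returns the fraction of non-punctuation words whose predicted tag matches the tag in the test data.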

import string
from text.blob import Blobber
from text.taggers import PatternTagger, NLTKTagger
from textblob_aptagger import PerceptronTagger

def accuracy(test_set, tagger):
    n_correct = 0
    total = 0
    tb = Blobber(pos_tagger=tagger)
    for tagged_sentence in test_set:
        # Get the untagged sentence string
        # e.g. "Pierre Vinken , 61 years old , will join the board ..."
        raw_sentence = ' '.join([word for word, tag in tagged_sentence])
        blob = tb(raw_sentence)  # Create a blob that uses the specified tagger
        # tagger excludes punctuation by default
        tags = [tag for word, tag in blob.tags]
        # exclude punctuation in test data
        target_tags = [tag for word, tag in tagged_sentence 
                       if tag not in string.punctuation]
        total += len(tags)
        # Add the number of correct tags; zip() pairs each predicted tag
        # with the target tag at the same position
        n_correct += sum(1 for predicted, target in zip(tags, target_tags)
                         if predicted == target)
    return float(n_correct) / total  # The accuracy
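If you want to see the scoring logic on its own, without installing any tagger, here is a stripped-down sketch that compares two already-tagged versions of the same text. The name token_accuracy and the tiny gold/predicted lists are made up for illustration; unlike the function above, it raises an error if the two tag sequences don't line up.

```python
import string

# Tags such as ',' and '.' mark punctuation tokens in the test data
PUNCTUATION_TAGS = set(string.punctuation)


def token_accuracy(gold_sentences, predicted_sentences):
    """Return the fraction of non-punctuation tokens tagged correctly."""
    n_correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        gold_tags = [tag for _, tag in gold
                     if tag not in PUNCTUATION_TAGS]
        predicted_tags = [tag for _, tag in predicted
                          if tag not in PUNCTUATION_TAGS]
        if len(gold_tags) != len(predicted_tags):
            raise ValueError("Tag sequences are not aligned")
        total += len(gold_tags)
        n_correct += sum(1 for g, p in zip(gold_tags, predicted_tags)
                         if g == p)
    return float(n_correct) / total


# Hypothetical data: the "predicted" tagging gets one of five words wrong
gold = [[("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","),
         ("61", "CD"), ("years", "NNS"), ("old", "JJ")]]
predicted = [[("Pierre", "NNP"), ("Vinken", "NN"), (",", ","),
              ("61", "CD"), ("years", "NNS"), ("old", "JJ")]]
score = token_accuracy(gold, predicted)
print(score)  # 0.8
```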

We can then get the accuracy of each tagger.

print(accuracy(test, PerceptronTagger()))
print(accuracy(test, NLTKTagger()))
print(accuracy(test, PatternTagger()))

Full script is here.

Results

Tagger             Accuracy
PerceptronTagger   98.8%
NLTKTagger         94.0%
PatternTagger      91.6%
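A quick sanity check on these numbers shows how small the test set is. The sketch below assumes 83 scored words, which is what I get from counting the non-punctuation entries in the test data; the per-tagger counts of correct tags (82, 78, 76) are inferred from the percentages and are not reported anywhere in this post.

```python
# Assumption: 83 scored words (non-punctuation entries in the test data).
TOTAL_WORDS = 83


def percent_correct(n_correct, total=TOTAL_WORDS):
    """Format n_correct out of total as a percentage with one decimal."""
    return "%.1f%%" % (100.0 * n_correct / total)


# Inferred counts of correct tags, chosen to match the reported accuracies
for n_correct in (82, 78, 76):
    print("%d/%d = %s" % (n_correct, TOTAL_WORDS, percent_correct(n_correct)))
```

Under that assumption, a single mistagged word moves the score by about 1.2 percentage points, so treat the comparison as a rough indication only.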

You can find more extensive evaluations of these three taggers in Matthew's blog post. The numbers are quite impressive. Thank you, Matthew, for all your hard work on this!

Further Reading


  1. Hosting supplemental models and data on the Releases page is not an ideal solution for the long term because GitHub imposes a 5MB limit on file uploads, but it will work for now. If you have a suggestion for a better solution, please join the discussion here.
