Every now and then, I need to tag a sentence with part-of-speech tags. Typically, I need a short example for a lecture, but don't want to just yank a sentence from a corpus. So I need to either tag it manually or use a tagger - usually Thorsten Brants' TnT (Trigrams'n'Tags).
I am lazy, tagging can be tricky and the tagsets aren't always very intuitive, so it should come as no surprise I prefer the TnT way.
TnT wants its input data to be in a vertical tab-separated format. Basically, one token on each line, with sentences separated by a blank line:
This
is
an
example
.

This
is
another
...
(POS tags are then inserted after each token, with a tab in between.)
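The vertical layout described above can be sketched in a few lines. This is only an illustration of the format as I described it, written for Python 3 (unlike the Python 2 snippets in the rest of this post); the helper name `to_vertical` is made up for the example and is not part of TnT.

```python
def to_vertical(sentences):
    """Render tokenized sentences with one token per line.

    `sentences` is a list of token lists. Every sentence is followed
    by a blank line, which is what separates it from the next one.
    """
    lines = []
    for tokens in sentences:
        lines.extend(tokens)
        lines.append("")  # blank line closes the sentence
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_vertical([["This", "is", "an", "example", "."],
                       ["This", "is", "another", "..."]]), end="")
```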
Preparing a mini corpus like this and running TnT manually every time the need arises quickly gets a bit tedious, so I wrote a simple interactive shell to help me.
Let me walk you through it.
First, I set up the path to the TnT executable and to the language model I want TnT to use. (I'm using a trigram model built from about a million words of Swedish.)
TNT = '/path/to/tnt'
MODEL = '/path/to/my/model'
Boooring. On to the meaty stuff. The tokenizer is based on a simple regular expression:
tokenizer = re.compile(r'\w+|\S', re.UNICODE)
Not very fancy at all, but definitely sufficient for trivial cases.
>>> tokenizer.findall(u"Foo, on you too!")
[u'Foo', u',', u'on', u'you', u'too', u'!']
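To make "trivial cases" a bit more concrete, here is a small sketch of where the same expression starts to fall short. It is written for Python 3, and the sample sentence is my own invention: contractions, decimal numbers and abbreviations are all split at every punctuation mark.

```python
import re

# Same pattern as above: a run of word characters, or any single
# non-whitespace character.
tokenizer = re.compile(r'\w+|\S', re.UNICODE)

tokens = tokenizer.findall("Don't pay 3.50 e.g. today")
print(tokens)
# ['Don', "'", 't', 'pay', '3', '.', '50', 'e', '.', 'g', '.', 'today']
```

Whether that matters depends on how the training corpus of the model was tokenized; for short lecture examples I can simply avoid such sentences.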
To get data to and from TnT, we start a subprocess and communicate with it using pipes. Unsurprisingly, the subprocess module comes in handy for this.
tnt = subprocess.Popen([TNT, '-v0', MODEL, '-'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
(The -v0 flag sets TnT's verbosity level to "silent", skipping progress and header info in the output.)
Now, let's get a sentence from the user (c'est moi),
sentence = unicode(raw_input(">>> "), sys.stdin.encoding)
tokens = tokenizer.findall(sentence)
and write the tokens to TnT's standard input, followed by a blank line.
for token in tokens:
    tnt.stdin.write(token.encode("utf-8") + '\n')
tnt.stdin.write('\n')
Note that the internal processing is all done in Unicode. The data is decoded from whatever encoding stdin uses, and written to TnT as UTF-8, since that's how my model was encoded.
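The decode-then-encode step is easy to see on a single word. The sketch below is written for Python 3 and assumes, purely for illustration, a terminal that delivers Latin-1 bytes; on a real system the input encoding is whatever `sys.stdin.encoding` reports.

```python
# "bäst" as it would arrive from a Latin-1 terminal (an assumption
# made for this example only).
raw = b"b\xe4st"

text = raw.decode("latin-1")   # decode once, at the input boundary
wire = text.encode("utf-8")    # encode again for the tagger's stdin

print(repr(text), repr(wire))
```

The single byte `\xe4` becomes the two-byte sequence `\xc3\xa4` in UTF-8, which is why passing the raw input straight through to a UTF-8 model would garble every non-ASCII token.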
At this point, TnT will notice we've fed it a whole sentence, do its black voodoo magic and spit the sentence back out with the proper tags. That means it's time to read the result: the same number of lines as we just wrote (one per token, plus the blank line), splitting each line on whitespace to get the token/tag pair.
pairs = []
for _ in xrange(len(tokens) + 1):
    line = tnt.stdout.readline()
    pairs.append(line.split())
And that's pretty much it. We can now print the output back to the user in a simple slash-separated horizontal fashion.
print " ".join(["/".join(pair) for pair in pairs])
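The read-and-format step can also be tried without TnT at hand. The sketch below is written for Python 3; `format_tagged` is a name I made up, and the sample lines imitate the tab-separated output described earlier, using tags from my Swedish model. Unlike the one-liner above, it skips the empty pair produced by the blank line, which avoids a trailing space.

```python
def format_tagged(lines):
    """Turn tagger output lines into 'token/tag token/tag ...'."""
    pairs = [line.split() for line in lines]
    # The blank line that ends a sentence splits into an empty list;
    # leave it out of the result.
    return " ".join("/".join(pair) for pair in pairs if pair)


if __name__ == "__main__":
    sample = ["Hej\tI\n", ",\tFI\n", "\n"]
    print(format_tagged(sample))
```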
That's just one sentence tagged, though, so we wrap this up in a loop and voilà - our interactive shell is done. Here's the full code:
#!/usr/bin/env python
import re
import sys
import readline  # importing it gives raw_input() line editing and history
import subprocess

TNT = '/path/to/tnt'
MODEL = '/path/to/model'

tokenizer = re.compile(r'\w+|\S', re.UNICODE)
tnt = subprocess.Popen([TNT, '-v0', MODEL, '-'],
                       stdin=subprocess.PIPE, stdout=subprocess.PIPE)

print "Enter a sentence to tag it, or press Enter to quit."
while True:
    try:
        sentence = unicode(raw_input(">>> "), sys.stdin.encoding)
        if not sentence:
            break
        tokens = tokenizer.findall(sentence)
        # One token per line, then a blank line to end the sentence.
        for token in tokens:
            tnt.stdin.write(token.encode("utf-8") + '\n')
        tnt.stdin.write('\n')
        # Read back one line per token, plus the blank line.
        pairs = []
        for n in xrange(len(tokens) + 1):
            line = tnt.stdout.readline()
            pairs.append(line.split())
        print " ".join(["/".join(pair) for pair in pairs])
    except EOFError:
        print
        break
And here's what it looks like in action.
$ python tnt.py
Enter a sentence to tag it, or press Enter to quit.
>>> Hej, jag heter Filip!
Hej/I ,/FI jag/PF@USS@S heter/V@IPAS Filip/NP00N@0S !/FE
>>> Filip är bäst.
Filip/NP00N@0S är/V@IPAS bäst/AQS00NIS ./FE
>>>
$
It's all good. A neat extension of this would be to build a tnt module that can be plugged into any application that needs tagging capabilities. Perhaps some other time, though. I'm going for a walk.
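For the curious, here is a rough sketch of what the core of such a module might look like. It is written for Python 3, the class name `PipeTagger` is hypothetical, and the demo talks to a stand-in child process that tags every token as "X", so it runs without TnT installed. With the real thing, the command would be the same list passed to Popen above: [TNT, '-v0', MODEL, '-'].

```python
import subprocess
import sys


class PipeTagger(object):
    """Hypothetical wrapper around a line-oriented tagger subprocess."""

    def __init__(self, command, encoding="utf-8"):
        self.encoding = encoding
        self.proc = subprocess.Popen(command,
                                     stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE)

    def tag(self, tokens):
        """Send one sentence and return a list of (token, tag) pairs."""
        data = "".join(token + "\n" for token in tokens) + "\n"
        self.proc.stdin.write(data.encode(self.encoding))
        # Python 3 buffers the pipe by default, so push the sentence out.
        self.proc.stdin.flush()
        pairs = []
        # One output line per token, plus the blank line ending the sentence.
        for _ in range(len(tokens) + 1):
            line = self.proc.stdout.readline().decode(self.encoding)
            fields = line.split()
            if len(fields) >= 2:
                pairs.append((fields[0], fields[1]))
        return pairs

    def close(self):
        self.proc.stdin.close()
        self.proc.stdout.close()
        self.proc.wait()


# Stand-in for the tagger, used only in this demo: echoes each token
# with the dummy tag "X" and passes blank lines through unchanged.
FAKE_TAGGER_SOURCE = r"""
import sys
for line in iter(sys.stdin.readline, ''):
    token = line.strip()
    sys.stdout.write(token + '\tX\n' if token else '\n')
    sys.stdout.flush()
"""
FAKE_TAGGER = [sys.executable, "-u", "-c", FAKE_TAGGER_SOURCE]


if __name__ == "__main__":
    tagger = PipeTagger(FAKE_TAGGER)
    print(tagger.tag(["Hej", ",", "jag"]))
    tagger.close()
```

Keeping one long-lived process per tagger object mirrors the shell above: the model is loaded once, and each call to `tag` costs only a write and a few reads. The sketch assumes sentences are short enough to fit in the pipe buffer, as they are for lecture examples.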