A simple interactive shell for TnT
Every now and then, I need to tag a sentence with part-of-speech tags. Typically, I need a short example for a lecture, but don't want to just yank a sentence from a corpus. So I need to either tag it manually or use a tagger - usually Thorsten Brants' TnT (Trigrams'n'Tags).
I am lazy, tagging can be tricky and the tagsets aren't always very intuitive, so it should come as no surprise I prefer the TnT way.
TnT wants its input data to be in a vertical tab-separated format. Basically, one token on each line, with sentences separated by a blank line:
This
is
an
example
.

This
is
another
...
(POS tags are then inserted after each token, with a tab in between.)
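The layout described above can be sketched in a few lines. This is only an illustration, written for Python 3 (the code in this post is Python 2). The helper names `to_vertical` and `parse_tagged` are made up for the example and are not part of TnT, and the parser assumes every non-blank line holds a token followed by its tag.

```python
def to_vertical(sentences):
    """Render tokenized sentences one token per line.

    `sentences` is a list of token lists. Every sentence is followed
    by a blank line, which marks the sentence boundary.
    """
    lines = []
    for tokens in sentences:
        lines.extend(tokens)
        lines.append("")  # blank line ends the sentence
    return "".join(line + "\n" for line in lines)


def parse_tagged(text):
    """Read token<TAB>tag lines back into a list of sentences.

    Assumption: each non-blank line is a token followed by a tag,
    separated by whitespace; blank lines separate sentences.
    """
    sentences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            current.append(tuple(fields[:2]))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

Calling `to_vertical([["This", "is"], ["Yes"]])` gives two short sentences, each terminated by an empty line, which is the shape of the mini corpus shown above.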
Preparing a mini corpus like this and running TnT manually every time the need arises quickly gets a bit tedious, so I wrote a simple interactive shell to help me.
Let me walk you through it.
First, I set up the path to the TnT executable and to the language model I want TnT to use. (I'm using a trigram model built from about a million words of Swedish.)
TNT = '/path/to/tnt'
MODEL = '/path/to/my/model'
Boooring. On to the meaty stuff. The tokenizer is based on a simple regular expression:
tokenizer = re.compile(r'\w+|\S', re.UNICODE)
Not very fancy at all, but definitely sufficient for trivial cases.
>>> tokenizer.findall(u"Foo, on you too!")
[u'Foo', u',', u'on', u'you', u'too', u'!']
Excellent.
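To see where "trivial cases" end, here is a small sketch of the same pattern in Python 3, where `str` patterns are Unicode-aware by default, so the `re.UNICODE` flag is not needed. The wrapper name `tokenize` is just for illustration.

```python
import re

# Same pattern as above: a run of word characters, or any single
# non-whitespace character (punctuation and the like).
tokenizer = re.compile(r'\w+|\S')


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return tokenizer.findall(text)


# Non-ASCII letters count as word characters.
print(tokenize("Filip är bäst."))
# Apostrophes, decimal points and hyphens each become separate tokens.
print(tokenize("Don't pay 3.50 for e-mail!"))
```

The second call splits "Don't" into three tokens and "3.50" into three as well, which is the kind of input where a more careful tokenizer would be worth the effort.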
To get data to and from TnT, we start a subprocess and communicate with it using pipes. Unsurprisingly, the subprocess module comes in handy for this.
tnt = subprocess.Popen([TNT, '-v0', MODEL, '-'],
                       stdin=subprocess.PIPE, stdout=subprocess.PIPE)
(The -v0 flag sets TnT's verbosity level to "silent", skipping progress and header info in the output.)
Now, let's get a sentence from the user (c'est moi),
sentence = unicode(raw_input(">>> "), sys.stdin.encoding)
tokenize it,
tokens = tokenizer.findall(sentence)
and write the tokens to TnT's standard input, followed by a blank line.
for token in tokens:
    tnt.stdin.write(token.encode("utf-8") + '\n')
tnt.stdin.write('\n')
Note that the internal processing is all done in unicode. The data is converted from whatever encoding stdin uses, and written to TnT as utf-8, since that's how my model was encoded.
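The conversion boils down to one decode and one encode. A minimal Python 3 sketch, with a made-up helper name `reencode` and Latin-1 picked only as an example terminal encoding:

```python
def reencode(raw, source_encoding, target_encoding="utf-8"):
    """Decode bytes from one encoding and re-encode them in another."""
    text = raw.decode(source_encoding)    # bytes -> text (Unicode)
    return text.encode(target_encoding)   # text -> bytes for the tagger


# "är" as a Latin-1 terminal would deliver it: one byte per character.
latin1_bytes = "är".encode("latin-1")
utf8_bytes = reencode(latin1_bytes, "latin-1")
print(len(latin1_bytes), len(utf8_bytes))  # "ä" takes two bytes in UTF-8
```

Keeping everything as text in between means the tokenizer never has to care which encoding the terminal or the model happens to use.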
At this point, TnT will notice we've fed it a whole sentence, do its black voodoo magic and spit the sentence back out with the proper tags. That means it's time to read the result, the same number of lines as we just wrote, splitting each line on whitespace to get the token/tag pair.
pairs = []
for _ in xrange(len(tokens) + 1):
    line = tnt.stdout.readline()
    pairs.append(line.split())
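The write-then-read round trip can be tried out without TnT installed. The sketch below is Python 3 and uses a stand-in child process, `FAKE_TAGGER`, which is purely an assumption for illustration: it echoes every token with a dummy tag `X` and passes blank lines through. It says nothing about how TnT itself behaves beyond the line-in, line-out protocol described above. Note that Python 3 buffers pipe writes, so an explicit `flush()` is needed before reading.

```python
import os
import subprocess
import sys

# Stand-in for TnT (illustration only): echo each token with a dummy
# tag, and answer a blank line with a blank line.
FAKE_TAGGER = r'''
import sys
for line in sys.stdin:
    token = line.strip()
    print(token + "\tX" if token else "", flush=True)
'''


def tag_once(proc, tokens):
    """Write one sentence to the child, then read back as many lines."""
    for token in tokens:
        proc.stdin.write(token + "\n")
    proc.stdin.write("\n")   # blank line ends the sentence
    proc.stdin.flush()       # push buffered data to the child now
    pairs = []
    for _ in range(len(tokens) + 1):
        fields = proc.stdout.readline().split()
        if fields:           # skip the blank sentence separator
            pairs.append(fields)
    return pairs


def demo(tokens):
    """Start the stand-in tagger, tag one sentence, and shut it down."""
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    proc = subprocess.Popen([sys.executable, "-c", FAKE_TAGGER],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            encoding="utf-8", env=env)
    try:
        return tag_once(proc, tokens)
    finally:
        proc.stdin.close()
        proc.stdout.close()
        proc.wait()
```

Unlike the loop above, this version drops the empty result for the trailing blank line, so the caller only gets real token/tag pairs.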
And that's pretty much it. We can now print the output back to the user in a simple slash-separated horizontal fashion.
print " ".join(["/".join(pair) for pair in pairs])
That's just one sentence tagged, though, so we wrap this up in a loop and voila - our interactive shell is done. Here's the full code:
#!/usr/bin/env python

import re
import sys
import readline
import subprocess

TNT = '/path/to/tnt'
MODEL = '/path/to/model'

tokenizer = re.compile(r'\w+|\S', re.UNICODE)

tnt = subprocess.Popen([TNT, '-v0', MODEL, '-'],
                       stdin=subprocess.PIPE, stdout=subprocess.PIPE)

print "Enter a sentence to tag it, or press Enter to quit."

while True:
    try:
        sentence = unicode(raw_input(">>> "), sys.stdin.encoding)
        if not sentence:
            break
        tokens = tokenizer.findall(sentence)
        for token in tokens:
            tnt.stdin.write(token.encode("utf-8") + '\n')
        tnt.stdin.write('\n')
        pairs = []
        for n in xrange(len(tokens) + 1):
            line = tnt.stdout.readline()
            pairs.append(line.split())
        print " ".join(["/".join(pair) for pair in pairs])
    except EOFError:
        print
        break
And here's what it looks like in action.
$ python tnt.py
Enter a sentence to tag it, or press Enter to quit.
>>> Hej, jag heter Filip!
Hej/I ,/FI jag/PF@USS@S heter/V@IPAS Filip/NP00N@0S !/FE
>>> Filip är bäst.
Filip/NP00N@0S är/V@IPAS bäst/AQS00NIS ./FE
>>>
$
It's all good. A neat extension of this would be to build a tnt module that can be plugged into any application that needs tagging capabilities. Perhaps some other time, though. I'm going for a walk.
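For the curious, such a module might look roughly like the sketch below. It is written for Python 3, and the class name `Tagger` and its methods are hypothetical; nothing here is an existing TnT API. The caller passes the full command line, for instance the `[TNT, '-v0', MODEL, '-']` list used above, and the class assumes the tagger answers each sentence with one line per token plus a blank line.

```python
import re
import subprocess


class Tagger:
    """Hypothetical wrapper around a line-based tagger process."""

    _tokenizer = re.compile(r'\w+|\S')

    def __init__(self, command):
        # `command` is the complete argument list for the tagger.
        self._proc = subprocess.Popen(command,
                                      stdin=subprocess.PIPE,
                                      stdout=subprocess.PIPE,
                                      encoding="utf-8")

    def tag(self, sentence):
        """Tokenize a sentence and return a list of (token, tag) tuples."""
        tokens = self._tokenizer.findall(sentence)
        # One token per line, then a blank line to end the sentence.
        self._proc.stdin.write("".join(t + "\n" for t in tokens) + "\n")
        self._proc.stdin.flush()
        lines = [self._proc.stdout.readline()
                 for _ in range(len(tokens) + 1)]
        return [tuple(line.split()) for line in lines if line.strip()]

    def close(self):
        """Close the pipes and wait for the tagger to exit."""
        self._proc.stdin.close()
        self._proc.stdout.close()
        self._proc.wait()

    # Context-manager support, so the process is cleaned up reliably.
    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
```

An application could then write `with Tagger(command) as tagger:` and call `tagger.tag("Hej, jag heter Filip!")` as often as it likes, paying the model-loading cost only once.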