A simple interactive shell for TnT

By Filip Salomonsson; published on October 11, 2006.

Every now and then, I need to tag a sentence with part-of-speech tags. Typically, I need a short example for a lecture, but don't want to just yank a sentence from a corpus. So I need to either tag it manually or use a tagger, usually Thorsten Brants' TnT (Trigrams'n'Tags).

I am lazy, tagging can be tricky, and the tagsets aren't always very intuitive, so it should come as no surprise that I prefer the TnT way.

TnT wants its input data in a vertical, tab-separated format. Basically, one token on each line, with sentences separated by a blank line:

This
is
an
example
.

This
is
another
...

(POS tags are then inserted after each token, with a tab in between.)
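As a minimal sketch of that layout, here is a small helper that renders already tokenized sentences in the vertical form, with each sentence followed by a blank line. The name to_vertical is made up for illustration and is not part of TnT; the sketch is written for Python 3.

```python
def to_vertical(sentences):
    """Render tokenized sentences one token per line.

    Each sentence is followed by a blank line, which is what marks
    the sentence boundary in the vertical format.
    """
    lines = []
    for tokens in sentences:
        lines.extend(tokens)
        lines.append("")  # blank line closes the sentence
    return "\n".join(lines) + "\n"


corpus = to_vertical([["This", "is", "an", "example", "."],
                      ["This", "is", "another"]])
print(corpus, end="")
```

Writing the result to a file would give a mini corpus of the kind shown above.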

Preparing a mini corpus like this and running TnT manually every time the need arises quickly gets a bit tedious, so I wrote a simple interactive shell to help me.

Let me walk you through it.

First, I set up the path to the TnT executable and to the language model I want TnT to use. (I'm using a trigram model built from about a million words of Swedish.)

TNT = '/path/to/tnt'
MODEL = '/path/to/my/model'

Boooring. On to the meaty stuff. The tokenizer is based on a simple regular expression:

tokenizer = re.compile(r'\w+|\S', re.UNICODE)

Not very fancy at all, but definitely sufficient for trivial cases.

>>> tokenizer.findall(u"Foo, on you too!")
[u'Foo', u',', u'on', u'you', u'too', u'!']

Excellent.
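To get a feel for where "trivial" ends, here is a quick sketch that runs the same pattern over less tidy input. It is written for Python 3, where str patterns are Unicode-aware by default, so the re.UNICODE flag is left out. The sample strings and the tokenize helper are made up for illustration.

```python
import re

# Same pattern as above: a run of word characters, or any single
# non-whitespace character.
tokenizer = re.compile(r'\w+|\S')


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return tokenizer.findall(text)


# Apostrophes and hyphens are not word characters, so they split words.
print(tokenize("Don't e-mail me."))
# ['Don', "'", 't', 'e', '-', 'mail', 'me', '.']

# Decimal numbers get split at the decimal point as well.
print(tokenize("3.50"))
# ['3', '.', '50']
```

Whether that matters depends on how the training corpus for the model was tokenized; for short lecture examples it rarely gets in the way.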

To get data to and from TnT, we start a subprocess and communicate with it using pipes. Unsurprisingly, the subprocess module comes in handy for this.

tnt = subprocess.Popen([TNT, '-v0', MODEL, '-'],
                       stdin=subprocess.PIPE, stdout=subprocess.PIPE)

(The -v0 flag sets TnT's verbosity level to "silent", skipping progress and header info in the output.)

Now, let's get a sentence from the user (c'est moi),

sentence = unicode(raw_input(">>> "), sys.stdin.encoding)

tokenize it,

tokens = tokenizer.findall(sentence)

and write the tokens to TnT's standard input, followed by a blank line.

for token in tokens:
    tnt.stdin.write(token.encode("utf-8") + '\n')
tnt.stdin.write('\n')

Note that the internal processing is all done in Unicode. The data is decoded from whatever encoding stdin uses, and written to TnT as UTF-8, since that's how my model was encoded.
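The decode-then-encode step can be sketched on its own. This is a Python 3 illustration, not the code from the shell: the recode helper is hypothetical, and the Latin-1 terminal is only an assumption for the demo. The fallback to UTF-8 is also an assumption; it is there because the detected input encoding can be missing (in Python 2, sys.stdin.encoding is None when input is piped rather than typed).

```python
def recode(raw, source_encoding, target_encoding="utf-8"):
    """Decode bytes from the input encoding, re-encode for the tagger."""
    # Assumption: fall back to UTF-8 if no input encoding was detected.
    text = raw.decode(source_encoding or "utf-8")
    return text.encode(target_encoding)


# Pretend the terminal delivered the sentence as Latin-1 bytes.
raw = "Filip är bäst.".encode("latin-1")
wire = recode(raw, "latin-1")

print(len(raw))   # 14: one byte per character in Latin-1
print(len(wire))  # 16: each 'ä' takes two bytes in UTF-8
```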

At this point, TnT will notice we've fed it a whole sentence, do its black voodoo magic and spit the sentence back out with the proper tags. That means it's time to read the result: the same number of lines as we just wrote (one per token, plus the blank line), splitting each line on whitespace to get the token/tag pair.

pairs = []
for _ in xrange(len(tokens) + 1):
    line = tnt.stdout.readline()
    pairs.append(line.split())
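The reading step can be tried without TnT installed. The sketch below (Python 3) reads from an in-memory byte stream that stands in for TnT's standard output. The read_tagged helper is hypothetical and the canned output is an assumption modelled on the vertical format described earlier. Unlike the loop above, it drops the empty list that the blank line produces.

```python
import io


def read_tagged(stream, n_tokens):
    """Read n_tokens tagged lines plus the blank separator line."""
    pairs = []
    for _ in range(n_tokens + 1):
        fields = stream.readline().decode("utf-8").split()
        if fields:  # the blank line splits into an empty list; skip it
            pairs.append(tuple(fields[:2]))
    return pairs


# Stand-in for tnt.stdout: three tagged tokens and a blank line.
fake_output = io.BytesIO(b"Hej\tI\n,\tFI\njag\tPF@USS@S\n\n")
pairs = read_tagged(fake_output, 3)
print(pairs)
# [('Hej', 'I'), (',', 'FI'), ('jag', 'PF@USS@S')]
```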

And that's pretty much it. We can now print the output back to the user in a simple slash-separated horizontal fashion.

print " ".join(["/".join(pair) for pair in pairs])
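One small wrinkle: the blank line read back from TnT ends up in pairs as an empty list, which joins to an empty string and leaves a trailing space on the printed line. A possible variant (a Python 3 sketch, with format_pairs as a made-up name) simply skips empty entries:

```python
def format_pairs(pairs):
    """Join token/tag pairs with slashes, separated by single spaces."""
    return " ".join("/".join(pair) for pair in pairs if pair)


# The final empty list is what the blank line turns into after split().
pairs = [["Hej", "I"], [",", "FI"], ["jag", "PF@USS@S"], []]
line = format_pairs(pairs)
print(line)
# Hej/I ,/FI jag/PF@USS@S
```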

That's just one sentence tagged, though, so we wrap this up in a loop and voilà, our interactive shell is done. Here's the full code:

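#!/usr/bin/env python
import re
import sys
import readline  # imported for its side effect: line editing in raw_input()
import subprocess

# Paths to the TnT executable and the language model.
TNT = '/path/to/tnt'
MODEL = '/path/to/my/model'

# A token is a run of word characters or a single non-space character.
tokenizer = re.compile(r'\w+|\S', re.UNICODE)

# Start TnT once and talk to it through pipes for the whole session.
tnt = subprocess.Popen([TNT, '-v0', MODEL, '-'],
                       stdin=subprocess.PIPE, stdout=subprocess.PIPE)

print "Enter a sentence to tag it, or press Enter to quit."

while True:
    try:
        # Decode the input from the terminal's encoding to unicode.
        sentence = unicode(raw_input(">>> "), sys.stdin.encoding)

        if not sentence:
            break

        tokens = tokenizer.findall(sentence)

        # One token per line, then a blank line to end the sentence.
        for token in tokens:
            tnt.stdin.write(token.encode("utf-8") + '\n')
        tnt.stdin.write('\n')

        # Read back one line per token, plus the blank line.
        pairs = []
        for _ in xrange(len(tokens) + 1):
            line = tnt.stdout.readline()
            pairs.append(line.split())

        print " ".join(["/".join(pair) for pair in pairs])

    except EOFError:
        # Ctrl-D: finish the prompt line and quit.
        print
        break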

And here's what it looks like in action.

$ python tnt.py
Enter a sentence to tag it, or press Enter to quit.
>>> Hej, jag heter Filip!
Hej/I ,/FI jag/PF@USS@S heter/V@IPAS Filip/NP00N@0S !/FE 
>>> Filip är bäst.
Filip/NP00N@0S är/V@IPAS bäst/AQS00NIS ./FE 
>>> 
$

It's all good. A neat extension of this would be to build a tnt module that can be plugged into any application that needs tagging capabilities. Perhaps some other time, though. I'm going for a walk.
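For the record, here is one rough shape such a module could take. It is only a sketch under stated assumptions, not a finished design: the Tagger class and its methods are hypothetical, the code is written for Python 3 (hence the explicit flush, since Python 3 buffers the pipe by default), and FAKE_TAGGER is a stand-in child process that tags every token "X" so the example runs without TnT installed.

```python
import re
import subprocess
import sys

# Stand-in for TnT (an assumption for the demo): echoes each token with
# the dummy tag "X" and passes blank lines through unchanged.
FAKE_TAGGER = r'''
import sys
for line in iter(sys.stdin.buffer.readline, b""):
    token = line.strip()
    sys.stdout.buffer.write(token + b"\tX\n" if token else b"\n")
    sys.stdout.buffer.flush()
'''


class Tagger(object):
    """Hypothetical wrapper around a line-oriented tagger process."""

    def __init__(self, command, encoding="utf-8"):
        self.encoding = encoding
        self.tokenizer = re.compile(r'\w+|\S')
        self.proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE)

    def tag(self, sentence):
        """Tokenize a sentence and return a list of (token, tag) tuples."""
        tokens = self.tokenizer.findall(sentence)
        for token in tokens:
            self.proc.stdin.write(token.encode(self.encoding) + b"\n")
        self.proc.stdin.write(b"\n")  # blank line ends the sentence
        self.proc.stdin.flush()       # push the data through the pipe
        pairs = []
        for _ in range(len(tokens) + 1):
            line = self.proc.stdout.readline().decode(self.encoding)
            fields = line.split()
            if fields:  # skip the blank separator line
                pairs.append(tuple(fields))
        return pairs

    def close(self):
        """Close the pipes and wait for the child process to exit."""
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


with Tagger([sys.executable, "-c", FAKE_TAGGER]) as tagger:
    result = tagger.tag("Hej, jag heter Filip!")
print(result)
```

With a real installation, the command list would presumably be the same one passed to subprocess.Popen in the shell above: [TNT, '-v0', MODEL, '-'].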