Python is great for a lot of things in computational linguistics, but there are times when you run into a dead end.
Let's look at tokenization - the segmentation of text into words, or tokens.
As an example, the expected tokenization of the string "Hello,
World!" would in most cases be
['Hello', ',', 'World', '!']. That's
what you get if you, for example, define a token as "a sequence of
alphanumeric characters or any other single non-whitespace
character":

import re
token = re.compile(r"\w+|\S")
Let's try it out:
>>> s = "Hello, World!"
>>> token.findall(s)
['Hello', ',', 'World', '!']
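The same pattern holds up on messier input, too. Here's a quick sketch with a string of my own choosing (not from the session above), showing that apostrophes and punctuation come out as single-character tokens:

```python
import re

token = re.compile(r"\w+|\S")

# \w+ grabs runs of alphanumerics; \S picks up any leftover
# single non-whitespace character, such as ' and !
print(token.findall("Don't panic!"))  # → ['Don', "'", 't', 'panic', '!']
```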
You can get a long way with this approach, but sometimes it is easier
to define what separates the tokens, and chop up the text based on
that. Let's say a sequence of whitespace characters is always a
separator. Piece of cake! We can use
str.split() for that.
>>> s.split()
['Hello,', 'World!']
But wait, we wanted to cut off the punctuation characters, right? So
another boundary is the empty string that precedes the comma and the
exclamation point. str.split can't do that, so let's go back to the
re toolbox. First, a direct translation of the whitespace separator:

>>> sep = re.compile(r"\s+")
>>> sep.split(s)
['Hello,', 'World!']
To match that empty string before punctuation characters, a fairly simple lookahead assertion comes in handy.
>>> sep = re.compile(r"\s+|(?=[,!])")
If you're not familiar with lookahead assertions, this pattern matches "a sequence of one or more whitespace characters or an empty string (but only if that empty string is followed by a comma or an exclamation point)". So let's try it out:
>>> sep.split(s)
['Hello,', 'World!']
What now? That didn't change the result a bit. So is there something wrong with the regular expression?
>>> len(sep.findall(s))
3
There are three matches. Let's see where they are, just to be sure:
>>> sep.sub("-", s)
'Hello-,-World-!'
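Another way to pin the matches down is to ask finditer for their spans (my addition, not part of the original session). A span whose start equals its end is a zero-width match:

```python
import re

s = "Hello, World!"
sep = re.compile(r"\s+|(?=[,!])")

# (5, 5) and (12, 12) are the zero-width lookahead matches before
# ',' and '!'; (6, 7) is the actual space character
print([m.span() for m in sep.finditer(s)])  # → [(5, 5), (6, 7), (12, 12)]
```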
Okay, so the separators are there, but two of them are silently ignored
by re.split. How rude.
I'm not sure why, but this is simply how
re.split was designed. Zero-length
matches are always skipped when searching for separators to split the
string by. Try the minimal example:
>>> re.sub(r"", "-", "foo")
'-f-o-o-'
>>> re.split(r"", "foo")
['foo']
Four matches, but no split.
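For the skeptical, those four matches are easy to locate with finditer: one zero-width match before each character of "foo", plus one at the very end.

```python
import re

# every match of the empty pattern is zero-width: start == end
print([m.span() for m in re.finditer(r"", "foo")])
# → [(0, 0), (1, 1), (2, 2), (3, 3)]
```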
I can't think of a use case where you'd simply want to split by the
empty string (str.split will even raise a
ValueError if you try),
but with regular expressions, you have both explicit assertions like the lookahead
above, and special zero-width assertion metacharacters like
\b. There are definitely use cases for these, and it's awkward that the
re module limits their use.
I'll see if I can't patch this up somehow.