I want my zero-width splits!
Python is great for a lot of things in computational linguistics, but there are times when you run into a dead end.
Let's look at tokenization - the segmentation of text into words, or tokens.
As an example, the expected tokenization of the string "Hello, World!" would in most cases be ['Hello', ',', 'World', '!']. That's what you get if you, for example, define a token as "a sequence of alphanumeric characters or any other single non-whitespace character".
import re
token = re.compile(r"\w+|\S")
Let's try it out:
>>> s = "Hello, World!"
>>> token.findall(s)
['Hello', ',', 'World', '!']
Excellent!
You can get a long way with this approach, but sometimes it is easier to define what separates the tokens, and chop up the text based on that. Let's say a sequence of whitespace characters is always a separator. Piece of cake! We can use str.split() for that.
>>> s.split()
['Hello,', 'World!']
But wait, we wanted to cut off the punctuation characters, right? So another boundary is the empty string that precedes the comma and the exclamation point. str.split can't do that, so let's go back to the re toolbox. First, a direct translation of the str.split call:
>>> sep = re.compile(r"\s+")
>>> sep.split(s)
['Hello,', 'World!']
To match that empty string before punctuation characters, a fairly simple lookahead assertion comes in handy.
>>> sep = re.compile(r"\s+|(?=[,!])")
If you're not familiar with lookahead assertions, this pattern matches "a sequence of one or more whitespace characters or an empty string (but only if that empty string is followed by a comma or an exclamation point)". So let's try it out:
>>> sep.split(s)
['Hello,', 'World!']
What now? That didn't change the result a bit. So is there something wrong with the regular expression?
>>> len(sep.findall(s))
3
There are three matches. Let's see where they are, just to be sure:
>>> sep.sub("-", s)
'Hello-,-World-!'
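To double-check, we can also ask finditer for the exact spans of those three matches. Each span is a (start, end) pair, and a zero-width match is one where start equals end:

```python
import re

sep = re.compile(r"\s+|(?=[,!])")
s = "Hello, World!"

# Two of the three matches are zero-width (start == end).
print([m.span() for m in sep.finditer(s)])
# [(5, 5), (6, 7), (12, 12)]
```

So the matches are exactly where we want them: before the comma, on the space, and before the exclamation point.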
Okay, so the separators are there, but two of them are silently ignored by split. How rude.
I'm not sure why, but this is simply how split was designed. Zero-length matches are always skipped when searching for separators to split the string by. Try the minimal example:
>>> re.sub(r"", "-", "foo")
'-f-o-o-'
>>> re.split(r"", "foo")
['foo']
Four matches, but no split.
I can't think of a use case where you'd simply want to split by the empty string. str.split will even raise a ValueError if you try, but with regular expressions, you have both explicit assertions like the one above, and special zero-width assertion metacharacters like $ and \b. There are definitely use cases for these, and it's awkward that the re module limits their use.
I'll see if I can't patch this up somehow.