I want my zero-width splits!

By Filip Salomonsson; published on June 05, 2006.

Python is great for a lot of things in computational linguistics, but there are times when you run into a dead end.

Let's look at tokenization - the segmentation of text into words, or tokens.

As an example, the expected tokenization of the string "Hello, World!" would in most cases be ['Hello', ',', 'World', '!']. That's what you get if you, for example, define a token as "a sequence of alphanumeric characters or any other single non-whitespace character".

import re
token = re.compile(r"\w+|\S")  # a run of alphanumerics, or any single non-whitespace character

Let's try it out:

>>> s = "Hello, World!"
>>> token.findall(s)
['Hello', ',', 'World', '!']

Excellent!

You can get a long way with this approach, but sometimes it is easier to define what separates the tokens, and chop up the text based on that. Let's say a sequence of whitespace characters is always a separator. Piece of cake! We can use str.split() for that.

>>> s.split()
['Hello,', 'World!']

But wait, we wanted to split off the punctuation characters, right? So another boundary is the empty string that precedes the comma and the exclamation point. str.split can't do that, so let's go back to the re toolbox. First, a direct translation of the str.split call:

>>> sep = re.compile(r"\s+")
>>> sep.split(s)
['Hello,', 'World!']
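
Note that simply making the punctuation part of the separator pattern won't do, because split throws the separators away, and we want to keep the punctuation as tokens:

>>> re.split(r"[\s,!]+", s)
['Hello', 'World', '']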

To match that empty string before punctuation characters, a fairly simple lookahead assertion comes in handy.

>>> sep = re.compile(r"\s+|(?=[,!])")

In case you're not familiar with lookahead assertions: this pattern matches "a sequence of one or more whitespace characters, or an empty string (but only if that empty string is followed by a comma or an exclamation point)". So let's try it out:

>>> sep.split(s)
['Hello,', 'World!']

What now? That didn't change the result one bit. So is there something wrong with the regular expression?

>>> len(sep.findall(s))
3

There are three matches. Let's see where they are, just to be sure:

>>> sep.sub("-", s)
'Hello-,-World-!'

Okay, so the separators are there, but two of them are silently ignored by split. How rude.

I'm not sure why, but this is simply how split was designed. Zero-length matches are always skipped when searching for separators to split the string by. Try the minimal example:

>>> re.sub(r"", "-", "foo")
'-f-o-o-'
>>> re.split(r"", "foo")
['foo']

Four matches, but no split.
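
And just to double-check that count, findall sees the same four (empty) matches:

>>> re.findall(r"", "foo")
['', '', '', '']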

I can't think of a use case where you'd simply want to split on the empty string; str.split will even raise a ValueError if you try. But with regular expressions, you have both explicit assertions like the one above and special zero-width assertion metacharacters like $ and \b. There are definitely use cases for those, and it's awkward that the re module limits their use.
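
To see the contrast: str.split refuses up front, while re.split quietly does nothing. The \b case shows what gets lost (these results are from the current re module):

>>> "foo".split("")
Traceback (most recent call last):
  ...
ValueError: empty separator
>>> re.split(r"\b", "Hello, World!")
['Hello, World!']

Until split itself behaves, here's a rough sketch of a workaround in plain Python. re.finditer, unlike re.split, does report zero-width matches (that's how the sub example above could insert all three dashes), so we can walk over the separator matches and slice the string up by hand. The name zero_width_split is just something I made up; consider this a sketch, not a tested patch.

def zero_width_split(pattern, text):
    # Slice text around every separator match, including the
    # zero-width ones that re.split silently skips.
    pieces = []
    position = 0
    for match in re.finditer(pattern, text):
        pieces.append(text[position:match.start()])
        position = match.end()
    pieces.append(text[position:])
    return pieces

>>> zero_width_split(r"\s+|(?=[,!])", "Hello, World!")
['Hello', ',', 'World', '!']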

I'll see if I can't patch this up somehow.