Text-safe XML processing with iterparse

By Filip Salomonsson; published on May 10, 2009. Tags: elementtree lxml python xml

The ElementTree API makes XML processing in Python a breeze, and the iterparse function alone can probably handle 80% of your XML processing needs. I love it.

But did you know you can lose data with it if you're not careful?

Don't worry - it's not a bug, but there are edge cases you should be aware of.

The problem

The documentation is clear:

iterparse() only guarantees that it has seen the ">" character of a starting tag when it emits a "start" event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present.

If you need a fully populated element, look for "end" events instead.

As a rule, you should only use start events to inspect and/or modify the element's tag and its attributes.

You probably knew that already.
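As a small illustration of that rule, here is a sketch written for Python 3 (an assumption; the sessions in this post use Python 2). The helper name `collect` and the sample document are made up for the example. It reads attributes on start events and text on end events only.

```python
import io
import xml.etree.ElementTree as ET

def collect(source):
    """Hypothetical helper: attributes from "start" events, text from "end" events."""
    attrs_seen = []
    texts_seen = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Tag and attributes are guaranteed to be available here.
            attrs_seen.append((elem.tag, dict(elem.attrib)))
        else:
            # Text is only guaranteed once the element has ended.
            texts_seen.append((elem.tag, elem.text))
    return attrs_seen, texts_seen

attrs_seen, texts_seen = collect(io.BytesIO(b'<doc><foo id="1">hello</foo></doc>'))
```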

If you follow the link from Fredrik Lundh's iterparse page to a python-sig message from 2005, you'll see something that may not be as well known: the availability of the tail attribute during end events isn't guaranteed either.

You may not have known that.

The suggested remedy for the text attribute is simple: only touch it on end events. In most cases, you never even look at start events anyway, so that's a fine solution.

But what about tail? I rarely work with XML documents that have tail data, but when I do, this is an important issue. To be sure not to lose data, you'll have to do something about it.

Luckily, there's a simple solution, but first, let's look at why this happens.

The cause

It all has to do with how the parsing works.

iterparse feeds data to the parser in 16-kilobyte chunks, and it fires off all events it can for each chunk. Then the events are handed over to you, one by one.
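That feed-then-drain loop can be approximated with Python 3's `XMLPullParser`. This is only a sketch of the idea, not the actual iterparse source, and `chunked_events` and `trace` are hypothetical helpers. A tiny chunk size makes the effect visible without a 16-kilobyte document.

```python
import io
import xml.etree.ElementTree as ET

def chunked_events(source, chunk_size=16 * 1024, events=("start", "end")):
    """Hypothetical helper: feed fixed-size chunks, draining events after each one."""
    parser = ET.XMLPullParser(events=events)
    while True:
        data = source.read(chunk_size)
        if not data:
            break
        parser.feed(data)
        # Every event the parser could produce from this chunk is queued by now;
        # only then does the caller get to see them, one by one.
        yield from parser.read_events()
    parser.close()
    yield from parser.read_events()

def trace(xml_bytes, chunk_size):
    """Record (event, tag, text) as seen at the moment each event is handed over."""
    return [(event, elem.tag, elem.text)
            for event, elem in chunked_events(io.BytesIO(xml_bytes), chunk_size)]

DOC = b"<doc><foo>hello</foo></doc>"
one_chunk = trace(DOC, chunk_size=1024)  # whole document fits in a single chunk
split_text = trace(DOC, chunk_size=12)   # first chunk is b"<doc><foo>he"
```

In `one_chunk` the start event for foo already carries its text, because the whole element was parsed before any event was handed over. In `split_text` the start event arrives with `text` still unset, and only the end event sees "hello".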

Say there's a foo element whose content is the text "hello".

...<foo>hello</foo>...

As long as all of the text is in the same chunk as the preceding ">", the text attribute will be set during the start event. We can try it out:

>>> import xml.etree.cElementTree as etree
>>> from cStringIO import StringIO
>>> doc = StringIO("<doc><foo>hello</foo></doc>")
>>> for event, elem in etree.iterparse(doc, ("start", "end")):
...     print event, elem.tag, elem.text or ""
... 
start doc
start foo hello
end foo hello
end doc

On the other hand, if a chunk ends in the middle of that text (or immediately after the start tag, before the text), iterparse will hand you a start event for the foo element without the text attribute set. The parser only fills it in later, when it processes the next chunk and reaches the end of the element.

          |
...<foo>he|llo</foo>...
          |

Let's trigger this by adding a long comment before the foo element.

>>> padding = "x" * 16365
>>> doc2 = StringIO("<doc><!--%s--><foo>hello</foo></doc>" % padding)
>>> for event, elem in etree.iterparse(doc2, ("start", "end")):
...     print event, elem.tag, elem.text or ""
... 
start doc
start foo
end foo hello
end doc

Now the chunk ends after "he", and this time the foo element's text attribute isn't set during the start event.
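In case the number 16365 looks arbitrary, here is the arithmetic behind it as a short sketch. It assumes the 16-kilobyte chunk size mentioned earlier, and `padding_needed` is a made-up helper name.

```python
CHUNK = 16 * 1024  # chunk size quoted in the text (16384 characters)

def padding_needed(before, after, chunk=CHUNK):
    """Filler length so that before + filler + after fills exactly one chunk."""
    return chunk - len(before) - len(after)

# Everything up to and including "he" should land in the first chunk.
n = padding_needed("<doc><!--", "--><foo>he")
first_chunk = "<doc><!--" + "x" * n + "--><foo>he"
```

Here `n` comes out as 16384 - 9 - 10 = 16365, and `first_chunk` is exactly 16384 characters long.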

The issue with tail is exactly the same. We can trigger this by using a long comment again. This time, we'll use an empty foo element. The first chunk now ends after the "h" in "hello".

>>> doc3 = StringIO("<doc><!--%s--><foo/>hello</doc>" % padding)
>>> for event, elem in etree.iterparse(doc3, ("start", "end")):
...     print event, elem.tag, elem.tail or ""
... 
start doc 
start foo 
end foo 
end doc

No tail text to be seen.

The solution

Both text and tail data end when another start or end tag occurs. Each of these triggers a new event, so we can use a wrapper that stays one step ahead, making sure the next event has always been triggered before it lets us see the current one.

Here's our "delayed iterator":

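def delayediter(iterable):
    # Yield each item only after the following one has been fetched.
    iterable = iter(iterable)
    prev = iterable.next()  # hold back the first item
    for item in iterable:
        yield prev          # the item after prev now exists
        prev = item
    yield prev              # nothing follows the last item; release it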

Let's try it out on the last two examples above.

>>> doc2.seek(0) # "rewind" the stringio object
>>> context = etree.iterparse(doc2, ("start", "end"))
>>> for event, elem in delayediter(context):
...     print event, elem.tag, elem.text or ""
... 
start doc 
start foo hello
end foo hello
end doc 
>>> doc3.seek(0)
>>> context = etree.iterparse(doc3, ("start", "end"))
>>> for event, elem in delayediter(context):
...     print event, elem.tag, elem.tail or ""
... 
start doc 
start foo 
end foo hello
end doc

Success! This works both for Fredrik Lundh's ElementTree (which has been in the standard library since Python 2.5) and for Stefan Behnel's excellent lxml.
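The code in this post is written for Python 2. On Python 3, iterators have no `.next()` method, and a `StopIteration` escaping from inside a generator is turned into a `RuntimeError`, so a port needs two small changes. The following is a sketch of such a port, not something from the original sessions. It assumes `xml.etree.ElementTree.iterparse` still reads the source in 16-kilobyte chunks, and it replays the tail example with a bytes document.

```python
import io
import xml.etree.ElementTree as ET

def delayediter(iterable):
    """Yield items one step late, so the next item already exists when you see this one."""
    iterator = iter(iterable)
    try:
        prev = next(iterator)
    except StopIteration:
        return  # empty input: nothing to yield
    for item in iterator:
        yield prev
        prev = item
    yield prev

# Same layout as the tail example: the text after <foo/> straddles the
# first chunk boundary (assuming a 16 KB read size).
padding = "x" * 16365
xml_bytes = ("<doc><!--%s--><foo/>hello</doc>" % padding).encode("ascii")

context = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
seen = [(event, elem.tag, elem.tail) for event, elem in delayediter(context)]
```

By the time the end event for foo is handed over, the following event has already been produced, so its tail is populated regardless of where the chunk boundary falls.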

So, from now on, all your iterparsing should be text-safe. (With lxml, there are still special cases where this may not quite suffice, but we'll come back to that another time.) Happy coding!

Agree? Disagree? Found a bug? Talk back at filip.salomonsson@gmail.com.