Loving ElementTree | infix.se

Loving ElementTree

By Filip Salomonsson; published on May 24, 2006.

I've been using ElementTree for about a year and a half, and I am madly in love with it.

I'm not one of those XML wonks who cheerfully throw angle brackets around whenever they get the chance; my relationship with XML consists mostly of parsing existing documents and doing fun stuff with what they hold.

How do I love thee? Let me count the ways...

Whether I'm extracting data from several-hundred-megabyte linguistically annotated corpora or headers from an XHTML document or whatnot, there are basically three things, or groups of things, in ElementTree that really turn my XML frown upside down.

  1. The iterparse function is absolutely wonderful. I would marry it if I could. For any kind of bulk processing - like extracting links from a web page or converting an entire corpus to a different format - it's often the only tool I need. Whenever I just need to do X for every element of type Y, iterparse does it, and brilliantly so.
  2. The findall method and its find* compadres support basic XPath expressions to select elements or groups of elements in a tree. When I need to be a little more picky about which elements to process, these guys will do the job.
  3. I used to hate processing HTML, simply because it's usually an utterly unprocessable pile of markup dung, but the TidyHTMLTreeBuilder in elementtidy hooks into Tidy and spits out a nice tree, pretty much no matter what kind of hellish HTML I throw at it.
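As a rough sketch of the first two points, here is roughly what "do X for every element of type Y" and a picky findall can look like. It assumes the ElementTree that ships in the current Python standard library (xml.etree.ElementTree); the sample corpus, its tag names, and the helpers count_tags and nouns are made up for illustration. The attribute predicate in the findall path needs ElementTree 1.3 or later.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; a real corpus would be read from a file.
SAMPLE = b"""<corpus>
  <sentence id="s1"><w pos="DT">The</w><w pos="NN">cat</w></sentence>
  <sentence id="s2"><w pos="NN">Dogs</w><w pos="VB">bark</w></sentence>
</corpus>"""


def count_tags(source, tag):
    """Stream through source and count the elements named tag."""
    count = 0
    # iterparse yields (event, element) pairs; by default only "end" events,
    # so each element is complete by the time we see it.
    for event, elem in ET.iterparse(source):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop the element's content to keep memory use low
    return count


def nouns(root):
    """Pick out the text of <w> elements tagged pos="NN" with a path expression."""
    return [w.text for w in root.findall("./sentence/w[@pos='NN']")]


print(count_tags(io.BytesIO(SAMPLE), "sentence"))
print(nouns(ET.fromstring(SAMPLE)))
```

Clearing each element once it has been handled is what makes this pattern workable on very large files, since the tree never has to be held in memory all at once.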

It's really fast

I'm quite impressed by the speed of ElementTree (or, more specifically, cElementTree). Sure, when I'm racing through a 200 MB corpus, doing some actual processing of pretty much every single element, it takes more than a few seconds. But it's fast. Really fast.

In the web crawler I'm working on, TidyHTMLTreeBuilder extracts URLs from up to a few thousand HTML documents per minute. Before I've even started wondering what it's doing, it's done.

Going the other way

Like I said, when it comes to XML, I'm much more of a consumer than a producer, but of course there are times when I need to churn out some of that extensible markup as well. And it's nothing short of a breeze with ElementTree. I toyed around with some RSS/Atom generation recently, and it was so easy I thought I was doing it wrong.
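To give a feel for how little code that takes, here is a minimal sketch of building a small Atom-like feed, assuming the standard library's xml.etree.ElementTree. The build_feed helper, the feed contents, and the example.com URL are hypothetical, and setting xmlns as a plain attribute is a shortcut rather than proper namespace handling; a real feed also needs more elements than this.

```python
import xml.etree.ElementTree as ET


def build_feed(title, entries):
    """Build a bare-bones, Atom-like <feed> from (title, url) pairs."""
    # Shortcut: the namespace is written as an ordinary attribute.
    feed = ET.Element("feed", {"xmlns": "http://www.w3.org/2005/Atom"})
    ET.SubElement(feed, "title").text = title
    for entry_title, url in entries:
        entry = ET.SubElement(feed, "entry")
        ET.SubElement(entry, "title").text = entry_title
        ET.SubElement(entry, "link", href=url)  # keyword args become attributes
    return feed


feed = build_feed("Example feed", [("First post", "http://example.com/1")])
print(ET.tostring(feed, encoding="unicode"))
```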

One thing amiss

Maybe I'm just missing something really obvious, but I can't find a one-shot way to create elements with simple text content, like <title>Foo</title>. I always find myself adding a TextElement function in such cases:

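from xml.etree.ElementTree import Element  # standalone package: elementtree.ElementTree

def TextElement(tag, text, *args, **kwargs):
    """Create an element and set its text content in one call."""
    elem = Element(tag, *args, **kwargs)
    elem.text = text
    return elem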
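A quick check of how the helper behaves, assuming the standard library's xml.etree.ElementTree; the tag names and the href value are only examples. Extra positional and keyword arguments are passed straight through to Element, so attributes work as usual.

```python
from xml.etree.ElementTree import Element, tostring


def TextElement(tag, text, *args, **kwargs):
    """Create an element and set its text content in one call."""
    elem = Element(tag, *args, **kwargs)
    elem.text = text
    return elem


title = TextElement("title", "Foo")
link = TextElement("a", "home", href="/")  # keyword args become attributes

print(tostring(title, encoding="unicode"))  # <title>Foo</title>
print(tostring(link, encoding="unicode"))   # <a href="/">home</a>
```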

But I guess I can live with that.