Loving ElementTree
I've been using ElementTree for about a year and a half, and I am madly in love with it.
I'm not one of those XML wonks who cheerfully throw angle brackets around whenever they get the chance; my relationship with XML consists mostly of parsing existing documents and doing fun stuff with what they hold.
How do I love thee? Let me count the ways...
Whether I'm extracting data from several-hundred-megabyte linguistically annotated corpora or headers from an XHTML document or whatnot, there are basically three things, or groups of things, in ElementTree that really turn my XML frown upside down.
- The iterparse function is absolutely wonderful. I would marry it if I could. For any kind of bulk processing - like extracting links from a web page or converting an entire corpus to a different format - it's often the only tool I need. Whenever I just need to do X for every element of type Y, iterparse does it, and brilliantly so.
- The findall method and its find* compadres support basic XPath expressions to select elements or groups of elements in a tree. When I need to be a little more picky about which elements to process, these guys will do the job.
- I used to hate processing HTML, simply because it's usually an utterly unprocessable pile of markup dung, but the TidyHTMLTreeBuilder in elementtidy hooks into Tidy and spits out a nice tree, pretty much no matter what kind of hellish HTML I throw at it.
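To make the first two points concrete, here is a minimal sketch of the "do X for every element of type Y" pattern with iterparse, followed by a couple of find/findall path queries. It assumes the standard-library location of ElementTree (xml.etree.ElementTree); the sample document, its tag names, and the iter_hrefs helper are made up for illustration and do not come from any real corpus.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
SAMPLE = b"""<corpus>
  <doc id="a"><link href="http://example.com/1"/><w pos="N">cat</w></doc>
  <doc id="b"><link href="http://example.com/2"/><w pos="V">sat</w></doc>
</corpus>"""


def iter_hrefs(source):
    """Yield the href of every <link> element while streaming through source."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "link":
            yield elem.get("href")
        # Drop the contents of elements we are done with, so memory use
        # stays low on large inputs.
        elem.clear()


# Streaming: visit every <link> without keeping the full tree around.
hrefs = list(iter_hrefs(io.BytesIO(SAMPLE)))

# Being pickier: path expressions on a fully parsed tree.
root = ET.fromstring(SAMPLE)
nouns = [w.text for w in root.findall("./doc/w[@pos='N']")]
second_word = root.find("doc[@id='b']/w").text
```

Clearing each element after its "end" event is what keeps the streaming version cheap on big files; if later processing needs the subtree, the clear() call has to be restricted to elements that are truly finished.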
It's really fast
I'm quite impressed by the speed of ElementTree (or, more specifically, cElementTree). Sure, when I'm racing through a 200 MB corpus, doing some actual processing of pretty much every single element, it takes more than a few seconds. But it's fast. Really fast.
In the web crawler I'm working on, TidyHTMLTreeBuilder extracts URLs from up to a few thousand HTML documents per minute. Before I've even started wondering what it's doing, it's done.
Going the other way
Like I said, when it comes to XML, I'm much more of a consumer than a producer, but of course there are times when I need to churn out some of that extensible markup as well. And it's nothing short of a breeze with ElementTree. I toyed around with some RSS/Atom generation recently, and it was so easy I thought I was doing it wrong.
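As a rough idea of what that kind of generation can look like, here is a small sketch that builds an Atom-flavoured feed with Element and SubElement. The build_feed function and its arguments are hypothetical, and the result is deliberately incomplete (a valid Atom feed also needs elements such as id and updated), so treat it as an illustration of the API rather than a feed generator.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_feed(title, entries):
    """Return a minimal Atom-like feed as a string.

    entries is an iterable of (entry_title, url) pairs (hypothetical shape).
    """
    # Declare the default namespace as a plain attribute on the root.
    feed = ET.Element("feed", xmlns=ATOM_NS)
    ET.SubElement(feed, "title").text = title
    for entry_title, url in entries:
        entry = ET.SubElement(feed, "entry")
        ET.SubElement(entry, "title").text = entry_title
        ET.SubElement(entry, "link", href=url)
    return ET.tostring(feed, encoding="unicode")


xml_text = build_feed("My feed", [("Hello", "http://example.com/hello")])
print(xml_text)
```

Every element with text content takes two steps here (create, then set .text), which is the small annoyance the next section is about.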
One thing amiss
I'm not sure I'm not just missing something really obvious, but I seem to miss a one-shot way to create elements with simple text contents, like <title>Foo</title>. I always find myself adding a TextElement function in such cases:
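    # Element lives in xml.etree.ElementTree in the standard library
    from xml.etree.ElementTree import Element

    def TextElement(tag, text, *args, **kwargs):
        # Create the element, then set its text content in one go
        elem = Element(tag, *args, **kwargs)
        elem.text = text
        return elem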
But I guess I can live with that.
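For what it's worth, the same trick extends naturally to child elements. The sketch below repeats TextElement so that it runs on its own, and adds a hypothetical SubTextElement companion (not part of ElementTree) that appends the new element to a parent, in the spirit of SubElement.

```python
import xml.etree.ElementTree as ET


def TextElement(tag, text, *args, **kwargs):
    """Create an element and set its text content in one call."""
    elem = ET.Element(tag, *args, **kwargs)
    elem.text = text
    return elem


def SubTextElement(parent, tag, text, *args, **kwargs):
    """Hypothetical companion: create a text element and append it to parent."""
    elem = TextElement(tag, text, *args, **kwargs)
    parent.append(elem)
    return elem


channel = ET.Element("channel")
SubTextElement(channel, "title", "Foo")
SubTextElement(channel, "link", "http://example.com/", rel="alternate")
result = ET.tostring(channel, encoding="unicode")
print(result)
# prints: <channel><title>Foo</title><link rel="alternate">http://example.com/</link></channel>
```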