I've been using ElementTree for about a year and a half, and I am madly in love with it.
I'm not one of those XML wonks who cheerfully throw angle brackets around whenever they get the chance; my relationship with XML consists mostly of parsing existing documents and doing fun stuff with what they hold.
How do I love thee? Let me count the ways...
Whether I'm extracting data from several-hundred-megabyte linguistically annotated corpora or headers from an XHTML document or whatnot, there are basically three things, or groups of things, in ElementTree that really turn my XML frown upside down.
- The iterparse function is absolutely wonderful. I would marry it if I could. For any kind of bulk processing - like extracting links from a web page or converting an entire corpus to a different format - it's often the only tool I need. Whenever I just need to do X for every element of type Y, iterparse does it, and brilliantly so.
- The findall method and its find* compadres support basic XPath expressions to select elements or groups of elements in a tree. When I need to be a little more picky about which elements to process, these guys will do the job.
- I used to hate processing HTML, simply because it's usually an utterly unprocessable pile of markup dung, but elementtidy hooks into Tidy and spits out a nice tree, pretty much no matter what kind of hellish HTML I throw at it.
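To make the first two points concrete, here is a minimal sketch of "do X for every element of type Y" with iterparse, next to a pickier selection with findall. It assumes the standard-library xml.etree.ElementTree (which supports attribute predicates in paths; older standalone releases may not), and the sample document and function names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, only here so the sketch runs on its own.
SAMPLE = b"""<html>
  <body>
    <div id="nav"><a href="/home">Home</a><a href="/about">About</a></div>
    <div id="main"><p>See <a href="http://example.com/">this</a>.</p></div>
  </body>
</html>"""


def links_streaming(source):
    """Yield the href of every <a> element, streaming with iterparse."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "a":
            href = elem.get("href")
            if href is not None:
                yield href
        # The element is complete at its "end" event, so it can be emptied
        # to keep memory use low on very large files.
        elem.clear()


def links_in_main(source):
    """Return hrefs of <a> elements inside <div id="main"> using findall."""
    tree = ET.parse(source)
    return [a.get("href") for a in tree.findall(".//div[@id='main']//a")]


if __name__ == "__main__":
    print(list(links_streaming(io.BytesIO(SAMPLE))))
    print(links_in_main(io.BytesIO(SAMPLE)))
```

The streaming version never holds more than a sliver of the document at once, which is what makes it pleasant for corpus-sized input; the findall version builds the whole tree first, in exchange for being able to say exactly which elements it wants.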
It's really fast
I'm quite impressed by the speed of ElementTree (or, more specifically, cElementTree). Sure, when I'm racing through a 200 MB corpus, doing some actual processing of pretty much every single element, it takes more than a few seconds. But it's fast. Really fast.
In the web crawler I'm working on, TidyHTMLTreeBuilder extracts URLs from up to a few thousand HTML documents per minute. Before I've even started wondering what it's doing, it's done.
Going the other way
Like I said, when it comes to XML, I'm much more of a consumer than a producer, but of course there are times when I need to churn out some of that extensible markup as well. And it's nothing short of a breeze with ElementTree. I toyed around with some RSS/Atom generation recently, and it was so easy I thought I was doing it wrong.
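For a rough idea of what that looks like, here is a minimal sketch of building a tiny Atom-style document with the standard-library xml.etree.ElementTree. The build_feed function and its arguments are hypothetical, and the result is deliberately incomplete (a real Atom feed also needs elements such as id and updated).

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_feed(title, entries):
    """Return a small Atom-like feed as a string.

    `entries` is an iterable of (entry_title, url) pairs; both the function
    name and this input shape are assumptions made for the example.
    """
    # Serialize the Atom namespace as the default namespace (no prefix).
    ET.register_namespace("", ATOM_NS)

    feed = ET.Element("{%s}feed" % ATOM_NS)
    ET.SubElement(feed, "{%s}title" % ATOM_NS).text = title

    for entry_title, url in entries:
        entry = ET.SubElement(feed, "{%s}entry" % ATOM_NS)
        ET.SubElement(entry, "{%s}title" % ATOM_NS).text = entry_title
        # Keyword arguments to SubElement become attributes.
        ET.SubElement(entry, "{%s}link" % ATOM_NS, href=url)

    return ET.tostring(feed, encoding="unicode")


if __name__ == "__main__":
    print(build_feed("My feed", [("First post", "http://example.com/1")]))
```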
One thing amiss
I'm not sure I'm not just missing something really obvious, but I seem to miss a one-shot way to create elements with simple text contents, like <title>Foo</title>. I always find myself adding a TextElement function in such cases:
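    # Standard-library location; standalone installs use elementtree.ElementTree.
    from xml.etree.ElementTree import Element

    def TextElement(tag, text, *args, **kwargs):
        # Create the element as usual, then set its text in the same call.
        elem = Element(tag, *args, **kwargs)
        elem.text = text
        return elem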
But I guess I can live with that.