What about now?

By Filip Salomonsson; published on November 25, 2006.

Previously: Is this anything?

Nope, it wasn't. Perhaps this is. I fixed some bugs and added the ancestors thingie (a counter that keeps a matched element from being cleared while an enclosing match is still open), which feels a bit kludgey at the moment. Maybe I should just stack them instead. Hmm. I don't know if I should really be doing this with a headache.

#!/usr/bin/env python2.5

from xml.etree.ElementTree import iterparse
from xml.etree.ElementPath import _compile, xpath_descendant_or_self

def iterfind(source, path):
    path = _compile(path).path
    tags = []
    ancestors = -1
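    for event, elem in iterparse(source, ("start", "end")):
        if event == "start":
            # Track the chain of open tags, root first.
            tags.append(elem.tag)
            if _match_path(tags, path): ancestors += 1
        else:
            if _match_path(tags, path):
                yield elem
                # Only clear when no enclosing match is still open.
                if not ancestors: elem.clear()
                ancestors -= 1
            tags.pop()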

def _match_path(tags, path):
    if tags == path: return True
    if not tags or not path: return False
    if isinstance(path[0], xpath_descendant_or_self):
        # "//" step: try the rest of the path at every depth.
        return any(_match_path(tags[i:], path[1:])
                   for i in range(len(tags)))
    if path[0] in ("*", tags[0]):
        return _match_path(tags[1:], path[1:])
    return False

if __name__ == '__main__':
    # XX: Lame main procedure for my own testing
    import urllib2
    import sys
    from xml.etree.ElementTree import tostring

    data = urllib2.urlopen("http://infix.se/")
    for elem in iterfind(data, sys.argv[1]):
        print tostring(elem)

So it's kind of like iterparse, except you only get the elements that match a simple XPath expression, like the ones ElementTree can handle. Example:

$ ./iterfind.py './/h1/*'
<span>Is this anything?</span>
<span>Status: astray</span>
<span>I don't have an accent</span>
<span>PayPal, why do you hate me?</span>
<span>Killing time</span>
<span>Found elsewhere</span>
<span>Search this site</span>
<span>Subscribe by email</span>
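
For anyone who wants to poke at the idea without the private ElementPath bits, here is a rough, self-contained sketch in present-day Python 3. It is not the script above: parse_path, match_path, DESCENDANT and the SAMPLE document are made-up names for illustration, and the path splitting assumes only plain tag names, "*" and "//" steps. As in the script, the path is matched against the full chain of open tags, root included.

```python
import io
from xml.etree.ElementTree import iterparse

DESCENDANT = object()  # hypothetical marker standing in for a "//" step


def parse_path(path):
    """Split a simple path such as './/h1/*' into steps (illustrative only)."""
    steps = []
    for part in path.split("/"):
        if part == ".":
            continue
        # An empty part comes from "//", i.e. "any depth from here".
        steps.append(DESCENDANT if part == "" else part)
    return steps


def match_path(tags, steps):
    """Return True if the chain of open tags matches the steps exactly."""
    if not steps:
        return not tags
    if steps[0] is DESCENDANT:
        # Try the remaining steps at every depth of the tag chain.
        return any(match_path(tags[i:], steps[1:]) for i in range(len(tags)))
    if not tags:
        return False
    if steps[0] in ("*", tags[0]):
        return match_path(tags[1:], steps[1:])
    return False


def iterfind(source, path):
    """Yield elements matching path while the document is being parsed."""
    steps = parse_path(path)
    tags = []          # tags of the currently open elements, root first
    open_matches = 0   # matched elements that are still open
    for event, elem in iterparse(source, events=("start", "end")):
        if event == "start":
            tags.append(elem.tag)
            if match_path(tags, steps):
                open_matches += 1
        else:
            if match_path(tags, steps):
                open_matches -= 1
                yield elem
                # Clear only when no enclosing match still needs this subtree.
                if open_matches == 0:
                    elem.clear()
            tags.pop()


SAMPLE = (b"<html><body><h1><span>One</span></h1><p><span>skip</span></p>"
          b"<h1><span>Two</span><em>Three</em></h1></body></html>")

found = [(e.tag, e.text) for e in iterfind(io.BytesIO(SAMPLE), ".//h1/*")]
print(found)  # [('span', 'One'), ('span', 'Two'), ('em', 'Three')]
```

Note that each element has to be used before asking for the next one, since it may be cleared as soon as the generator resumes.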

I would like to have a fuller XPath implementation, though, to be able to pick out things like "all the a elements under the div with id='navigation'". (It's not specifically for XHTML; that was just a convenient data set to test it on.) Maybe I should look into how lxml does it. Maybe it's overkill.
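
Short of a real XPath engine, that particular query could perhaps be handled by keeping the open elements themselves on a stack rather than just their tags, so their attributes can be inspected. A minimal Python 3 sketch under that assumption follows; iter_under, its parameters and the DOC sample are invented for illustration, it does no clearing, and it ignores namespaces (real XHTML tags would carry a namespace prefix in elem.tag).

```python
import io
from xml.etree.ElementTree import iterparse


def iter_under(source, tag, ancestor_tag, attr, value):
    """Yield `tag` elements having an open `ancestor_tag` with attr == value."""
    stack = []  # currently open elements, root first
    for event, elem in iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
        else:
            stack.pop()  # what remains on the stack are elem's ancestors
            if elem.tag == tag and any(
                anc.tag == ancestor_tag and anc.get(attr) == value
                for anc in stack
            ):
                yield elem


DOC = (b'<html><body>'
       b'<div id="navigation"><ul><li><a href="/a">A</a></li>'
       b'<li><a href="/b">B</a></li></ul></div>'
       b'<div id="content"><a href="/c">C</a></div>'
       b'</body></html>')

links = [e.get("href")
         for e in iter_under(io.BytesIO(DOC), "a", "div", "id", "navigation")]
print(links)  # ['/a', '/b']
```

This only covers a single "descendant of an element with this attribute" test, so it is nowhere near general predicates; it just shows that the streaming approach does not rule them out.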