Previously: Is this anything?.
Nope, it wasn't. Perhaps this is. I fixed some bugs and added the
ancestors thingie, which feels a bit kludgey at the moment. Maybe I should just stack them instead. Hmm. I don't know if I should really be doing this with a headache.

    #!/usr/bin/env python2.5

    from xml.etree.ElementTree import iterparse
    from xml.etree.ElementPath import _compile, xpath_descendant_or_self

    def iterfind(source, path):
        path = _compile(path).path
        tags = []       # tag names of the currently open elements
        ancestors = -1  # the kludgey bit: counts matches that are still open
        for event, elem in iterparse(source, ("start", "end")):
            # On "start" this runs before the new tag is pushed.
            match = _match_path(tags, path)
            if event == "end":
                if match:
                    yield elem
                    if not ancestors:
                        elem.clear()
                    ancestors -= 1
                tags.pop()
            else:
                tags.append(elem.tag)
                if match:
                    ancestors += 1

    def _match_path(tags, path):
        if tags == path:
            return True
        if not tags or not path:
            return False
        if isinstance(path[0], xpath_descendant_or_self):
            # "//": try the rest of the path at every depth.
            return any(_match_path(tags[i:], path[1:]) for i in range(len(tags)))
        else:
            if path[0] in ("*", tags[0]):
                return _match_path(tags[1:], path[1:])

    if __name__ == '__main__':
        # XX: Lame main procedure for my own testing
        import urllib2
        import sys
        from xml.etree.ElementTree import tostring
        data = urllib2.urlopen("http://infix.se/")
        for elem in iterfind(data, sys.argv[1]):
            print tostring(elem)

    $ ./iterfind.py './/h1/*'
    <span>Is this anything?</span>
    <span>Status: astray</span>
    <span>I don't have an accent</span>
    <span>PayPal, why do you hate me?</span>
    <span>Killing time</span>
    <span>Found elsewhere</span>
    <span>Search this site</span>
    <span>Subscribe by email</span>
    <span>Photostream</span>

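
The "just stack them" idea could look roughly like the sketch below. It is a guess at one way to do it, not what iterfind.py does. Assumptions: it targets Python 3 rather than 2.5, it swaps the private ElementPath helpers for a hypothetical match_tags() that takes a plain list of tag names (with "//" standing in for the descendant step, and the root tag included in the list), and iterfind_stacked() is a made-up name. Instead of the ancestors counter it keeps one boolean per open element, so an element is only cleared when no enclosing match still needs it.

```python
import io
from xml.etree.ElementTree import iterparse, tostring


def match_tags(tags, path):
    """Hypothetical matcher: does the list of open tags match the whole path?

    path is a list of tag names, "*" (any tag) or "//" (any number of
    intermediate levels, including none).
    """
    if not path:
        return not tags
    if path[0] == "//":
        # Try the rest of the path at every possible depth.
        return any(match_tags(tags[i:], path[1:]) for i in range(len(tags) + 1))
    if tags and path[0] in ("*", tags[0]):
        return match_tags(tags[1:], path[1:])
    return False


def iterfind_stacked(source, path):
    tags = []     # tag names of the currently open elements
    matched = []  # parallel stack: did each open element match the path?
    for event, elem in iterparse(source, events=("start", "end")):
        if event == "start":
            tags.append(elem.tag)
            matched.append(match_tags(tags, path))
        else:
            tags.pop()
            if matched.pop():
                yield elem
                # Runs once the caller asks for the next element: free this
                # one unless an enclosing match still needs its subtree.
                if not any(matched):
                    elem.clear()


if __name__ == "__main__":
    sample = b"<html><h1><span>one</span></h1><p><span>two</span></p></html>"
    for elem in iterfind_stacked(io.BytesIO(sample), ["//", "h1", "*"]):
        print(tostring(elem, encoding="unicode"))
```

Since elements are cleared as soon as the generator is resumed, anything you want from an element has to be pulled out inside the loop; collecting the elements into a list first would leave you with empty shells.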
I would like to have a fuller XPath implementation, though, to be able to pick things like all the a elements under the element with id="navigation". (It's not specifically for XHTML, though; it was just a convenient data set to test it on.) Maybe I should look into how lxml does it. Maybe it's overkill.
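
For comparison, here is what that query looks like without streaming. This is only a sketch under a few assumptions: it uses the ElementTree that ships with newer Pythons (written for Python 3), whose findall() understands simple attribute predicates, it reads the whole document into memory, and the sample markup and the navigation_links() name are made up for illustration. It also assumes the tags carry no namespace, which real XHTML usually does.

```python
import xml.etree.ElementTree as ET


def navigation_links(xml_text):
    """Return the href of every <a> below the element with id="navigation"."""
    root = ET.fromstring(xml_text)
    return [a.get("href") for a in root.findall(".//*[@id='navigation']//a")]


# Hypothetical sample document, not taken from any real site.
SAMPLE = """
<html>
  <body>
    <ul id="navigation">
      <li><a href="/">Home</a></li>
      <li><a href="/about">About</a></li>
    </ul>
    <p><a href="/elsewhere">Not in the navigation</a></p>
  </body>
</html>
"""

if __name__ == "__main__":
    for href in navigation_links(SAMPLE):
        print(href)
```

Getting the same thing out of a streaming iterfind would mean keeping the attributes of the open elements around as well as their tags, which is where it starts to feel like reimplementing a chunk of XPath.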