What about now?
Previously: Is this anything?
Nope, it wasn't. Perhaps this is. I fixed some bugs and added the ancestors counter (so that a match nested inside another match doesn't get cleared too early), which feels a bit kludgey at the moment. Maybe I should just keep them on a stack instead. Hmm. I don't know if I should really be doing this with a headache.
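#!/usr/bin/env python2.5
from xml.etree.ElementTree import iterparse
# Note: these are private ElementPath names, as found in Python 2.5.
from xml.etree.ElementPath import _compile, xpath_descendant_or_self

def iterfind(source, path):
    path = _compile(path).path
    tags = []       # tags from the root down to the current element
    ancestors = -1  # open matches enclosing the current match
    for event, elem in iterparse(source, ("start", "end")):
        if event == "start":
            # Push the tag before matching, so that both events test
            # the path of the element itself and not of its parent.
            tags.append(elem.tag)
            if _match_path(tags, path):
                ancestors += 1
        else:
            if _match_path(tags, path):
                yield elem
                # Only clear when no enclosing match still needs the subtree.
                if not ancestors:
                    elem.clear()
                ancestors -= 1
            tags.pop()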

def _match_path(tags, path):
    # Both exhausted at the same time: the whole path matched.
    if tags == path: return True
    if not tags or not path: return False
    if isinstance(path[0], xpath_descendant_or_self):
        # "//" may skip any number of leading tags.
        return any(_match_path(tags[i:], path[1:])
                   for i in range(len(tags)))
    else:
        if path[0] in ("*", tags[0]):
            return _match_path(tags[1:], path[1:])
    return False

if __name__ == '__main__':
    # XX: Lame main procedure for my own testing
    import urllib2
    import sys
    from xml.etree.ElementTree import tostring
    data = urllib2.urlopen("http://infix.se/")
    for elem in iterfind(data, sys.argv[1]):
        print tostring(elem)
So it's kind of like iterparse, except you only get the elements that match a simple xpath expression - like the ones ElementTree can handle. Example:
$ ./iterfind.py './/h1/*'
<span>Is this anything?</span>
<span>Status: astray</span>
<span>I don't have an accent</span>
<span>PayPal, why do you hate me?</span>
<span>Killing time</span>
<span>Found elsewhere</span>
<span>Search this site</span>
<span>Subscribe by email</span>
<span>Photostream</span>
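The "stack them" idea from earlier could look roughly like the sketch below. It is an illustration, not the script above: it targets Python 3 and sticks to the public iterparse call, so the private ElementPath helpers are swapped for a tiny hand-rolled path parser. The names compile_path, match_path and DESCENDANT are made up for this example, and the parser only understands plain tags, "*", "." and "//" (no namespaces, no predicates).

```python
import io
from xml.etree.ElementTree import iterparse

DESCENDANT = object()  # hypothetical marker standing in for "//"

def compile_path(path):
    """Split a path such as './/h1/*' into steps (assumed mini-syntax)."""
    steps = []
    for part in path.split("/"):
        if part == ".":
            continue                  # "." refers to the current context
        if part == "":
            steps.append(DESCENDANT)  # an empty part comes from "//"
        else:
            steps.append(part)        # a tag name or "*"
    return steps

def match_path(tags, steps):
    """True if the tag stack (root first) matches every step."""
    if not steps:
        return not tags
    if steps[0] is DESCENDANT:
        # "//" may skip any number of leading tags.
        return any(match_path(tags[i:], steps[1:]) for i in range(len(tags)))
    if not tags:
        return False
    return steps[0] in ("*", tags[0]) and match_path(tags[1:], steps[1:])

def iterfind(source, path):
    steps = compile_path(path)
    tags = []     # tags of the currently open elements
    matched = []  # parallel stack: does each open element match?
    for event, elem in iterparse(source, events=("start", "end")):
        if event == "start":
            tags.append(elem.tag)
            matched.append(match_path(tags, steps))
        else:
            tags.pop()
            if matched.pop():
                yield elem
                # Clear only if no enclosing match still needs this subtree.
                if not any(matched):
                    elem.clear()

sample = (b"<html><body><h1><span>one</span><span>two</span></h1>"
          b"<p><span>no</span></p></body></html>")
texts = [e.text for e in iterfind(io.BytesIO(sample), ".//h1/*")]
print(texts)  # ['one', 'two']
```

Keeping one flag per open element means each element is matched only once (on its start event), and the "is anything above me a match?" question becomes a plain any() over the stack instead of a counter that starts at -1.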
I would like to have a fuller xpath implementation, though, to be able to pick things like 'all the a elements under the div with id="navigation"'. (It's not specifically for XHTML, though; it was just a convenient data set to test it on.) Maybe I should look into how lxml does it. Maybe it's overkill.
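For comparison, the ElementTree bundled with current Python 3 does accept simple attribute predicates in its find methods, so that particular query can be written directly. The catch is that this works on a fully parsed tree rather than a stream, so it is only a sketch of the query itself and not a replacement for the iterparse approach. The sample markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, just to have something to query.
doc = """<html><body>
<div id="navigation">
  <ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>
</div>
<div id="content"><a href="/c">C</a></div>
</body></html>"""

root = ET.fromstring(doc)

# [@id='navigation'] is an attribute predicate; the second "//" picks
# a elements at any depth below the matching div.
links = root.findall(".//div[@id='navigation']//a")
hrefs = [a.get("href") for a in links]
print(hrefs)  # ['/a', '/b']
```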