The TidyHTMLTreeBuilder from elementtidy, which I use together with cElementTree, is quite nice, but I usually don't want the extra processing that Tidy adds (the XHTML namespace prefixes, the generator meta tag, the reformatted whitespace). lxml's HTML parser seems to focus more on what I want: going from HTML to an element tree.
>>> import cElementTree as etree
>>> from elementtidy import TidyHTMLTreeBuilder
>>> from cStringIO import StringIO
>>> html = "<html><head><title>Hello<body><H1>Hi!</h1>Foo<p>Bar<br>baz"
>>> tree = TidyHTMLTreeBuilder.parse(StringIO(html))
>>> etree.tostring(tree.getroot())
'<html:html xmlns:html="http://www.w3.org/1999/xhtml">\n<html:head>\n <html:meta content="HTML Tidy for Linux/x86 (vers 1st July 2003), see www.w3.org" name="generator" />\n<html:title>Hello</html:title>\n </html:head>\n<html:body>\n<html:h1>Hi!</html:h1>\nFoo\n <html:p>Bar<html:br />\nbaz</html:p>\n</html:body>\n</html:html>'
>>> from lxml import etree
>>> html = "<html><head><title>Hello<body><H1>Hi!</h1>Foo<p>Bar<br>baz"
>>> etree.tostring(etree.HTML(html))
'<html><head><title>Hello</title></head><body><h1>Hi!</h1> <p>Foo</p><p>Bar<br/>baz</p></body></html>'
I think this would make some of the parsing I do in my web crawler a bit easier. I'll definitely look into it.
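As a rough idea of what that could look like in a crawler, here is a minimal sketch that pulls link targets out of sloppy markup with lxml. It is not code from my crawler: the `extract_links` helper and the sample markup are made up for illustration, and it assumes Python 3 with lxml installed rather than the Python 2 setup used in the sessions above.

```python
from lxml import etree


def extract_links(html):
    """Return the href values of all <a> elements in an HTML string.

    Hypothetical helper for illustration; it relies on lxml's HTML
    parser to cope with unclosed tags and mixed-case tag names.
    """
    root = etree.HTML(html)
    # The parser lowercases tag and attribute names, so a lowercase
    # XPath expression also matches <A HREF="...">.
    return [str(href) for href in root.xpath("//a/@href")]


# Tag soup in the same spirit as the example above: unclosed <p>,
# mixed-case tags, and no closing </body> or </html>.
soup = (
    "<html><body><H1>Links</h1>"
    "<p><a href='/a'>one</a><br>"
    '<p><A HREF="/b">two</a>'
)
links = extract_links(soup)
print(links)
```

Because the result is an ordinary element tree with no namespace prefixes, the same XPath query would work unchanged on well-formed pages too.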