HTML Parsing with lxml

By Filip Salo; published on June 08, 2006.

I never bothered to look at lxml, since I've been quite happy with cElementTree. After reading this post by Ian Bicking, though, I'm going to take a closer look.

The TidyHTMLTreeBuilder from elementtidy, which I use together with cElementTree, is quite nice, but I usually don't want the extra processing that Tidy adds (an XHTML namespace on every tag, a generator meta tag, extra newlines). lxml's HTML parser seems to focus more on what I want: going from HTML to an element tree.

Compare this:

>>> import cElementTree as etree
>>> from elementtidy import TidyHTMLTreeBuilder
>>> from cStringIO import StringIO
>>> html = "<html><head><title>Hello<body><H1>Hi!</h1>Foo<p>Bar<br>baz"
>>> tree = TidyHTMLTreeBuilder.parse(StringIO(html))
>>> etree.tostring(tree.getroot())
'<html:html xmlns:html="http://www.w3.org/1999/xhtml">\n<html:head>\n
<html:meta content="HTML Tidy for Linux/x86 (vers 1st July 2003), see
www.w3.org" name="generator" />\n<html:title>Hello</html:title>\n
</html:head>\n<html:body>\n<html:h1>Hi!</html:h1>\nFoo\n
<html:p>Bar<html:br />\nbaz</html:p>\n</html:body>\n</html:html>'

to this:

>>> from lxml import etree
>>> html = "<html><head><title>Hello<body><H1>Hi!</h1>Foo<p>Bar<br>baz"
>>> etree.tostring(etree.HTML(html))
'<html><head><title>Hello</title></head><body><h1>Hi!</h1>
<p>Foo</p><p>Bar<br/>baz</p></body></html>'
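What comes back from etree.HTML is an ordinary element, so the usual tag, text and tail attributes work on it. Here is a minimal sketch in Python 3 syntax that walks the parsed tree. The outline helper is my own name, not part of lxml, and the markup is a shortened variant of the example above with the title closed, since how an unclosed title gets repaired can depend on the libxml2 version underneath.

```python
from lxml import etree

# Assumption: shortened variant of the markup above, with <title> closed.
html = "<html><head><title>Hello</title></head><body><H1>Hi!</h1><p>Bar<br>baz"
root = etree.HTML(html)


def outline(element, depth=0):
    """Return one indented line per element showing tag, text and tail.

    Hypothetical helper for illustration; not an lxml function.
    """
    lines = ["%s%s text=%r tail=%r" % ("  " * depth, element.tag,
                                       element.text, element.tail)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(root)))
```

The uppercase H1 comes out as a lowercase h1 tag, and the text after the br ends up in that element's tail, just as it would in ElementTree.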

I think this would make some of the parsing I do in my web crawler a bit easier. I'll definitely look into it.
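As a rough idea of what that could look like in a crawler, here is a sketch that pulls the links out of a sloppy page. It is Python 3, the extract_links name and the example.com URLs are made up for illustration, and it is not code from my crawler.

```python
from urllib.parse import urljoin

from lxml import etree


def extract_links(html, base_url):
    """Return absolute URLs for every <a href> found in possibly broken HTML.

    Hypothetical helper for illustration; the name is not from lxml.
    """
    root = etree.HTML(html)
    # Guard: for input with no usable markup, lxml may hand back None
    # instead of an element.
    if root is None:
        return []
    # The HTML parser lowercases tag and attribute names, so one XPath
    # expression matches both <a href> and <A HREF>.
    return [urljoin(base_url, href) for href in root.xpath("//a/@href")]


page = '<body><p><a href="/a">first</a><p><A HREF="b.html">second</A><br>tail'
for link in extract_links(page, "http://example.com/dir/"):
    print(link)
```

No Tidy pass and no namespace prefixes to strip: the XPath query runs directly on the tree the parser built.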