Let the robotparser in
I'm writing a little web-crawling robot thingie, and I wanted to Be Good and support robots.txt. So I pulled robotparser out of the standard library hat:
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://en.wikipedia.org/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://en.wikipedia.org/wiki/Main_Page")
False
Hey, that's weird. Wikipedia is one of the sites I might want to crawl; I've looked at their robots.txt, and I should be allowed access to the main page. I tried with my specific user agent, just to be sure:
>>> rp.can_fetch("Creole/0.1a", http://en.wikipedia.org/wiki/Main_Page")
False
Still no luck. So what's going on? Next step: look at the parsed entries:
>>> rp.entries
[]
Aha! So something went absolutely boo-boo, and we ended up with no rule entries. If there is no robots.txt file, can_fetch will always return True, so the error must occur while fetching the file from the server. Fortunately, in such cases, the RobotFileParser makes the error code available as an attribute, so I could simply have a look at that:
>>> rp.errcode
403
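(As an aside, the "will always return True" claim is easy to check: a parser that has been given no rules at all allows everything, so the False really had to come from the fetch itself. The example.com URL below is just for illustration.)
>>> blank = robotparser.RobotFileParser()
>>> blank.parse([])
>>> blank.can_fetch("*", "http://example.com/some/page")
True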
Now I was getting somewhere. In HTTP, 403 means "Forbidden". The server refuses to serve me the file I requested, for whatever reason.
The robot parser fetches robots.txt using urllib's default "Python-urllib" user agent string. Wikipedia seems to explicitly deny all access to such user agents (which is probably somewhat sensible); hence the 403 Forbidden. So I needed the robot parser to use my own user agent string.
As a first solution, I simply fetched the robots.txt file myself, and fed it to robotparser's parse method:
>>> import urllib2
>>> headers = {"User-Agent": "Creole/0.1a"}
>>> request = urllib2.Request("http://en.wikipedia.org/robots.txt",
... headers=headers)
>>> response = urllib2.urlopen(request)
>>> rp = robotparser.RobotFileParser()
>>> rp.parse(response.readlines())
>>> rp.can_fetch("*", "http://en.wikipedia.org/wiki/Main_Page")
True
Success! But in a more generic setting (fetching from any site), I would have to add error handling; there's a sketch of what that might look like at the end of this post. Being a lazy bum, I didn't like the thought of that, so I poked around a bit in robotparser.py to see if there was another way. And sure enough, robotparser uses a subclass of urllib.FancyURLopener, and URLopeners have a version attribute that you can override. So the final recipe became:
>>> import robotparser
>>> robotparser.URLopener.version = "Creole/0.1a"
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://en.wikipedia.org/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://en.wikipedia.org/wiki/Main_Page")
True
Now why didn't I think of that from the beginning?
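For completeness, here's roughly what the error handling I was dodging might look like. It's only a sketch: the fallback choices (deny everything on 401/403, allow everything on other failures) are my own guesses at reasonable behaviour, though they seem to match what read() itself does, judging from robotparser.py.
import urllib2
import robotparser

def fetch_robot_rules(url, agent="Creole/0.1a"):
    # Fetch robots.txt with our own user agent string and return a
    # ready-to-use RobotFileParser.
    rp = robotparser.RobotFileParser()
    try:
        request = urllib2.Request(url, headers={"User-Agent": agent})
        response = urllib2.urlopen(request)
        rp.parse(response.readlines())
    except urllib2.HTTPError, e:
        if e.code in (401, 403):
            # access to robots.txt itself is denied: assume we are
            # not welcome anywhere on this site
            rp.parse(["User-agent: *", "Disallow: /"])
        # other HTTP errors (e.g. 404): no usable robots.txt, so the
        # parser is left empty, which means "allow everything"
    except urllib2.URLError:
        # network trouble: also fall back to "allow everything"
        pass
    return rp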