Let the robotparser in | infix.se

By Filip Salomonsson; published on May 17, 2006.

I'm writing a little web-crawling robot thingie, and I wanted to Be Good and support robots.txt. So I pulled robotparser out of the standard library hat:

>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://en.wikipedia.org/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://en.wikipedia.org/wiki/Main_Page")
False

Hey, that's weird. Wikipedia is one of the sites I might want to crawl; I've looked at their robots.txt, and I should be allowed access to the main page. I tried with my specific user agent, just to be sure:

>>> rp.can_fetch("Creole/0.1a", "http://en.wikipedia.org/wiki/Main_Page")
False

Still no luck. So what's going on? Next step: look at the parsed entries:

>>> rp.entries
[]

Aha! So something went absolutely boo-boo, and we ended up with no rule entries. If there were simply no robots.txt file, can_fetch would always return True, so the error must have occurred while fetching the file from the server. Fortunately, in such cases, the RobotFileParser makes the HTTP status code available as an attribute, so I could simply have a look at that:

>>> rp.errcode
403

Now I was getting somewhere. In HTTP, 403 means "Forbidden". The server refuses to serve me the file I requested, for whatever reason.

By default, the robot parser fetches robots.txt using urllib's "Python-urllib" user agent string. Wikipedia seems to explicitly deny all access to such user agents (which is probably somewhat sensible), hence the 403 Forbidden. And when fetching robots.txt itself is forbidden, the parser treats the whole site as off limits, which is why can_fetch kept returning False. So I needed the robot parser to use my user agent string.

As a first solution, I simply fetched the robots.txt file myself, and fed it to robotparser's parse method:

>>> import urllib2
>>> headers = {"User-Agent": "Creole/0.1a"}
>>> request = urllib2.Request("http://en.wikipedia.org/robots.txt",
...                           headers=headers)
>>> response = urllib2.urlopen(request)
>>> rp = robotparser.RobotFileParser()
>>> rp.parse(response.readlines())
>>> rp.can_fetch("*", "http://en.wikipedia.org/wiki/Main_Page")
True
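
For readers on Python 3, where robotparser lives on as urllib.robotparser, the same parse-it-yourself idea can be sketched without touching the network. The robots.txt text below is made up for illustration (it is not Wikipedia's), and example.org is a placeholder host:

```python
import urllib.robotparser

# Hypothetical robots.txt content, standing in for a fetched response body.
ROBOTS_TXT = """\
User-agent: Creole
Disallow: /private/

User-agent: *
Disallow: /w/
"""

rp = urllib.robotparser.RobotFileParser()
# parse() takes an iterable of text lines.
rp.parse(ROBOTS_TXT.splitlines())

# The part of the agent string before the slash is matched against
# the User-agent lines, so "Creole/0.1a" picks up the "Creole" entry.
main_page_ok = rp.can_fetch("Creole/0.1a", "http://example.org/wiki/Main_Page")
private_ok = rp.can_fetch("Creole/0.1a", "http://example.org/private/notes")
# An agent without an entry of its own falls back to the "*" rules.
other_ok = rp.can_fetch("SomeOtherBot/1.0", "http://example.org/w/index.php")

print(main_page_ok, private_ok, other_ok)
```

With these made-up rules, only the first of the three URLs is allowed.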

Success! But in a more general setting (fetching from any site), I would have to add error handling. Being a lazy bum, I didn't like the thought of that, so I poked around a bit in robotparser.py to see if there was another way. And sure enough, robotparser fetches the file with its own URLopener class, a subclass of urllib.FancyURLopener. URLopeners have a version attribute, sent as the User-Agent string, that you can override. So the final recipe became:

>>> import robotparser
>>> robotparser.URLopener.version = "Creole/0.1a"
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://en.wikipedia.org/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://en.wikipedia.org/wiki/Main_Page")
True

Now why didn't I think of that from the beginning?
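
The error handling mentioned earlier, for the fetch-it-yourself approach, might look roughly like this in Python 3. This is only a sketch: load_robots and its opener parameter are names invented here (the parameter exists so the function can be tried without a network), and the status code policy copies what robotparser's own read() does, namely that 401 and 403 mean nothing may be fetched, while other 4xx codes mean everything may be.

```python
import urllib.error
import urllib.request
import urllib.robotparser


def load_robots(url, user_agent, opener=urllib.request.urlopen):
    """Return a RobotFileParser for url, fetched with our own User-Agent.

    Hypothetical helper: `opener` is any callable that takes a Request and
    returns a response object with a read() method.
    """
    rp = urllib.robotparser.RobotFileParser(url)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        response = opener(request)
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            # robots.txt itself is refused: treat the whole site as off limits.
            rp.parse(["User-agent: *", "Disallow: /"])
        elif 400 <= err.code < 500:
            # No robots.txt (404 and friends): nothing is disallowed.
            rp.parse([])
        else:
            # Server-side trouble (5xx): let the caller decide what to do.
            raise
        return rp
    # Assumes the file is UTF-8 text; undecodable bytes are replaced.
    text = response.read().decode("utf-8", errors="replace")
    rp.parse(text.splitlines())
    return rp
```

Calling load_robots("http://en.wikipedia.org/robots.txt", "Creole/0.1a") would perform a real request. Network failures (urllib.error.URLError) and 5xx responses are deliberately left to the caller.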