Message 187560 - Python tracker

Author	lukasz.langa
Recipients	acooke, benmezger, ezio.melotti, lukasz.langa, mher, r.david.murray
Date	2013-04-22.13:16:23
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1366636584.55.0.57701637202.issue17403@psf.upfronthosting.co.za>
In-reply-to

Content
robotparser implements http://www.robotstxt.org/orig.html; there's even a link to that document at http://docs.python.org/3/library/urllib.robotparser.html. As mher points out, there's a newer version of that spec, written up as an RFC: http://www.robotstxt.org/norobots-rfc.txt. It introduces Allow and specifies how percent-encoding should be treated and how to handle expiration.
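
To make the Allow question concrete, here is a minimal sketch, not a proposed patch. It feeds a small hand-written robots.txt to urllib.robotparser and compares the result with a longest-prefix rule of the kind the search engine extensions describe. The robots.txt content, the "ExampleBot" agent, the example.com URLs and the longest_match_allowed helper are all made up for illustration. The sketch assumes a current Python 3, where RobotFileParser.parse() accepts an iterable of lines.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the Allow line is listed before the broader
# Disallow line, so a parser that takes the first matching rule in file
# order and a parser that takes the longest matching rule should agree.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/public.html
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def longest_match_allowed(rules, path):
    """Hypothetical helper: the longest matching prefix wins, Allow wins ties.

    `rules` is a list of (directive, prefix) pairs in any order.
    A path that matches no rule is allowed.
    """
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed


parser = build_parser(ROBOTS_TXT)
stdlib_public = parser.can_fetch("ExampleBot", "http://example.com/private/public.html")
stdlib_secret = parser.can_fetch("ExampleBot", "http://example.com/private/secret.html")

# Same rules in the opposite order: with longest-match semantics the
# order in the file should not matter.
rules = [("disallow", "/private/"), ("allow", "/private/public.html")]
longest_public = longest_match_allowed(rules, "/private/public.html")
longest_secret = longest_match_allowed(rules, "/private/secret.html")
```

If the two lines in ROBOTS_TXT were swapped, a parser that stops at the first matching rule would hit Disallow: /private/ first, while the longest-match helper would still allow /private/public.html. That ordering dependence is the kind of behaviour difference we would have to weigh against backwards compatibility.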

Moreover, there is a de facto standard agreed on by Google, Yahoo and Microsoft in 2008, documented in their respective blog posts:

http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html

http://www.ysearchblog.com/2008/06/03/one-standard-fits-all-robots-exclusion-protocol-for-yahoo-google-and-microsoft/

http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2008/06/03/robots-exclusion-protocol-joining-together-to-provide-better-documentation.aspx

For reference, there are two third-party robots.txt parsers out there implementing these extensions:

- https://pypi.python.org/pypi/reppy
- https://pypi.python.org/pypi/robotexclusionrulesparser

We need to decide how to incorporate those new features while maintaining backwards compatibility.

History
Date	User	Action	Args
2013-04-22 13:16:24	lukasz.langa	set	recipients: + lukasz.langa, ezio.melotti, acooke, r.david.murray, mher, benmezger
2013-04-22 13:16:24	lukasz.langa	set	messageid: <1366636584.55.0.57701637202.issue17403@psf.upfronthosting.co.za>
2013-04-22 13:16:24	lukasz.langa	link	issue17403 messages
2013-04-22 13:16:23	lukasz.langa	create