allow unicode arguments for robotparser.can_fetch #42942
Comments
One-line summary: If the robotparser module encounters …

More detail: I recreated this with Python 2.4.1 on FreeBSD 6 and … AddCharset iso-8859-1 .iso8859-1 …

A suggested solution: …
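The report above is truncated, but the issue title concerns passing non-ASCII arguments to can_fetch. As a hedged sketch of the requested behavior, here is how Python 3's urllib.robotparser (the successor to the 2.x robotparser module) treats a non-ASCII path today, whether passed raw or percent-encoded; the user agent, rules, and paths below are invented for illustration:

```python
from urllib.parse import quote
from urllib.robotparser import RobotFileParser

# Hypothetical rules; parse() accepts a list of lines, so no network
# access is needed.  The raw non-ASCII path mimics the robots.txt
# files discussed in this issue.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /privaté/",
])

# can_fetch() normalizes the URL path (unquote, then quote), so the
# raw and percent-encoded spellings of the same path agree.
print(rp.can_fetch("foobot", "/privaté/page.html"))         # → False
print(rp.can_fetch("foobot", quote("/privaté/page.html")))  # → False
print(rp.can_fetch("foobot", "/public/page.html"))          # → True
```

Note that the normalization happens on the URL passed to can_fetch, not on the rule lines, so robots.txt files that spell paths with raw non-ASCII bytes (as above) are the case this sketch covers.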
Logged In: YES

To me, this is not a bug report but at best an RFE. The …

So I recommend closing this as a bug report but will give …
Logged In: YES

Thanks for looking at this. I have some followup comments.

The list at robotstxt.org is many years stale (note that …)

And by Googling for some common non-ASCII words and letters …

Robots.txt files that contain non-ASCII are few and far between …

Which leads me to a nitpicky (but important!) point about …

You might be interested in a replacement for this module …

Comments & feedback would be most welcome.
Logged In: YES

Turning into a Feature Request.
Logged In: YES

Reassigning to Skip: I don't use robotparser. Skip, perhaps you can have a look? (Didn't you write the …)
Logged In: YES

I've also discovered that robotparser can get confused by a byte order mark (BOM) at the start of robots.txt. In that case, robotparser fails to recognize the string … I created a sample robots.txt file exactly like the one …

>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://semanchuk.com/philip/boneyard/robots/robots.txt.bom")
>>> rp.read()
>>> rp.can_fetch("foobot", "/")  # should return False
True
>>>

My robot parser module doesn't suffer from the BOM bug …
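The BOM confusion described above can be worked around by decoding the robots.txt bytes with the utf-8-sig codec, which drops a leading UTF-8 BOM before parsing. A minimal sketch against Python 3's urllib.robotparser (the byte content below is invented for illustration):

```python
import codecs
from urllib.robotparser import RobotFileParser

# Simulated robots.txt content starting with a UTF-8 BOM (EF BB BF),
# as produced by some Windows editors.
raw = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /\n"

# "utf-8-sig" strips the BOM, so the first line parses as
# "User-agent:" rather than "\ufeffUser-agent:", which the parser
# would otherwise fail to recognize.
text = raw.decode("utf-8-sig")

rp = RobotFileParser()
rp.parse(text.splitlines())
print(rp.can_fetch("foobot", "/"))  # → False: the Disallow rule applies
```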
No comments on this for 4 1/2 years. Is this still valid and/or is anyone still interested?
While Python is 'GPL compatible', whatever that means, it cannot incorporate GPLed code in the PSF distribution. Code must be contributed under one of the two licenses in the contributor agreement. Philip, can you contribute a patch appropriate to 3.x? In 3.x, robotparser is urllib.robotparser.

Under the 'be generous in what you accept' principle, expansion of accepted names would seem to be good.

DOC PATCH NEEDED: The doc says "For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html".
The .../orig.html link now works and was last updated in August.
This has been abandoned for over a decade. Marking as pending and will close it soon unless someone objects.