Issue 402229
This issue tracker has been migrated to GitHub and is currently read-only. For more information, see the GitHub FAQs in the Python Developer's Guide.
Created on 2000-11-02 17:36 by calvin, last changed 2022-04-10 16:03 by admin. This issue is now closed.
Files

File name | Uploaded | Description | Edit
---|---|---|---
None | calvin, 2000-11-02 17:36 | None |
Messages (10)
msg34748 - Author: Bastian Kleineidam (calvin) - Date: 2000-11-02 17:36
msg34749 - Author: Guido van Rossum (gvanrossum) - Date: 2000-11-02 19:14

Skip, can you comment on this?
msg34750 - Author: Guido van Rossum (gvanrossum) - Date: 2001-01-05 02:31

Skip, back to you. Please work with the author on an acceptable version. You can check it in once you two agree.
msg34751 - Author: Guido van Rossum (gvanrossum) - Date: 2001-01-19 22:57

Skip, if this is ok with you, can you check it in? (Unless you feel you don't want to check it in because you still feel your module is better -- in that case we should probably drop it or reassign it...)
msg34752 - Author: Bastian Kleineidam (calvin) - Date: 2000-11-02 17:40

I have written a new RobotParser module 'robotparser2.py'. This module:

o is backward compatible with the old one
o matches user agents correctly (this is buggy in robotparser.py)
o strips comments correctly (this is buggy in robotparser.py)
o uses httplib instead of urllib.urlopen() to catch HTTP connect errors correctly (this is buggy in robotparser.py)
o implements not only the draft at http://info.webcrawler.com/mak/projects/robots/norobots.html but also the new one at http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html

Bastian Kleineidam
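The corrected comment stripping and user-agent matching described above survive in today's urllib.robotparser, the eventual descendant of this module. A minimal sketch in modern Python 3; the robots.txt content, agent names, and URLs are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# A small robots.txt with comments; a correct parser must strip the
# comments and match each agent against the right record.
robots_txt = """\
# site-wide crawl policy
User-agent: BadBot  # comment after a directive
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch('BadBot', 'http://example.com/'))            # False
print(rp.can_fetch('GoodBot', 'http://example.com/'))           # True
print(rp.can_fetch('GoodBot', 'http://example.com/private/x'))  # False
```

Note that agent matching is a case-insensitive substring test, so the record for `BadBot` does not leak onto `GoodBot`.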
msg34753 - Author: Bastian Kleineidam (calvin) - Date: 2001-01-04 23:51

Changes:
- global debug variable in the test function
- redirection now works
- accidentally printed "Allow" when I meant "Disallow"; this has been fixed

It parses the Musi-Cal robots.txt file correctly, but the robots.txt file has syntax errors: before each User-agent: line there has to be one or more empty lines.
msg34754 - Author: Bastian Kleineidam (calvin) - Date: 2001-01-06 12:31

Ok, some new changes:
- allow parsing of User-agent: lines without a preceding blank line
- two licenses available: Python 2.0 license or GPL
- added doc strings for the classes

Bastian
msg34755 - Author: Skip Montanaro (skip.montanaro) - Date: 2001-01-04 21:05

I apologize for taking so long to take a look at this. I was reminded of it when I saw the switch from me to Guido. I spent a little time fiddling with this module today. I'm not satisfied that it works as advertised. Here are a number of problems I found:

* in the test function, the debug variable is not declared global, so setting it to 1 has no effect
* it never seemed to properly handle redirections, so it never got from http://www.musi-cal.com/robots.txt to http://musi-cal.mojam.com/robots.txt
* once I worked around the redirection problem, it seemed to parse the Musi-Cal robots.txt file incorrectly. I replaced httplib with urllib in the read method and got erroneous results.

If you look at the above robots.txt file you'll see that a bunch of email address harvesters are explicitly forbidden (not that they pay attention to robots.txt!). The following should print 0, but prints 1:

    print rp.can_fetch('ExtractorPro', 'http://musi-cal.mojam.com/')

This is (at least in part) due to the fact that the redirection never works. In the version I modified to use urllib, it displays incorrect permissions for things like ExtractorPro:

    User-agent: ExtractorPro
    Allow: /

Note that the lines in the robots.txt file for ExtractorPro are actually:

    User-agent: ExtractorPro
    Disallow: /

Skip
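Skip's expectation can be replayed against the module's modern descendant, urllib.robotparser. The record below is the one he quotes; the parse-from-string setup is a sketch that sidesteps the redirection problem entirely, and a correct parser refuses the harvester:

```python
from urllib.robotparser import RobotFileParser

# The ExtractorPro record Skip quotes from the Musi-Cal robots.txt.
rp = RobotFileParser()
rp.parse("""\
User-agent: ExtractorPro
Disallow: /
""".splitlines())

# Skip's check: this should print 0 (fetch forbidden), not 1.
print(int(rp.can_fetch('ExtractorPro', 'http://musi-cal.mojam.com/')))  # 0
```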
msg34756 - Author: Skip Montanaro (skip.montanaro) - Date: 2001-01-05 01:43

I fixed the robots.txt file, but I think you should parse files without the requisite blank lines (be lenient in what you accept and strict in what you generate). The user-agent line can serve as an implicit separator between one record and the next.

Skip
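The lenient behaviour Skip asks for is what urllib.robotparser, the module's descendant, does today: a User-agent line closes the previous record even without a blank line between them. A small sketch with hypothetical agent names:

```python
from urllib.robotparser import RobotFileParser

# Two records with no blank line between them; the User-agent line
# acts as an implicit record separator, as suggested above.
rp = RobotFileParser()
rp.parse("""\
User-agent: SpamBot
Disallow: /
User-agent: *
Disallow:
""".splitlines())

print(rp.can_fetch('SpamBot', 'http://example.com/'))   # False
print(rp.can_fetch('AnyOther', 'http://example.com/'))  # True
```

An empty Disallow value means "allow everything", so the catch-all record leaves other agents unrestricted.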
msg34757 - Author: Skip Montanaro (skip.montanaro) - Date: 2001-01-20 16:03

Checked in and closed. Thanks Bastian!
History

Date | User | Action | Args
---|---|---|---
2022-04-10 16:03:28 | admin | set | github: 33441
2000-11-02 17:36:26 | calvin | create |