
robotparser doesn't support request rate and crawl delay parameters #60303

Closed
XapaJIaMnu mannequin opened this issue Oct 1, 2012 · 17 comments
Assignees: berkerpeksag
Labels: easy; stdlib (Python modules in the Lib dir); type-feature (A feature request or enhancement)

Comments

XapaJIaMnu mannequin commented Oct 1, 2012

BPO 16099
Nosy @rhettinger, @orsenthil, @tiran, @berkerpeksag, @hynek
Files
  • robotparser.patch: patch for robotparser.py
  • robotparser.patch: same patch for python3X
  • robotparser.patch: Changes + test cases + documentation
  • robotparser_reformatted.patch: Changes, test cases, documentation, reformatted
  • robotparser_v2.patch: V2 with fixes
  • robotparser_v3.patch: V3 crawl delay and request rate patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/berkerpeksag'
    closed_at = <Date 2015-10-08.09:34:21.553>
    created_at = <Date 2012-10-01.12:58:25.141>
    labels = ['easy', 'type-feature', 'library']
    title = "robotparser doesn't support request rate and crawl delay parameters"
    updated_at = <Date 2015-10-08.09:34:21.551>
    user = 'https://bugs.python.org/XapaJIaMnu'

    bugs.python.org fields:

    activity = <Date 2015-10-08.09:34:21.551>
    actor = 'berker.peksag'
    assignee = 'berker.peksag'
    closed = True
    closed_date = <Date 2015-10-08.09:34:21.553>
    closer = 'berker.peksag'
    components = ['Library (Lib)']
    creation = <Date 2012-10-01.12:58:25.141>
    creator = 'XapaJIaMnu'
    dependencies = []
    files = ['27373', '27374', '27476', '27477', '33071', '35377']
    hgrepos = []
    issue_num = 16099
    keywords = ['patch', 'easy', 'needs review']
    message_count = 17.0
    messages = ['171711', '171712', '171715', '171719', '172327', '172338', '205567', '205641', '205755', '205761', '208721', '219212', '223099', '225916', '252483', '252521', '252525']
    nosy_count = 7.0
    nosy_names = ['rhettinger', 'orsenthil', 'christian.heimes', 'python-dev', 'berker.peksag', 'hynek', 'XapaJIaMnu']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue16099'
    versions = ['Python 3.6']

    XapaJIaMnu mannequin commented Oct 1, 2012

    Robotparser doesn't support two quite important optional parameters from the robots.txt file. I have implemented them in the following way
    (robotparser is initialized in the usual way):

    rp = robotparser.RobotFileParser()
    rp.set_url(..)
    rp.read()

    crawl_delay(useragent) - returns the time in seconds to wait between requests;
    if no delay is specified, or it doesn't apply to this user agent, returns -1.
    request_rate(useragent) - returns a list of the form [requests, seconds];
    if no rate is specified, or it doesn't apply to this user agent, returns -1.
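
    A minimal usage sketch of the interface described above (the robots.txt URL and the user agent name are hypothetical placeholders):

```python
import urllib.robotparser  # the module is named "robotparser" on Python 2

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # hypothetical URL
rp.read()

# Seconds to wait between requests, from the Crawl-delay directive (if any).
delay = rp.crawl_delay("mybot")

# Request-rate directive as a (requests, seconds) pair (if any).
rate = rp.request_rate("mybot")
```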

    XapaJIaMnu mannequin added the stdlib (Python modules in the Lib dir) and type-feature (A feature request or enhancement) labels on Oct 1, 2012

    tiran commented Oct 1, 2012

    Thanks for the patch. New features must be implemented in Python 3.4. Python 2.7 is in feature freeze mode and therefore doesn't get new features.

    XapaJIaMnu mannequin commented Oct 1, 2012

    Okay, sorry, I didn't know that (:
    Here's the same patch (same functionality) for Python 3.

    Feedback is welcome, as always (:

    tiran commented Oct 1, 2012

    We have a team that mentors new contributors. If you are interested in getting your patch into Python 3.4, please read http://pythonmentors.com/ . The people are really friendly and will help you with every step of the process.

    tiran added the easy label on Oct 1, 2012

    XapaJIaMnu mannequin commented Oct 7, 2012

    Okay, here's a proper patch with a documentation entry and test cases.
    Please review and comment.

    XapaJIaMnu mannequin commented Oct 7, 2012

    Reformatted patch

    XapaJIaMnu mannequin commented Dec 8, 2013

    Hey,
    it has been more than a year since the last activity.
    Is there anything else I should do to get someone from the Python dev team to review my changes and perhaps give some feedback?

    Nick

    @berkerpeksag

    I left a few comments on Rietveld.

    XapaJIaMnu mannequin commented Dec 10, 2013

    Thank you for the review!
    I have addressed your comments and released a v2 of the patch.
    Highlights:
    No longer crashes when given a malformed crawl-delay parameter or a malformed robots.txt.
    Returns None when the parameter is missing or its syntax is invalid.
    Simplified several functions.
    Extended tests.
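
    A small sketch of the v2 behaviour described above, with made-up robots.txt contents and user agent names (the request rate comes back as a (requests, seconds) value):

```python
import urllib.robotparser

robots_txt = """\
User-agent: testbot
Crawl-delay: 3
Request-rate: 9/30
Disallow: /private/

User-agent: brokenbot
Crawl-delay: not-a-number
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.modified()  # mark the data as loaded (harmless if the parser already does this)
rp.parse(robots_txt.splitlines())

print(rp.crawl_delay("testbot"))    # 3
print(rp.request_rate("testbot"))   # a (requests=9, seconds=30) value
print(rp.crawl_delay("brokenbot"))  # None: the malformed value is ignored, no crash
```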

    http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser.rst
    File Doc/library/urllib.robotparser.rst (right):

    http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser....
    Doc/library/urllib.robotparser.rst:56: .. method:: crawl_delay(useragent)
    On 2013/12/09 03:30:54, berkerpeksag wrote:

    Is crawl_delay used by search engines? Google recommends setting the crawl speed
    via Google Webmaster Tools instead.

    See https://support.google.com/webmasters/answer/48620?hl=en.

    Crawl delay and request rate parameters are targeted at the custom crawlers that many people and companies write for specific tasks. Google Webmaster Tools applies only to Google's crawler, and web admins typically set different rates for Google/Yahoo/Bing and for all other user agents.
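
    For illustration, a custom crawler of that kind might honour these values roughly as in the sketch below (the site URLs and user agent name are hypothetical, and the None/-1 handling follows the conventions described earlier in this thread):

```python
import time
import urllib.robotparser

USER_AGENT = "mycrawler"  # hypothetical custom crawler name

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # hypothetical site
rp.read()

delay = rp.crawl_delay(USER_AGENT)
if delay is None or delay < 0:  # nothing specified (None, or -1 in the early patch)
    delay = 1                   # arbitrary polite default

for url in ("http://www.example.com/a", "http://www.example.com/b"):
    if rp.can_fetch(USER_AGENT, url):
        # ... fetch and process the page here ...
        time.sleep(delay)       # honour the crawl delay between requests
```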

    http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py
    File Lib/urllib/robotparser.py (right):

    http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py#newco...
    Lib/urllib/robotparser.py:168: for entry in self.entries:
    On 2013/12/09 03:30:54, berkerpeksag wrote:

    Is there a better way to calculate this? (perhaps O(1)?)

    I have followed the model of the existing code. An O(1) implementation (probably based on dictionaries) would require a complete rewrite of this library, as all previously implemented functions use the

    for entry in self.entries:
        if entry.applies_to(useragent):

    logic. I don't think it matters much here, as these two functions only need to be called once per domain, and a robots.txt seldom contains more than 3 entries. That is why I have followed the design laid out by the original developer.
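
    For context, a purely hypothetical dict-based index might look like the sketch below (it assumes the existing Entry objects and their useragents lists); exact-name keys give O(1) average lookups, but robots.txt user-agent matching is substring-based with a '*' fallback, which is part of why the linear scan is a reasonable fit:

```python
# Hypothetical sketch only -- not the approach used in the patch or the stdlib.
class EntryIndex:
    def __init__(self, entries):
        self.by_agent = {}   # lower-cased exact user agent name -> entry
        self.default = None  # entry for '*'
        for entry in entries:
            for agent in entry.useragents:
                if agent == "*":
                    self.default = entry
                else:
                    self.by_agent[agent.lower()] = entry

    def lookup(self, useragent):
        # O(1) on average, but only handles exact (case-insensitive) names,
        # unlike Entry.applies_to(), which does substring matching.
        return self.by_agent.get(useragent.split("/")[0].lower(), self.default)
```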

    Thanks

    Nick

    XapaJIaMnu mannequin commented Dec 10, 2013

    Oh... sorry for the spam. Could you please verify my documentation link syntax? I'm not entirely sure I got it right.

    XapaJIaMnu mannequin commented Jan 21, 2014

    Hey,

    Just a friendly reminder that there hasn't been any activity for a month and I have released v2, which is pending review (:

    rhettinger assigned rhettinger and then unassigned rhettinger on May 12, 2014

    XapaJIaMnu mannequin commented May 27, 2014

    Updated patch with all comments addressed; sorry for the six-month delay. Please review.

    berkerpeksag self-assigned this on Jul 7, 2014

    XapaJIaMnu mannequin commented Jul 15, 2014

    Hey,

    Just a friendly reminder that there has been no activity for a month and a half and v3 is pending review (:

    XapaJIaMnu mannequin commented Aug 26, 2014

    Hey,

    Just a friendly reminder that the patch is pending review and there has been no activity for 3 months (:

    XapaJIaMnu mannequin commented Oct 7, 2015

    Hey,

    Friendly reminder that there has been no activity on this issue for more than a year.

    Cheers,

    Nick

    python-dev mannequin commented Oct 8, 2015

    New changeset dbed7cacfb7e by Berker Peksag in branch 'default':
    Issue bpo-16099: RobotFileParser now supports Crawl-delay and Request-rate
    https://hg.python.org/cpython/rev/dbed7cacfb7e

    @berkerpeksag

    I've finally committed your patch to default. Thank you for not giving up, Nikolay :)

    Note that currently the link in the example section doesn't work. I will open a new issue for that.

    ezio-melotti transferred this issue from another repository on Apr 10, 2022