robotparser doesn't return crawl delay for default entry #69586

Closed
pwirtz mannequin opened this issue Oct 14, 2015 · 8 comments
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@pwirtz
Mannequin

pwirtz mannequin commented Oct 14, 2015

BPO 25400
Nosy @berkerpeksag
PRs
  • [Do Not Merge] Convert Misc/NEWS so that it is managed by towncrier #552
Files
  • robotparser_crawl_delay.patch: patch
  • robotparser_crawl_delay_v2.patch
  • issue25400_v2.diff
  • issue25400_v3.diff
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2016-09-18.17:18:17.950>
    created_at = <Date 2015-10-14.01:21:42.690>
    labels = ['3.7', 'type-bug', 'library']
    title = "robotparser doesn't return crawl delay for default entry"
    updated_at = <Date 2017-03-31.16:36:30.705>
    user = 'https://bugs.python.org/pwirtz'

    bugs.python.org fields:

    activity = <Date 2017-03-31.16:36:30.705>
    actor = 'dstufft'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-09-18.17:18:17.950>
    closer = 'berker.peksag'
    components = ['Library (Lib)']
    creation = <Date 2015-10-14.01:21:42.690>
    creator = 'pwirtz'
    dependencies = []
    files = ['40777', '40784', '44739', '44740']
    hgrepos = []
    issue_num = 25400
    keywords = ['patch']
    message_count = 8.0
    messages = ['252971', '252972', '253015', '253016', '253017', '275776', '276897', '276900']
    nosy_count = 3.0
    nosy_names = ['python-dev', 'berker.peksag', 'pwirtz']
    pr_nums = ['552']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue25400'
    versions = ['Python 3.6', 'Python 3.7']

    @pwirtz
    Mannequin Author

    pwirtz mannequin commented Oct 14, 2015

    After changeset http://hg.python.org/lookup/dbed7cacfb7e, calling the crawl_delay method on a robots.txt file that specifies a Crawl-delay for the * user agent always returns None, even though the value is parsed and stored on the default entry.

    Ex:

    Python 3.6.0a0 (default:1aae9b6a6929+, Oct  9 2015, 22:08:05)
    [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib.robotparser
    >>> parser = urllib.robotparser.RobotFileParser()
    >>> parser.set_url('https://www.carthage.edu/robots.txt')
    >>> parser.read()
    >>> parser.crawl_delay('test_robotparser')
    >>> parser.crawl_delay('*')
    >>> print(parser.default_entry.delay)
    120
    >>>

    Excerpt from https://www.carthage.edu/robots.txt:

    User-agent: *
    Crawl-Delay: 120
    Disallow: /cgi-bin

    I have written a patch that solves this. With patch, output is:

    Python 3.6.0a0 (default:1aae9b6a6929+, Oct  9 2015, 22:08:05)
    [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib.robotparser
    >>> parser = urllib.robotparser.RobotFileParser()
    >>> parser.set_url('https://www.carthage.edu/robots.txt')
    >>> parser.read()
    >>> parser.crawl_delay('test_robotparser')
    120
    >>> parser.crawl_delay('*')
    120
    >>> print(parser.default_entry.delay)
    120
    >>>

    This also applies to the request_rate method.
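The report above can be reproduced without network access by feeding the excerpt straight to `RobotFileParser.parse()`. This is a minimal sketch, not the patch itself: the `Request-rate: 3/15` line and the `parse_robots` helper are assumptions added for illustration and are not part of the original robots.txt. On an interpreter that includes the fix, both lookups should fall back to the default (`*`) entry.

```python
import urllib.robotparser

# Excerpt from the report, plus a hypothetical Request-rate line
# (an assumption, added only to exercise request_rate as well).
ROBOTS_TXT = """\
User-agent: *
Crawl-Delay: 120
Request-rate: 3/15
Disallow: /cgi-bin
"""


def parse_robots(text):
    """Hypothetical helper: parse robots.txt content from a string."""
    parser = urllib.robotparser.RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is made.
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(ROBOTS_TXT)

# No entry names this user agent, so the lookup should use the "*" entry.
delay = parser.crawl_delay("test_robotparser")
rate = parser.request_rate("test_robotparser")

print("crawl delay:", delay)
print("request rate:", rate)
```

Before the fix, `delay` would be `None` here even though `parser.default_entry.delay` held the parsed value.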

    @pwirtz pwirtz mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Oct 14, 2015
    @pwirtz
    Mannequin Author

    pwirtz mannequin commented Oct 14, 2015

    This fix breaks the unit tests, though. I am not sure how to go about fixing those, as this would be my first contribution to Python and to an open source project in general.

    @pwirtz
    Mannequin Author

    pwirtz mannequin commented Oct 14, 2015

    On further inspection, it appears that the way the tests are written, a test case can only be checked against one user agent at a time. I will attempt to rework the tests so they work correctly. Any advice would be much appreciated.
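One way to check several user agents against a single robots.txt inside one test case is `unittest`'s `subTest`. The sketch below is only an illustration of that idea, not the actual `test_robotparser` code or the submitted patch; the class name, the `run_tests` helper, and the `AnotherBot` agent are hypothetical.

```python
import unittest
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: *
Crawl-Delay: 120
Disallow: /cgi-bin
"""


class DefaultEntryDelayTest(unittest.TestCase):  # hypothetical name
    def setUp(self):
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.parse(ROBOTS_TXT.splitlines())

    def test_crawl_delay_falls_back_to_default(self):
        # Each agent runs as its own sub-test, so one failure does not
        # hide the results for the remaining agents.
        for agent in ("test_robotparser", "*", "AnotherBot"):
            with self.subTest(agent=agent):
                self.assertEqual(self.parser.crawl_delay(agent), 120)


def run_tests():
    """Hypothetical helper: run the test case and return the result."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(DefaultEntryDelayTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```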

    @berkerpeksag
    Member

    Thanks for the patch, Peter (and welcome to Python and open source development). I have a WIP patch to rewrite test_robotparser in a less magic way, so we can ignore test failures for now. I'll take a closer look at your patch.

    @pwirtz
    Mannequin Author

    pwirtz mannequin commented Oct 14, 2015

    OK, in the meantime I reworked the test so that it appears to test correctly and the tests pass. There does seem to be some magic, so I hope I did not overlook anything. Here is the new patch.

    @berkerpeksag
    Member

    I've now updated Lib/test/test_robotparser.py (bpo-25497). Peter, do you have time to update your patch? Thanks!

    @berkerpeksag
    Member

    Here's an updated patch.

    @berkerpeksag berkerpeksag added the 3.7 (EOL) end of life label Sep 18, 2016
    @python-dev
    Mannequin

    python-dev mannequin commented Sep 18, 2016

    New changeset d5d910cfd288 by Berker Peksag in branch '3.6':
    Issue bpo-25400: RobotFileParser now correctly returns default values for crawl_delay and request_rate
    https://hg.python.org/cpython/rev/d5d910cfd288

    New changeset 911070065e38 by Berker Peksag in branch 'default':
    Issue bpo-25400: Merge from 3.6
    https://hg.python.org/cpython/rev/911070065e38

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022