
robotparser crawl_delay and request_rate do not work with no matching entry #80103

Closed
jsm28 mannequin opened this issue Feb 6, 2019 · 8 comments
Labels
3.7 (EOL, end of life), 3.8 (only security fixes), 3.9 (only security fixes), stdlib (Python modules in the Lib dir), type-bug (an unexpected behavior, bug, or error)

Comments

@jsm28 (Mannequin) commented Feb 6, 2019

BPO 35922
Nosy @gvanrossum, @orsenthil, @taleinat, @miss-islington, @remilapeyre, @jsm28
PRs
  • bpo-35922: Fix RobotFileParser when robots.txt is invalid #11791
  • [3.8] bpo-35922: Fix RobotFileParser when robots.txt has no relevant crawl delay or request rate (GH-11791) #14121
  • [3.7] bpo-35922: Fix RobotFileParser when robots.txt has no relevant crawl delay or request rate (GH-11791) #14122
    Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2019-06-16.07:14:52.654>
    created_at = <Date 2019-02-06.20:33:03.567>
    labels = ['3.7', '3.8', 'type-bug', 'library', '3.9']
    title = 'robotparser crawl_delay and request_rate do not work with no matching entry'
    updated_at = <Date 2019-06-16.07:14:52.653>
    user = 'https://github.com/jsm28'

    bugs.python.org fields:

    activity = <Date 2019-06-16.07:14:52.653>
    actor = 'taleinat'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-06-16.07:14:52.654>
    closer = 'taleinat'
    components = ['Library (Lib)']
    creation = <Date 2019-02-06.20:33:03.567>
    creator = 'joseph_myers'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 35922
    keywords = ['patch']
    message_count = 8.0
    messages = ['334982', '335093', '345251', '345296', '345730', '345732', '345733', '345734']
    nosy_count = 6.0
    nosy_names = ['gvanrossum', 'orsenthil', 'taleinat', 'miss-islington', 'remi.lapeyre', 'joseph_myers']
    pr_nums = ['11791', '14121', '14122']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue35922'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @jsm28 (Mannequin, Author) commented Feb 6, 2019

    RobotFileParser.crawl_delay and RobotFileParser.request_rate raise AttributeError for a robots.txt with no entry matching the given user-agent and no default entry, rather than returning None as the documentation says they should. For example:

    >>> from urllib.robotparser import RobotFileParser
    >>> parser = RobotFileParser()
    >>> parser.parse([])
    >>> parser.crawl_delay('example')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.6/urllib/robotparser.py", line 182, in crawl_delay
        return self.default_entry.delay
    AttributeError: 'NoneType' object has no attribute 'delay'
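
    A guard along these lines would make both methods return None when neither a matching entry nor a default entry exists. This is only an illustrative sketch of the approach (written as it would sit inside urllib.robotparser.RobotFileParser), not necessarily the exact change made in GH-11791:

    # Illustrative sketch only; not necessarily the exact patch from GH-11791.
    def crawl_delay(self, useragent):
        if not self.mtime():
            return None
        # Prefer an entry that explicitly applies to this user agent.
        for entry in self.entries:
            if entry.applies_to(useragent):
                return entry.delay
        # Fall back to the default ("*") entry only if one was actually parsed.
        if self.default_entry:
            return self.default_entry.delay
        return None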

    jsm28 (mannequin) added the 3.7 (EOL), 3.8, stdlib, and type-bug labels on Feb 6, 2019
    @remilapeyre (Mannequin) commented Feb 8, 2019

    Thanks for your report, Joseph; I opened a new PR to fix this.

    taleinat added the 3.9 (only security fixes) label on Jun 9, 2019
    @taleinat (Contributor)

    The PR is looking good; I'll likely merge it soon.

    I'm quite sure this should go into 3.8, but should it be backported to 3.7? This is certainly a bugfix, but still a slight change of behavior, so perhaps we should avoid changing this in 3.7?

    @gvanrossum (Member)

    Yes, this looks like a bugfix. Who wants an AttributeError? :-)

    @taleinat (Contributor)

    New changeset 8047e0e by Tal Einat (Rémi Lapeyre) in branch 'master':
    bpo-35922: Fix RobotFileParser when robots.txt has no relevant crawl delay or request rate (GH-11791)

    @miss-islington (Contributor)

    New changeset 58a1a76 by Miss Islington (bot) in branch '3.8':
    bpo-35922: Fix RobotFileParser when robots.txt has no relevant crawl delay or request rate (GH-11791)

    @miss-islington (Contributor)

    New changeset 45d6547 by Miss Islington (bot) in branch '3.7':
    bpo-35922: Fix RobotFileParser when robots.txt has no relevant crawl delay or request rate (GH-11791)
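
    With the fix applied, the session from the original report should return None instead of raising, matching the documented behavior (expected output shown for illustration):

    >>> from urllib.robotparser import RobotFileParser
    >>> parser = RobotFileParser()
    >>> parser.parse([])
    >>> parser.crawl_delay('example') is None
    True
    >>> parser.request_rate('example') is None
    True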

    @taleinat (Contributor)

    Rémi, thanks for the great work writing the PR and quickly going through several iterations of reviews and revisions!

    ezio-melotti transferred this issue from another repository on Apr 10, 2022