False positive hazards in robotparser #65668

Closed
rhettinger opened this issue May 10, 2014 · 10 comments

@rhettinger
Contributor

BPO 21469
Nosy @smontanaro, @rhettinger, @taleinat, @ethanfurman
Files
  • fix_false_pos.diff: Draft patch -- needs tests
  • fix_false_pos2.diff: Move modified() to parse()
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2014-05-13.05:58:20.393>
    created_at = <Date 2014-05-10.16:55:10.143>
    labels = ['type-bug', 'library']
    title = 'False positive hazards in robotparser'
    updated_at = <Date 2014-05-13.05:58:20.392>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2014-05-13.05:58:20.392>
    actor = 'rhettinger'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2014-05-13.05:58:20.393>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2014-05-10.16:55:10.143>
    creator = 'rhettinger'
    dependencies = []
    files = ['35215', '35216']
    hgrepos = []
    issue_num = 21469
    keywords = ['patch']
    message_count = 10.0
    messages = ['218226', '218284', '218285', '218325', '218356', '218399', '218402', '218403', '218404', '218405']
    nosy_count = 5.0
    nosy_names = ['skip.montanaro', 'rhettinger', 'taleinat', 'ethan.furman', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'test needed'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue21469'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

    @rhettinger
    Contributor Author

    • The can_fetch() method does not check whether read() has been called, so it returns false positives when called before read().

    • When read() is called, it fails to call modified(), so mtime() returns an incorrect result. The user has to call modified() manually to update the mtime().

    >>> from urllib.robotparser import RobotFileParser
    >>> rp = RobotFileParser('http://en.wikipedia.org/robots.txt')
    >>> rp.can_fetch('UbiCrawler', 'http://en.wikipedia.org/index.html')
    True
    >>> rp.read()
    >>> rp.can_fetch('UbiCrawler', 'http://en.wikipedia.org/index.html')
    False
    >>> rp.mtime()
    0
    >>> rp.modified()
    >>> rp.mtime()
    1399740268.628497

    Suggested improvements:

    1. Trigger internal calls to modified() every time the parser is modified using read() or add_entry(). That would ensure that mtime() actually reflects the modification time.

    2. Raise an exception or return False whenever can_fetch() is called and the mtime() is zero (meaning that the parser has not been initialized with any rules).
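The "raise an exception" variant of the second suggestion can be sketched as a subclass. This is only an illustration, not the patch attached to this issue: `StrictRobotFileParser` is a hypothetical name, and the check on `allow_all` / `disallow_all` assumes those CPython implementation attributes, which read() may set for some HTTP errors without loading any rules.

```python
from urllib.robotparser import RobotFileParser


class StrictRobotFileParser(RobotFileParser):
    """Hypothetical subclass: refuse to answer before any rules are loaded."""

    def can_fetch(self, useragent, url):
        # allow_all / disallow_all are implementation attributes of the
        # stdlib class (an assumption of this sketch); if either is set,
        # read() has already reached a decision without parsing rules.
        loaded = self.mtime() or self.allow_all or self.disallow_all
        if not loaded:
            raise RuntimeError("can_fetch() called before read() or parse()")
        return super().can_fetch(useragent, url)


rp = StrictRobotFileParser()
try:
    rp.can_fetch("UbiCrawler", "http://example.com/index.html")
    raised = False
except RuntimeError:
    raised = True

rp.parse(["User-agent: *", "Disallow: /private/"])
rp.modified()  # explicit call so the sketch does not rely on the fix
allowed = rp.can_fetch("UbiCrawler", "http://example.com/index.html")
```

Raising makes the misuse visible immediately, while returning False keeps existing callers working; which trade-off is preferable depends on how much API change is acceptable.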

    @rhettinger added the stdlib (Python modules in the Lib dir) and type-bug (An unexpected behavior, bug, or error) labels on May 10, 2014
    @rhettinger
    Contributor Author

    Attaching a draft patch:

    • Repair the broken link to norobots-rfc.txt.

    • HTTP response codes >= 500 are treated as a failed read rather than as "not found". Not found means that we can assume the entire site is allowed. A 5xx server error tells us nothing.

    • A successful read() updates the mtime (which is defined to be "the time the robots.txt file was last fetched").

    • The can_fetch() method returns False unless we've had a read() with a 2xx or 4xx response. This avoids false positives in the case where a user calls can_fetch() before calling read().
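The status-code policy in the bullets above can be sketched as a small pure function. `classify_robots_response` and its return labels are hypothetical names used only for illustration; the sketch also ignores any special handling the real module may apply to individual 4xx codes.

```python
def classify_robots_response(status):
    """Map the HTTP status of a robots.txt fetch to a decision (hypothetical helper)."""
    if 200 <= status < 300:
        # Rules were fetched: parse them and record the fetch time.
        return "parse"
    if 400 <= status < 500:
        # No robots.txt available: the whole site may be assumed allowed.
        return "allow_all"
    # 5xx (or anything else) is a failed read and tells us nothing,
    # so can_fetch() should keep returning False until a later read succeeds.
    return "unknown"
```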

    @rhettinger self-assigned this on May 11, 2014
    @rhettinger
    Contributor Author

    Updated the patch to move the modified() call to parse(). That lets the mtime update whenever rules are loaded (either by a read() or by directly parsing text).
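A short check of that behaviour, assuming an interpreter that already includes this change. No network access is needed because the rules are passed straight to parse(); the rule lines and URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
before = rp.mtime()  # 0 until any rules have been loaded

# Before any rules are loaded, the fixed parser refuses everything.
unread = rp.can_fetch("UbiCrawler", "http://example.com/index.html")

rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
after = rp.mtime()   # updated by parse() itself, no manual modified() call

public = rp.can_fetch("UbiCrawler", "http://example.com/index.html")
private = rp.can_fetch("UbiCrawler", "http://example.com/private/a.html")
```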

    @rhettinger changed the title from "Hazards in robots.txt parser" to "False positive hazards in robotparser" on May 11, 2014
    @taleinat
    Contributor

    Changes LGTM.

    This module could certainly use some cleanup and updates. For example, last_changed should be a property and always accessed one way (instead of either .mtime() or .last_changed) and should be initialized to None instead of zero to avoid ambiguity, and the and/or trick should be replaced with if/else. Would anyone review such a patch if I created one?
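For readers unfamiliar with the "and/or trick", here is a generic sketch of why a conditional expression is preferred. The function names are hypothetical and the code is not taken from the module.

```python
def label_and_or(flag, yes, no):
    # Old idiom: silently falls through to `no` whenever `yes` is falsy.
    return flag and yes or no


def label_if_else(flag, yes, no):
    # Conditional expression: returns `yes` whenever flag is true.
    return yes if flag else no


# The two forms agree when `yes` is truthy...
same = label_and_or(True, "Allow", "Disallow") == label_if_else(True, "Allow", "Disallow")

# ...but diverge when `yes` is an empty string (or 0, None, []).
old_result = label_and_or(True, "", "Disallow")
new_result = label_if_else(True, "", "Disallow")
```

When both alternatives are non-empty strings the two forms give the same result, so in that situation the replacement is a readability cleanup rather than a behaviour change.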

    @smontanaro
    Contributor

    Can this change be (easily) tested? If so, a test case akin to your original example would be nice.

    @rhettinger
    Contributor Author

    > Changes LGTM.

    Thanks for the review :-)

    > This module could certainly use some cleanup and updates.

    Yes, the API is a mess, but I would like to be very conservative with API modifications (preferably none at all) so we don't break the code of the very few people who ever cared enough to use this module. My goal here was just to fix the risk of false positives.

    > For example, last_changed should be a property and always
    > accessed one way (instead of either .mtime() or .last_changed)
    > and should be initialized to None instead of zero to avoid ambiguity,

    It's too late for fixing the published API. The time for that was when the module was introduced.

    > and the and/or trick should be replaced with if/else.

    Yes, that would be a reasonable minor clean-up that wouldn't affect the API.

    > Would anyone review such a patch if I created one?

    Yes. Just add the one-line patch to this tracker item and I'll incorporate it with the rest.

    FWIW, it is perfectly reasonable to add new well-designed API extensions. You can post patches to the open tracker items for Bug 16099 and Bug 21475.

    @python-dev
    Mannequin

    python-dev mannequin commented May 13, 2014

    New changeset 4ea86cd87f95 by Raymond Hettinger in branch '3.4':
    bpo-21469: Mitigate risk of false positives with robotparser.
    http://hg.python.org/cpython/rev/4ea86cd87f95

    @python-dev
    Mannequin

    python-dev mannequin commented May 13, 2014

    New changeset f67cf5747a26 by Raymond Hettinger in branch '3.4':
    bpo-21469: Add missing news item
    http://hg.python.org/cpython/rev/f67cf5747a26

    @python-dev
    Mannequin

    python-dev mannequin commented May 13, 2014

    New changeset d4fd55278cec by Raymond Hettinger in branch '2.7':
    bpo-21469: Mitigate risk of false positives with robotparser.
    http://hg.python.org/cpython/rev/d4fd55278cec

    @python-dev
    Mannequin

    python-dev mannequin commented May 13, 2014

    New changeset 560320c10564 by Raymond Hettinger in branch 'default':
    bpo-21469: Minor code modernization (convert and/or expression to an if/else expression).
    http://hg.python.org/cpython/rev/560320c10564

    @ezio-melotti transferred this issue from another repository on Apr 10, 2022