urllib.robotparser: incomplete __str__ methods #77042

Closed
michael-lazar mannequin opened this issue Feb 17, 2018 · 7 comments
Labels
3.7 (EOL): end of life · 3.8: only security fixes · stdlib: Python modules in the Lib dir · type-bug: An unexpected behavior, bug, or error

Comments

@michael-lazar
Mannequin

michael-lazar mannequin commented Feb 17, 2018

BPO 32861
Nosy @orsenthil, @serhiy-storchaka, @michael-lazar
PRs
  • bpo-32861: urllib.robotparser fix incomplete __str__ methods. #5711
  • [3.7] bpo-32861: urllib.robotparser fix incomplete __str__ methods. (GH-5711) #6795
  • [3.6] bpo-32861: urllib.robotparser fix incomplete __str__ methods. (GH-5711) #6796
  • [3.6] [3.7] bpo-32861: urllib.robotparser fix incomplete __str__ methods. (GH-5711) (GH-6795) #6815
  • [2.7] bpo-32861: urllib.robotparser fix incomplete __str__ methods. (GH-5711) (GH-6795) #6817
  • [3.6] bpo-32861: urllib.robotparser fix incomplete __str__ methods. (GH-5711) (GH-6795) #6818
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2018-05-14.22:10:37.606>
    created_at = <Date 2018-02-17.00:52:46.756>
    labels = ['3.7', '3.8', 'type-bug', 'library']
    title = 'urllib.robotparser: incomplete __str__ methods'
    updated_at = <Date 2018-05-14.22:10:37.606>
    user = 'https://github.com/michael-lazar'

    bugs.python.org fields:

    activity = <Date 2018-05-14.22:10:37.606>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-05-14.22:10:37.606>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2018-02-17.00:52:46.756>
    creator = 'michael-lazar'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 32861
    keywords = ['patch']
    message_count = 7.0
    messages = ['312259', '314527', '314821', '316505', '316546', '316589', '316591']
    nosy_count = 3.0
    nosy_names = ['orsenthil', 'serhiy.storchaka', 'michael-lazar']
    pr_nums = ['5711', '6795', '6796', '6815', '6817', '6818']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue32861'
    versions = ['Python 2.7', 'Python 3.6', 'Python 3.7', 'Python 3.8']


    Hello,

    I have stumbled upon a couple of inconsistencies in urllib.robotparser's __str__ methods.

    These appear to be unintentional omissions; basically the code was modified but the string methods were never updated.

    1. The RobotFileParser.__str__ method doesn't include the default (*) User-agent entry.
        >>> from urllib.robotparser import RobotFileParser
        >>> parser = RobotFileParser()
        >>> text = """
        ... User-agent: *
        ... Allow: /some/path
        ... Disallow: /another/path
        ...
        ... User-agent: Googlebot
        ... Allow: /folder1/myfile.html
        ... """
        >>> parser.parse(text.splitlines())
        >>> print(parser)
        User-agent: Googlebot
        Allow: /folder1/myfile.html
        
        
        >>>

    This is *especially* awkward when parsing a valid robots.txt that only contains a wildcard User-agent.

        >>> from urllib.robotparser import RobotFileParser
        >>> parser = RobotFileParser()
        >>> text = """
        ... User-agent: *
        ... Allow: /some/path
        ... Disallow: /another/path
        ... """
        >>> parser.parse(text.splitlines())
        >>> print(parser)
        
        
        >>>
    2. Support was recently added for Crawl-delay and Request-Rate lines, but __str__ does not include these.
        >>> from urllib.robotparser import RobotFileParser
        >>> parser = RobotFileParser()
        >>> text = """
        ... User-agent: figtree
        ... Crawl-delay: 3
        ... Request-rate: 9/30
        ... Disallow: /tmp
        ... """
        >>> parser.parse(text.splitlines())
        >>> print(parser)
        User-agent: figtree
        Disallow: /tmp
        >>>

    3. Two unnecessary trailing newlines are appended to the string output (one for the last RuleLine and one for the last Entry).

      (see above examples)

    Taken on their own these are all minor issues, but they do make things quite confusing when using robotparser from the REPL!
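The three reports above can be folded into one small script. This is a minimal sketch, not part of the original report: the `render` helper and the `checks` dictionary are names introduced here for illustration, and the results depend on the Python version, since releases that include the fix behave differently from the ones described in this report.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that exercises all three reported problems at once:
# a wildcard entry, Crawl-delay, and Request-rate.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 3
Request-rate: 9/30
Disallow: /tmp
"""


def render(text):
    """Parse robots.txt text and return str() of the parser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return str(parser)


output = render(ROBOTS_TXT)

# Each check corresponds to one item in the report.
checks = {
    "default entry shown": "User-agent: *" in output,
    "Crawl-delay shown": "Crawl-delay: 3" in output,
    "Request-rate shown": "Request-rate: 9/30" in output,
    "no trailing newlines": not output.endswith("\n"),
}

for name, passed in checks.items():
    print(f"{name}: {'ok' if passed else 'MISSING'}")
```

On an interpreter without the fix, the first three checks would be expected to report MISSING, matching the REPL sessions above.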

    @michael-lazar michael-lazar mannequin added 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Feb 17, 2018
    @serhiy-storchaka
    Member

    The default entry was moved out of the entries list in bpo-523041, but RobotFileParser.__str__ was not updated. Support for "Crawl-delay" and "Request-Rate" was added in bpo-16099, but Entry.__str__ was not updated. Both look like bugs to me, and I think the fix should be backported.

    But two unnecessary trailing newlines should be kept for compatibility in maintained versions. I think we can get rid of them in 3.8 (unless Senthil has another opinion).
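The behaviour discussed here can be sketched as a pair of standalone formatting functions. This is an illustrative sketch only, not the patch from the linked PRs: `format_entry`, `format_parser`, and the `legacy_trailing_newlines` flag are hypothetical names, and the code reads the parser's `entries`, `default_entry`, `useragents`, `delay`, `req_rate`, and `rulelines` attributes, which exist in CPython's implementation but are not documented public API.

```python
from urllib.robotparser import RobotFileParser


def format_entry(entry):
    """Render one entry, including Crawl-delay and Request-rate (hypothetical helper)."""
    lines = [f"User-agent: {agent}" for agent in entry.useragents]
    if entry.delay is not None:
        lines.append(f"Crawl-delay: {entry.delay}")
    if entry.req_rate is not None:
        rate = entry.req_rate
        lines.append(f"Request-rate: {rate.requests}/{rate.seconds}")
    for rule in entry.rulelines:
        directive = "Allow" if rule.allowance else "Disallow"
        lines.append(f"{directive}: {rule.path}")
    return "\n".join(lines)


def format_parser(parser, legacy_trailing_newlines=False):
    """Render all entries, with the default (*) entry last (hypothetical helper).

    legacy_trailing_newlines=True mimics the two trailing newlines that
    maintained versions keep for compatibility.
    """
    entries = list(parser.entries)
    if parser.default_entry is not None:
        entries.append(parser.default_entry)
    text = "\n\n".join(format_entry(entry) for entry in entries)
    if legacy_trailing_newlines and text:
        text += "\n\n"
    return text


parser = RobotFileParser()
parser.parse("""
User-agent: *
Allow: /some/path
Disallow: /another/path

User-agent: Googlebot
Allow: /folder1/myfile.html
""".splitlines())

print(format_parser(parser))
```

Keeping the trailing newlines behind a flag mirrors the split proposed above: identical content on every branch, with the whitespace change limited to the newest version.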

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Mar 27, 2018
    @orsenthil
    Member

    > But two unnecessary trailing newlines should be kept for compatibility in maintained versions.

    Yup, that sounds good to me. It doesn't seem to be an RFC requirement. It's only kept for compatibility, and we can do away with it in 3.8.

    @serhiy-storchaka
    Member

    New changeset bd08a0a by Serhiy Storchaka (Michael Lazar) in branch 'master':
    bpo-32861: urllib.robotparser fix incomplete __str__ methods. (GH-5711)
    bd08a0a

    @serhiy-storchaka
    Member

    New changeset c3fa1f2 by Serhiy Storchaka (Miss Islington (bot)) in branch '3.7':
    [3.7] bpo-32861: urllib.robotparser fix incomplete __str__ methods. (GH-5711) (GH-6795)
    c3fa1f2

    @serhiy-storchaka
    Member

    New changeset 3936fd7 by Serhiy Storchaka (Miss Islington (bot)) in branch '3.6':
    [3.7] bpo-32861: urllib.robotparser fix incomplete __str__ methods. (GH-5711) (GH-6795) (GH-6818)
    3936fd7

    @serhiy-storchaka
    Member

    New changeset 861d384 by Serhiy Storchaka in branch '2.7':
    [2.7] bpo-32861: robotparser fix incomplete __str__ methods. (GH-5711) (GH-6795) (GH-6817)
    861d384

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022