classification
Title: urllib.robotparser: incomplete __str__ methods
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.8, Python 3.7, Python 3.6, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: michael-lazar, orsenthil, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2018-02-17 00:52 by michael-lazar, last changed 2018-05-14 22:10 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 5711 merged michael-lazar, 2018-02-17 01:56
PR 6795 merged miss-islington, 2018-05-14 14:11
PR 6796 closed miss-islington, 2018-05-14 14:13
PR 6815 closed miss-islington, 2018-05-14 18:14
PR 6817 merged serhiy.storchaka, 2018-05-14 18:35
PR 6818 merged miss-islington, 2018-05-14 18:44
Messages (7)
msg312259 - (view) Author: Michael Lazar (michael-lazar) * Date: 2018-02-17 00:52
Hello,

I have stumbled upon a couple of inconsistencies in urllib.robotparser's __str__ methods.

These appear to be unintentional omissions; basically the code was modified but the string methods were never updated.

1. The RobotFileParser.__str__ method doesn't include the default (*) User-agent entry.

    >>> from urllib.robotparser import RobotFileParser
    >>> parser = RobotFileParser()
    >>> text = """
    ... User-agent: *
    ... Allow: /some/path
    ... Disallow: /another/path
    ...
    ... User-agent: Googlebot
    ... Allow: /folder1/myfile.html
    ... """
    >>> parser.parse(text.splitlines())
    >>> print(parser)
    User-agent: Googlebot
    Allow: /folder1/myfile.html
    
    
    >>>

This is *especially* awkward when parsing a valid robots.txt that only contains a wildcard User-agent.

    >>> from urllib.robotparser import RobotFileParser
    >>> parser = RobotFileParser()
    >>> text = """
    ... User-agent: *
    ... Allow: /some/path
    ... Disallow: /another/path
    ... """
    >>> parser.parse(text.splitlines())
    >>> print(parser)
    
    
    >>>


2. Support was recently added for `Crawl-delay` and `Request-Rate` lines, but __str__ does not include these.

    >>> from urllib.robotparser import RobotFileParser
    >>> parser = RobotFileParser()
    >>> text = """
    ... User-agent: figtree
    ... Crawl-delay: 3
    ... Request-rate: 9/30
    ... Disallow: /tmp
    ... """
    >>> parser.parse(text.splitlines())
    >>> print(parser)
    User-agent: figtree
    Disallow: /tmp


    >>>

3. Two unnecessary trailing newlines are being appended to the string output (one for the last RuleLine and one for the last Entry)

    (see above examples)


Taken on their own these are all minor issues, but they do make things quite confusing when using robotparser from the REPL!
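
While the string output is incomplete, the public query methods are a more reliable way to confirm what the parser actually holds. The following is a minimal sketch using can_fetch(), crawl_delay() and request_rate() (the latter two are available from Python 3.6 on). The agent name "SomeBot" is only a placeholder and does not come from the report.

```python
from urllib.robotparser import RobotFileParser

wildcard = RobotFileParser()
wildcard.parse("""
User-agent: *
Allow: /some/path
Disallow: /another/path
""".splitlines())

# The default (*) entry is applied to unknown agents even when
# print(parser) does not show it.
wildcard_allowed = wildcard.can_fetch("SomeBot", "/some/path")
wildcard_blocked = not wildcard.can_fetch("SomeBot", "/another/path")

figtree = RobotFileParser()
figtree.parse("""
User-agent: figtree
Crawl-delay: 3
Request-rate: 9/30
Disallow: /tmp
""".splitlines())

# Crawl-delay and Request-rate are parsed and queryable per agent.
delay = figtree.crawl_delay("figtree")
rate = figtree.request_rate("figtree")
print(wildcard_allowed, wildcard_blocked, delay, rate.requests, rate.seconds)
```

This shows that the parsed state is intact and that only the printed representation is affected.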
msg314527 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-03-27 14:18
The default entry was moved out of entries in issue523041, but RobotFileParser.__str__ was not updated. Support for "Crawl-delay" and "Request-Rate" was added in issue16099, but Entry.__str__ was not updated. These look like bugs to me, and I think the fix should be backported.

But the two unnecessary trailing newlines should be kept for compatibility in maintained versions. I think we can get rid of them in 3.8 (unless Senthil has another opinion).
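
To make the two omissions concrete, here is a simplified model of the situation. It is not the CPython source: SketchEntry and render are hypothetical names, and the data layout is an assumption chosen for brevity. It illustrates a string method that emits the Crawl-delay and Request-rate lines and a renderer that appends a default entry stored apart from the other entries. The sketch omits the trailing newlines, so it corresponds to the behavior suggested for 3.8 rather than the one kept in maintained versions.

```python
class SketchEntry:
    """Hypothetical stand-in for one robots.txt entry (not the stdlib class)."""

    def __init__(self, agents, rules, delay=None, req_rate=None):
        self.agents = agents
        self.rules = rules          # list of (allowed, path) pairs
        self.delay = delay          # Crawl-delay value or None
        self.req_rate = req_rate    # (requests, seconds) pair or None

    def __str__(self):
        lines = [f"User-agent: {agent}" for agent in self.agents]
        if self.delay is not None:
            lines.append(f"Crawl-delay: {self.delay}")
        if self.req_rate is not None:
            lines.append("Request-rate: {}/{}".format(*self.req_rate))
        for allowed, path in self.rules:
            lines.append(("Allow" if allowed else "Disallow") + f": {path}")
        return "\n".join(lines)


def render(entries, default_entry=None):
    """Join all entries, appending the wildcard entry that is stored apart."""
    everything = list(entries)
    if default_entry is not None:
        everything.append(default_entry)
    return "\n\n".join(str(entry) for entry in everything)


googlebot = SketchEntry(["Googlebot"], [(True, "/folder1/myfile.html")])
wildcard = SketchEntry(["*"], [(True, "/some/path"), (False, "/another/path")])
figtree = SketchEntry(["figtree"], [(False, "/tmp")], delay=3, req_rate=(9, 30))
print(render([googlebot], default_entry=wildcard))
```

Leaving out either the default_entry step in render or the two optional lines in __str__ reproduces the symptoms from the report.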
msg314821 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2018-04-02 18:57
> But two unnecessary trailing newlines should be kept for compatibility in maintained versions.

Yup, that sounds good to me. The trailing newlines don't seem to be required by any RFC. They are kept only for compatibility, and we can do away with them in 3.8.
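
Since the trailing newlines differ between versions, code that compares or stores the string output may want to normalize it. A minimal sketch, assuming only that stripping trailing newlines is acceptable for the caller:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""
User-agent: figtree
Disallow: /tmp
""".splitlines())

# Strip trailing newlines so the text is the same whether or not the
# running Python version still appends them.
normalized = str(parser).rstrip("\n")
print(normalized)
```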
msg316505 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-14 14:10
New changeset bd08a0af2d88c590ede762102bd42da3437e9980 by Serhiy Storchaka (Michael Lazar) in branch 'master':
bpo-32861: urllib.robotparser fix incomplete __str__ methods. (GH-5711)
https://github.com/python/cpython/commit/bd08a0af2d88c590ede762102bd42da3437e9980
msg316546 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-14 18:14
New changeset c3fa1f2b93fa4bf96a8aadc74ee196384cefa31e by Serhiy Storchaka (Miss Islington (bot)) in branch '3.7':
[3.7] bpo-32861: urllib.robotparser fix incomplete __str__ methods. (GH-5711) (GH-6795)
https://github.com/python/cpython/commit/c3fa1f2b93fa4bf96a8aadc74ee196384cefa31e
msg316589 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-14 22:03
New changeset 3936fd7b2c271f723d1a98fda3ca9c7efd329c04 by Serhiy Storchaka (Miss Islington (bot)) in branch '3.6':
[3.7] bpo-32861: urllib.robotparser fix incomplete __str__ methods. (GH-5711) (GH-6795) (GH-6818)
https://github.com/python/cpython/commit/3936fd7b2c271f723d1a98fda3ca9c7efd329c04
msg316591 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-14 22:09
New changeset 861d38443d4b85cdc7b87afc4adee55f51c2f4b3 by Serhiy Storchaka in branch '2.7':
[2.7] bpo-32861: robotparser fix incomplete __str__ methods. (GH-5711) (GH-6795) (GH-6817)
https://github.com/python/cpython/commit/861d38443d4b85cdc7b87afc4adee55f51c2f4b3
History
Date User Action Args
2018-05-14 22:10:37 serhiy.storchaka set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2018-05-14 22:09:49 serhiy.storchaka set messages: + msg316591
2018-05-14 22:03:58 serhiy.storchaka set messages: + msg316589
2018-05-14 18:44:00 miss-islington set pull_requests: + pull_request6503
2018-05-14 18:35:04 serhiy.storchaka set pull_requests: + pull_request6502
2018-05-14 18:14:48 miss-islington set pull_requests: + pull_request6501
2018-05-14 18:14:35 serhiy.storchaka set messages: + msg316546
2018-05-14 14:13:52 miss-islington set pull_requests: + pull_request6481
2018-05-14 14:11:51 miss-islington set pull_requests: + pull_request6480
2018-05-14 14:10:44 serhiy.storchaka set messages: + msg316505
2018-04-02 18:57:07 orsenthil set messages: + msg314821
2018-03-27 14:18:48 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg314527
versions: + Python 2.7, Python 3.6, Python 3.7
2018-02-23 23:44:13 terry.reedy set nosy: + orsenthil
2018-02-17 01:56:15 michael-lazar set keywords: + patch
stage: patch review
pull_requests: + pull_request5500
2018-02-17 00:52:46 michael-lazar create