Message312259
Hello,
I have stumbled upon a couple of inconsistencies in urllib.robotparser's __str__ methods.
These appear to be unintentional omissions: the parsing code was modified over time, but the string methods were never updated to match.
1. The RobotFileParser.__str__ method doesn't include the default (*) User-agent entry.
>>> from urllib.robotparser import RobotFileParser
>>> parser = RobotFileParser()
>>> text = """
... User-agent: *
... Allow: /some/path
... Disallow: /another/path
...
... User-agent: Googlebot
... Allow: /folder1/myfile.html
... """
>>> parser.parse(text.splitlines())
>>> print(parser)
User-agent: Googlebot
Allow: /folder1/myfile.html
>>>
This is *especially* awkward when parsing a valid robots.txt that contains only a wildcard User-agent: printing the parser produces nothing at all.
>>> from urllib.robotparser import RobotFileParser
>>> parser = RobotFileParser()
>>> text = """
... User-agent: *
... Allow: /some/path
... Disallow: /another/path
... """
>>> parser.parse(text.splitlines())
>>> print(parser)
>>>
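To be clear, the wildcard rules are parsed and honored by can_fetch(); only the string representation drops them. A quick check using the same text as above (the "SomeBot" name is just an arbitrary agent for illustration):

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
text = """
User-agent: *
Allow: /some/path
Disallow: /another/path
"""
parser.parse(text.splitlines())

# The wildcard entry exists and is consulted, even though the
# printed form above was empty.
print(parser.can_fetch("SomeBot", "/some/path"))     # True
print(parser.can_fetch("SomeBot", "/another/path"))  # False
```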
2. Support was recently added for `Crawl-delay` and `Request-rate` lines, but __str__ does not include them in its output.
>>> from urllib.robotparser import RobotFileParser
>>> parser = RobotFileParser()
>>> text = """
... User-agent: figtree
... Crawl-delay: 3
... Request-rate: 9/30
... Disallow: /tmp
... """
>>> parser.parse(text.splitlines())
>>> print(parser)
User-agent: figtree
Disallow: /tmp
>>>
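The values themselves are parsed correctly and are reachable through the public accessors (added in Python 3.6); only __str__ omits them:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
text = """
User-agent: figtree
Crawl-delay: 3
Request-rate: 9/30
Disallow: /tmp
"""
parser.parse(text.splitlines())

# Both values were parsed, even though print(parser) drops them.
print(parser.crawl_delay("figtree"))   # 3
print(parser.request_rate("figtree"))  # RequestRate(requests=9, seconds=30)
```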
3. Two unnecessary trailing newlines are appended to the string output, one after the last RuleLine and one after the last Entry.
(see above examples)
Taken on their own these are all minor issues, but together they make robotparser quite confusing to use from the REPL!
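For what it's worth, here is a minimal sketch of what a combined fix could look like, written as a standalone helper rather than a patch. It relies on undocumented internals of the current implementation (`default_entry`, `entries`, `Entry.useragents`, `Entry.delay`, `Entry.req_rate`, `Entry.rulelines`), so those attribute names are an assumption and may shift between versions:

```python
from urllib.robotparser import RobotFileParser

def robots_str(parser):
    """Render a RobotFileParser including the wildcard (*) entry and any
    Crawl-delay/Request-rate lines, with no trailing newlines."""
    entries = list(parser.entries)
    if parser.default_entry is not None:
        # Put the wildcard entry first, as it typically appears in robots.txt.
        entries.insert(0, parser.default_entry)
    blocks = []
    for entry in entries:
        lines = ["User-agent: %s" % agent for agent in entry.useragents]
        if entry.delay is not None:
            lines.append("Crawl-delay: %s" % entry.delay)
        if entry.req_rate is not None:
            lines.append("Request-rate: %s/%s"
                         % (entry.req_rate.requests, entry.req_rate.seconds))
        # RuleLine.__str__ already renders "Allow: ..." / "Disallow: ..."
        lines.extend(str(rule) for rule in entry.rulelines)
        blocks.append("\n".join(lines))
    # Blank line between entries, no trailing newline.
    return "\n\n".join(blocks)

parser = RobotFileParser()
parser.parse("""
User-agent: *
Allow: /some/path

User-agent: figtree
Crawl-delay: 3
Request-rate: 9/30
Disallow: /tmp
""".splitlines())
print(robots_str(parser))
```

With this helper, the examples above round-trip as expected: the wildcard entry, the Crawl-delay and Request-rate lines, and the rule lines all appear, with no blank lines tacked onto the end.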
Date                | User          | Action | Args
2018-02-17 00:52:46 | michael-lazar | set    | recipients: + michael-lazar
2018-02-17 00:52:46 | michael-lazar | set    | messageid: <1518828766.79.0.467229070634.issue32861@psf.upfronthosting.co.za>
2018-02-17 00:52:46 | michael-lazar | link   | issue32861 messages
2018-02-17 00:52:46 | michael-lazar | create |