
allow unicode arguments for robotparser.can_fetch #42942

Closed
osvenskan mannequin opened this issue Feb 23, 2006 · 10 comments
Labels
  • pending: The issue will be closed if no feedback is provided
  • topic-unicode
  • type-feature: A feature request or enhancement

Comments

@osvenskan
Mannequin

osvenskan mannequin commented Feb 23, 2006

BPO 1437699
Nosy @malemburg, @birkenfeld, @terryjreedy, @osvenskan
Files
  • PythonSessionsShowingRobotParserError.txt: Interactive Python sessions (2.4 & 2.3) showing how to recreate the error
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2006-02-23.21:07:54.000>
    labels = ['type-feature', 'expert-unicode']
    title = 'allow unicode arguments for robotparser.can_fetch'
    updated_at = <Date 2014-02-03.19:40:45.276>
    user = 'https://github.com/osvenskan'

    bugs.python.org fields:

    activity = <Date 2014-02-03.19:40:45.276>
    actor = 'BreamoreBoy'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Unicode']
    creation = <Date 2006-02-23.21:07:54.000>
    creator = 'osvenskan'
    dependencies = []
    files = ['8267']
    hgrepos = []
    issue_num = 1437699
    keywords = []
    message_count = 9.0
    messages = ['54740', '54741', '54742', '54743', '54744', '54745', '115006', '115022', '121019']
    nosy_count = 4.0
    nosy_names = ['lemburg', 'georg.brandl', 'terry.reedy', 'osvenskan']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'test needed'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1437699'
    versions = ['Python 3.2']

    @osvenskan
    Mannequin Author

    osvenskan mannequin commented Feb 23, 2006

    One-line summary: If the robotparser module encounters
    a robots.txt file that contains non-ASCII characters
    AND I pass a Unicode user agent string to can_fetch(),
    that function crashes with a TypeError under Python
    2.4. Under Python 2.3, the error is a UnicodeDecodeError.

    More detail:
    When one calls can_fetch(MyUserAgent, url), the
    robotparser module compares the UserAgent to each user
    agent described in the robots.txt file. If
    isinstance(MyUserAgent, str) == True then the
    comparison does not raise an error regardless of the
    contents of robots.txt. However, if
    isinstance(MyUserAgent, unicode) == True, then Python
    implicitly tries to convert the contents of the
    robots.txt file to Unicode before comparing it to
    MyUserAgent. By default, Python assumes a US-ASCII
    encoding when converting, so if the contents of
    robots.txt aren't ASCII, the conversion fails. In other
    words, this works:
    MyRobotParser.can_fetch('foobot', url)
    but this fails:
    MyRobotParser.can_fetch(u'foobot', url)
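
    To make the failure concrete, here is a minimal sketch of the
    implicit decode (Python 2; the byte values and variable names are
    illustrative, not taken from the robotparser source, and the exact
    exception differs between 2.3 and 2.4 as noted above):

    # A robots.txt line containing 0xe4 ('a'-umlaut in iso-8859-1),
    # held the way robotparser holds it: as a plain byte string (str).
    robots_line = 'User-agent: H\xe4m\xe4h\xe4kki'
    agent = u'foobot'   # the unicode user agent passed to can_fetch()

    'foobot' in robots_line.lower()   # str vs. str: False, no error

    # unicode vs. str forces Python 2 to decode the byte string with
    # the default 'ascii' codec, which chokes on 0xe4:
    try:
        agent in robots_line.lower()
    except UnicodeDecodeError:
        print "implicit ASCII decode of robots.txt contents failed"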

    I recreated this with Python 2.4.1 on FreeBSD 6 and
    Python 2.3 under Darwin/OS X. I'll attach examples from
    both. The URLs that I use in the attachments are from
    my Web site and will remain live. They reference
    robots.txt files which contain an umlaut-ed 'a' (0xe4
    in iso-8859-1). They're served up using a special
    .htaccess file that adds a Content-Type header which
    correctly identifies the encoding used for each file.
    Here's the contents of the .htaccess file:

    AddCharset iso-8859-1 .iso8859-1
    AddCharset utf-8 .utf8

    A suggested solution:
    AFAICT, the construction of robots.txt is still defined
    by "a consensus on 30 June 1994 on the robots mailing
    list" [http://www.robotstxt.org/wc/norobots.html] and a
    1996 draft proposal
    [http://www.robotstxt.org/wc/norobots-rfc.html] that
    has never evolved into a formal standard. Neither of
    these mention character sets or encodings which is no
    surprise considering that they date back to the days
    when the Internet was poor but happy and we considered
    even ASCII a luxury and we were grateful to have it.
    ("ASCII? We used to dream of having ASCII. We only had
    one bit, and it was a zero. We lived in a shoebox in
    the middle of the road..." etc.) A backwards-compatible
    yet forward-looking solution would be to have the
    robotparser module respect the Content-Type header sent
    with robots.txt. If no such header is present,
    robotparser should try to decode it using iso-8859-1
    per section 3.7.1 of the HTTP 1.1 spec
    (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1)
    which says, 'When no explicit charset parameter is
    provided by the sender, media subtypes of the "text"
    type are defined to have a default charset value of
    "ISO-8859-1" when received via HTTP. Data in character
    sets other than "ISO-8859-1" or its subsets MUST be
    labeled with an appropriate charset value.' Section
    3.6.1 of the HTTP 1.0 spec says the same. Since
    ISO-8859-1 is a superset of US-ASCII, robots.txt files
    that are pure ASCII won't be affected by the change.
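
    As a rough sketch of that suggestion (Python 2; read_robots_txt and
    its fallback handling are hypothetical, not the stdlib code):

    import urllib2

    def read_robots_txt(url):
        f = urllib2.urlopen(url)
        raw = f.read()
        # f.headers is a mimetools.Message; getparam('charset') returns
        # None when Content-Type carries no charset parameter.
        charset = f.headers.getparam('charset') or 'iso-8859-1'
        # Decode with the declared charset, else the HTTP 1.1 default.
        return raw.decode(charset)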

    @osvenskan osvenskan mannequin assigned smontanaro Feb 23, 2006
    @osvenskan osvenskan mannequin added the topic-unicode and type-feature labels Feb 23, 2006
    @terryjreedy
    Member

    To me, this is not a bug report but at best an RFE. The
    reported behavior is what I would expect. I read both the
    module doc and the referenced web page and further links.
    The doc does not mention Unicode as allowed, and the 300
    registered UserAgents at
    http://www.robotstxt.org/wc/active/html/index.html
    all have ASCII names.

    So I recommend closing this as a bug report, but will give
    ML a chance to respond. If it is switched to a Feature
    Request instead, I would think it would need some 'in the
    wild' evidence of need.

    @osvenskan
    Mannequin Author

    osvenskan mannequin commented Mar 7, 2006

    Thanks for looking at this. I have some followup comments.

    The list at robotstxt.org is many years stale (note that
    Google's bot is present only as Backrub, which was still a
    server at Stanford at the time:
    http://www.robotstxt.org/wc/active/html/backrub.html), but
    nevertheless AFAICT it is the most current bot list on the
    Web. If you look carefully, the list *does* contain a
    non-ASCII entry (#76 -- easy to miss in that long list). That
    Finnish bot is gone, but it has left a legacy in the form of
    many robots.txt files that were created by automated tools
    based on the robotstxt.org list. Google helps us here:
    http://www.google.com/search?q=allintext%3AH%C3%A4m%C3%A4h%C3%A4kki+disallow+filetype%3Atxt

    And by Googling for some common non-ASCII words and letters
    I can find more like this one (look at the end of the
    alphabetical list):
    http://paranormal.se/robots.txt

    Robots.txt files that contain non-ASCII are few and far
    between, it seems, but they're out there.

    Which leads me to a nitpicky (but important!) point about
    Unicode. As you point out, the spec doesn't mention Unicode;
    it says nothing at all on the topic of encodings. My
    argument is that just because the spec doesn't mention
    encodings doesn't let us off the hook because the HTTP
    1.0/1.1 specs are very clear that iso-8859-1, not US-ASCII,
    is the default for text content delivered via HTTP. By my
    interpretation, this means that the robots.txt examples
    provided above are compliant with published specs, therefore
    code that fails to interpret them does not comply. There's
    no obvious need for robotparser to support full-blown
    Unicode, just iso-8859-1.

    You might be interested in a replacement for this module
    that I've implemented. It does everything that robotparser
    does and also handles non-ASCII plus a few other things. It
    is GPL; you're welcome to copy it in part or lock, stock and
    barrel. So far I've only tested it "in the lab" but I've
    done fairly extensive unit testing and I'll soon be testing
    it on real-world data. The code and docs are here:
    http://semanchuk.com/philip/boneyard/rerp/

    Comments & feedback would be most welcome.

    @birkenfeld
    Member

    Turning into a Feature Request.

    @malemburg
    Member

    Reassigning to Skip: I don't use robotparser.

    Skip, perhaps you can have a look? (Didn't you write the
    robotparser?)

    @osvenskan
    Mannequin Author

    osvenskan mannequin commented Apr 6, 2006

    I've also discovered that robotparser can get confused by
    files with BOMs (byte order marks). At minimum it should
    ignore BOMs; at best, it should use them as clues to the
    file's encoding. It does neither, and instead treats the BOM
    as character data. That's especially problematic when the
    robots.txt file consists of this:
    [BOM]User-agent: *
    Disallow: /

    In that case, robotparser fails to recognize the string
    "User-agent", so the disallow rule is ignored. That in turn
    means it treats the file as empty, and all robots are
    permitted everywhere, which is the exact opposite of what
    the author intended. If the first line is a comment, then
    robotparser doesn't get confused regardless of whether or
    not there's a BOM.
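
    A minimal sketch of the BOM handling asked for above (Python 2; the
    decode_robots helper is hypothetical, and the iso-8859-1 fallback
    follows the HTTP default argued for earlier):

    import codecs

    def decode_robots(raw):
        # Best case: use the BOM as an encoding clue, then strip it.
        if raw.startswith(codecs.BOM_UTF8):
            return raw[len(codecs.BOM_UTF8):].decode('utf-8')
        # Otherwise fall back to the HTTP 1.1 default charset.
        return raw.decode('iso-8859-1')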

    I created a sample robots.txt file exactly like the one
    above; it contains a utf-8 BOM. The example below uses this
    file which is on my Web site.

    >>> import robotparser
    >>> rp=robotparser.RobotFileParser()
    >>> rp.set_url("http://semanchuk.com/philip/boneyard/robots/robots.txt.bom")
    >>> rp.read()
    >>> rp.can_fetch("foobot", "/")  # should return False
    True
    >>> 

    My robot parser module doesn't suffer from the BOM bug
    (although it doesn't use BOMs to decode the file, either,
    which it really ought to). As I said before, you're welcome
    to steal code from it or copy it wholesale (it is GPL).
    Also, I'll be happy to open a different bug report if you
    feel this should be a separate issue.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Aug 26, 2010

    No comments on this for 4 1/2 years. Is this still valid and/or is anyone still interested?

    @terryjreedy
    Member

    While Python is 'GPL compatible', whatever that means, it cannot incorporate GPLed code in the PSF distribution. Code must be contributed under one of the two licenses in the contributor agreement. Philip, can you contribute a patch appropriate to 3.x?

    In 3.x, robotparser is urllib.robotparser. Under the 'be generous in what you accept' principle, expansion of the accepted names would seem to be good.

    DOC PATCH NEEDED: The doc says "For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html ."
    That link seems not to exist. The safest link is to the site. The specific replacement is http://www.robotstxt.org/robotstxt.html .

    @terryjreedy
    Member

    The .../orig.html link now works and was last updated in August.
    It has a link to .../robotstxt.html.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel
    Member

    This has been abandoned for over a decade. Marking as pending; I will close it soon unless someone objects.

    @iritkatriel iritkatriel added the pending label Apr 25, 2023
    @hauntsaninja hauntsaninja closed this as not planned (won't fix, can't repro, duplicate, stale) May 1, 2023
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants