urllib.quote and unquote - Unicode issues #47550

Closed
mgiuca mannequin opened this issue Jul 6, 2008 · 80 comments
Labels: release-blocker, stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)



mgiuca mannequin commented Jul 6, 2008

BPO 3300
Nosy @gvanrossum, @loewis, @orsenthil, @pitrou
Files
  • parse.py.patch: (obsolete) Patch fixing all three issues; commit log in comment
  • parse.py.patch2: (obsolete) Second patch (supersedes parse.py.patch); commit log in comment
  • parse.py.patch3: (obsolete) Third patch (supersedes parse.py.patch2); commit log in comment
  • parse.py.patch4: (obsolete) Fourth patch (supersedes parse.py.patch3); commit log in comment
  • parse.py.patch5: (obsolete) Fifth patch (supersedes parse.py.patch4); commit log in comment
  • parse.py.patch6: (obsolete) Sixth patch (supersedes parse.py.patch5); commit log in comment
  • parse.py.patch7: (obsolete) Seventh patch (supersedes parse.py.patch6); commit log in comment
  • parse.py.patch8: (obsolete) Eighth patch (supersedes parse.py.patch7); commit log in comment
  • parse.py.metapatch8: Diff between patch7 and patch8 (result of review).
  • patch
  • parse.py.patch8+allsafe: Patch8, and quote allows all characters in 'safe'
  • parse.py.patch9: Ninth patch (supersedes parse.py.patch8); commit log in comment for patch 8
  • parse.py.patch10: Tenth patch (supersedes parse.py.patch9); commit log in comment
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/gvanrossum'
    closed_at = <Date 2008-08-18.21:45:16.480>
    created_at = <Date 2008-07-06.14:52:09.661>
    labels = ['type-bug', 'library', 'release-blocker']
    title = 'urllib.quote and unquote - Unicode issues'
    updated_at = <Date 2008-08-20.10:03:34.782>
    user = 'https://bugs.python.org/mgiuca'

    bugs.python.org fields:

    activity = <Date 2008-08-20.10:03:34.782>
    actor = 'mgiuca'
    assignee = 'gvanrossum'
    closed = True
    closed_date = <Date 2008-08-18.21:45:16.480>
    closer = 'gvanrossum'
    components = ['Library (Lib)']
    creation = <Date 2008-07-06.14:52:09.661>
    creator = 'mgiuca'
    dependencies = []
    files = ['10829', '10870', '10873', '10883', '10888', '11009', '11015', '11069', '11070', '11089', '11092', '11093', '11111']
    hgrepos = []
    issue_num = 3300
    keywords = ['patch']
    message_count = 80.0
    messages = ['69333', '69339', '69366', '69472', '69473', '69485', '69493', '69508', '69519', '69535', '69537', '69583', '69591', '70497', '70512', '70771', '70788', '70791', '70793', '70800', '70804', '70806', '70807', '70818', '70824', '70828', '70830', '70833', '70834', '70840', '70855', '70858', '70861', '70862', '70868', '70869', '70872', '70878', '70879', '70880', '70913', '70949', '70955', '70958', '70959', '70962', '70965', '70969', '70970', '71042', '71043', '71054', '71055', '71057', '71064', '71065', '71069', '71072', '71073', '71082', '71083', '71084', '71085', '71086', '71088', '71089', '71090', '71091', '71092', '71096', '71121', '71124', '71126', '71130', '71131', '71332', '71356', '71387', '71530', '71533']
    nosy_count = 8.0
    nosy_names = ['gvanrossum', 'loewis', 'jimjjewett', 'janssen', 'orsenthil', 'pitrou', 'thomaspinckney3', 'mgiuca']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'accepted'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue3300'
    versions = ['Python 3.0']


    mgiuca mannequin commented Jul 6, 2008

    Three Unicode-related problems with urllib.parse.quote and
    urllib.parse.unquote in Python 3.0. (Patch attached).

    Firstly, unquote appears not to have been modified from Python 2, where
    it is designed to output a byte string. In Python 3, it outputs a
    unicode string, implicitly decoded as ISO-8859-1 (the code points are
    the same as the bytes). RFC 3986 states that the percent-encoded byte
    values should be decoded as UTF-8.

    http://tools.ietf.org/html/rfc3986 section 2.5.

    Current behaviour:
    >>> urllib.parse.unquote("%CE%A3")
    'Î£'
    (or '\u00ce\u00a3')
    
    Desired behaviour:
    >>> urllib.parse.unquote("%CE%A3")
    'Σ'
    (or '\u03a3')

    Secondly, while quote *has* been modified to encode to UTF-8 before
    percent-encoding, it does not work correctly for characters in
    range(128, 256), due to a special case in the code which again treats
    the code point values as byte values.

    Current behaviour:
    >>> urllib.parse.quote('\u00e9')
    '%E9'
    
    Desired behaviour:
    >>> urllib.parse.quote('\u00e9')
    '%C3%A9'

    Note that currently, quoting characters less than 256 will use
    ISO-8859-1, while quoting characters 256 or higher will use UTF-8!

    Thirdly, the "safe" argument to quote does not work for characters above
    256, since these are excluded from the special case. I thought I would
    fix this at the same time, but it's really a separate issue.

    Current behaviour:
    >>> urllib.parse.quote('Σϰ', safe='Σ')
    '%CE%A3%CF%B0'
    
    Desired behaviour:
    >>> urllib.parse.quote('Σϰ', safe='Σ')
    'Σ%CF%B0'

    A patch which fixes all three issues is attached. Note that unquote now
    needs to handle the case where the UTF-8 sequence is invalid. This is
    currently handled by "replace" (invalid sequences are replaced by
    '\ufffd'). I would like to add an optional "errors" argument to unquote,
    defaulting to "replace", to allow the user to override this behaviour,
    but I didn't put that in because it would change the interface.

    Note I also changed one of the test cases, which had the wrong expected
    output. (String literal was manually UTF-8 encoded, designed for Python
    2; nonsensical when viewed as a Python 3 Unicode string).

    All urllib test cases pass.

    Patch is for branch /branches/py3k, revision 64752.

    Note: The above unquote issue also manifests itself in Python 2 for
    Unicode strings, but it's hazy as to what the behaviour should be, and
    would break existing programs, so I'm just patching the Py3k branch.

    Commit log:

    urllib.parse.unquote: Fixed percent-encoded octets being implicitly
    decoded as ISO-8859-1; now decode as UTF-8, as per RFC 3986.

    urllib.parse.quote: Fixed characters in range(128, 256) being implicitly
    encoded as ISO-8859-1; now encode as UTF-8. Also fixed characters
    greater than 256 not responding to "safe", and also not being cached.

    Lib/test/test_urllib.py: Updated one test case for unquote which
    expected the wrong output. The new version of unquote passes the new
    test case.

    @mgiuca mgiuca mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jul 6, 2008

    loewis mannequin commented Jul 6, 2008

    RFC 3986 states that the percent-encoded byte
    values should be decoded as UTF-8.

    Where precisely do you read such a SHOULD requirement?
    Section 2.5 elaborates that the local encoding (of the
    resource) is typically used, ignoring cases where URIs
    are constructed on the client system (such a scenario is
    simply ignored in the RFC).

    The last paragraph in section 2.5 is the only place that
    seems to imply a SHOULD requirement (although it doesn't
    use the keyword); this paragraph only talks about new URI
    schemes. Unfortunately, for http, the encoding of
    characters is unspecified (this is somewhat solved by the
    introduction of IRIs).


    mgiuca mannequin commented Jul 7, 2008

    Point taken. But the RFC certainly doesn't say that ISO-8859-1 should be
    used. Since we're outputting a Unicode string in Python 3, we need to
    decode with some encoding, and UTF-8 seems the most sensible and
    standardised.
    (Even the existing test case in test_urllib.py:466 uses a UTF-8-encoded
    URL, and I had to fix it so it decodes into a meaningful string).

    Having said that, it's possible that you may wish to use another
    encoding, and legal to do so. Therefore, I'd suggest we add an
    "encoding" argument to both quote and unquote, which defaults to "utf-8".

    Note that in the current implementation, unquote is not an inverse of
    quote, because quote uses UTF-8 to encode characters with code points >=
    256, while unquote decodes them as ISO-8859-1. I think it's important
    these two functions are inverses of each other.


    thomaspinckney3 mannequin commented Jul 9, 2008

    I mentioned this in a brief python-dev discussion earlier this
    spring, but many popular websites such as Wikipedia and Facebook do use
    UTF-8 as their character encoding scheme for the path and argument
    portion of URLs.

    I know there's no RFC that says this is what should be done, but in
    order to make urllib work out-of-the-box on as many common websites as
    possible, I think defaulting to UTF-8 decoding makes a lot of sense.

    Possibly allow an optional charset argument to be passed into quote and
    unquote, but default to UTF-8 in the absence of an explicit character
    set being passed in?


    mgiuca mannequin commented Jul 9, 2008

    OK I've gone back over the patch and decided to add the "encoding" and
    "errors" arguments from the str.encode/decode methods as optional
    arguments to quote and unquote. This is a much bigger change than I
    originally intended, but I think it makes things much better because
    we'll get UTF-8 by default (which as far as I can tell is by far the
    most common encoding).

    (Tom Pinckney just made the same suggestion right as I'm typing this up!)

    So my new patch is a bit more extensive, and changes the interface (in a
    backwards-compatible way). Both quote and unquote now support "encoding"
    and "errors" arguments, defaulting to "utf-8" and "replace", respectively.

    Implementation detail: This changes the Quoter class a lot; it now
    hashes four fields to ensure it doesn't use the wrong cache.

    Also fixed an issue with the previous patch where non-ASCII-compatible
    encodings broke for code points < 128.

    I then ran the full test suite and discovered two other modules' test
    cases broke. I've fixed them so the full suite passes, but I'm
    suspicious there may be more issues (see below).

    • Lib/test/test_http_cookiejar.py: A test case was written explicitly
      expecting Latin-1 encoding. I've changed this test case to expect UTF-8.
    • Lib/email/utils.py: I extensively analysed this code and discovered
      that it kind of "cheats" - it uses the Latin-1 encoding and treats it as
      octets, then applies its own encoding scheme. So to fix this, I changed
      the email module to call quote and unquote with encoding="latin-1".
      Hence it has the same behaviour as before (see the sketch just below).
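
    To make that "cheat" concrete (an illustration of the technique, not
    the email module's actual code): decoding percent-escaped octets as
    Latin-1 maps each octet to the code point of the same value, so the
    original bytes can always be recovered later:

    >>> raw = urllib.parse.unquote('%C3%A1', encoding='latin-1')
    >>> raw                               # one character per octet
    'Ã¡'
    >>> raw.encode('latin-1').decode('utf-8')
    'á'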

    Some potential issues:

    • I have not updated the documentation yet. If this idea is to go ahead,
      the docs will need to show these new optional arguments. (I'll do that
      myself but haven't yet).
    • While the full test suite passes, I'm sure there will be many more
      issues since I've changed the interface. Therefore I don't recommend
      this patch is accepted just yet. I plan to do an investigation into all
      uses (within the standard lib) of quote and unquote to see if there are
      any other compatibility issues, particularly within urllib. Hence I'll
      respond to this again in a few days.
    • The new patch to the "safe" argument of quote allows non-ASCII characters
      to be made safe. This correspondingly allows the construction of URIs
      with non-ASCII characters. Is it better to allow users to do this if
      they really want, or just mysteriously fail to let those characters through?

    I would also like to have a separate pair of functions, unquote_raw and
    quote_raw, which work on bytes objects instead of strings. (unquote_raw
    would take a str and produce a bytes, while quote_raw would take a bytes
    and produce a str). As URI encoding is fundamentally an octet encoding,
    not a character encoding, this is the only way to do URI encoding
    without choosing a Unicode character encoding. (I see some modules such
    as "email" treating the implicit Latin-1 encoding as byte encoding,
    which is a bit dodgy - they could benefit from raw functions). But as
    that requires further changes to the interface, I'll save it for another
    day.
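
    A sketch of how such a raw pair would behave (hypothetical names and
    outputs, following the description above; not part of this patch):

    >>> unquote_raw('%CE%A3')    # str -> bytes; no charset choice involved
    b'\xce\xa3'
    >>> quote_raw(b'\xce\xa3')   # bytes -> str
    '%CE%A3'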

    Patch (parse.py.patch2) is for branch /branches/py3k, revision 64820.

    Commit log:

    urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
    allowing the caller to determine the decoding of percent-encoded octets
    (previously implicitly decoded as ISO-8859-1). As per RFC 3986, default
    is "utf-8".

    urllib.parse.quote: Added "encoding" and "errors" optional arguments,
    allowing the caller to determine the encoding of non-ASCII characters
    before being percent-encoded (previously characters in range(128, 256)
    were encoded as ISO-8859-1, and characters above that as UTF-8). Also
    fixed characters greater than 256 not responding to "safe", and also not
    being cached.

    Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test
    cases which expected output in ISO-8859-1, now expects UTF-8.

    Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
    with encoding="latin-1", to preserve existing behaviour (which the whole
    email module is dependent upon).


    loewis mannequin commented Jul 9, 2008

    Assuming the patch is acceptable in the first place (which I personally
    have not made up my mind about), it lacks documentation and test suite
    changes.


    mgiuca mannequin commented Jul 10, 2008

    OK well here are the necessary changes to the documentation (RST docs
    and docstrings in the code).

    As I said above, I plan to do extensive testing and add new cases, and I
    don't recommend this patch be accepted until that's done.

    Patch (parse.py.patch3) is for branch /branches/py3k, revision 64834.

    Commit log:

    urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
    allowing the caller to determine the decoding of percent-encoded octets
    (previously implicitly decoded as ISO-8859-1). As per RFC 3986, default
    is "utf-8".

    urllib.parse.quote: Added "encoding" and "errors" optional arguments,
    allowing the caller to determine the encoding of non-ASCII characters
    before being percent-encoded (previously characters in range(128, 256)
    were encoded as ISO-8859-1, and characters above that as UTF-8). Also
    fixed characters greater than 256 not responding to "safe", and also not
    being cached.

    Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
    reflect new interface.

    Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test
    cases which expected output in ISO-8859-1, now expects UTF-8.

    Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
    with encoding="latin-1", to preserve existing behaviour (which the whole
    email module is dependent upon).


    mgiuca mannequin commented Jul 10, 2008

    Setting Version back to Python 3.0. Is there a reason it was set to
    Python 3.1? This proposal will certainly break a lot of code. It's *far*
    better to do it in the big backwards-incompatible Python 3.0 release
    than a later release.


    loewis mannequin commented Jul 10, 2008

    Setting Version back to Python 3.0. Is there a reason it was set to
    Python 3.1?

    3.0b1 has been released, so no new features can be added to 3.0.


    mgiuca mannequin commented Jul 11, 2008

    3.0b1 has been released, so no new features can be added to 3.0.

    While my proposal is no doubt going to cause a lot of code breakage, I
    hardly consider it a "new feature". This is very definitely a bug. As I
    understand it, the point of a code freeze is to stop the addition of
    features which could be added to a later version. Realistically, there
    is no way this issue can be fixed after 3.0 is released, as it
    necessarily involves changing the behaviour of this function.

    Perhaps I should explain further why this is a regression from Python
    2.x and not a feature request. In Python 2.x, with byte strings, the
    encoding is not an issue. quote and unquote simply encode bytes, and if
    you want to use Unicode you have complete control. In Python 3.0, with
    Unicode strings, if functions manipulate string objects, you don't have
    control over the encoding unless the functions give you explicit
    control. So Python 3.0's native Unicode strings have broken the library.

    I give two examples.

    Firstly, I believe that unquote(quote(x)) == x should hold for all
    strings x. In Python 2.x, this is always trivially true (for non-Unicode
    strings), because both functions simply encode and decode the octets. In
    Python 3.0, the two functions are inconsistent, and break outside the
    range(0, 256).

    >>> urllib.parse.unquote(urllib.parse.quote('ÿ')) # '\u00ff'
    'ÿ'
    # Works, because both functions work with ISO-8859-1 in this range.
    
    >>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
    'Ä\x80'
    # Fails, because quote uses UTF-8 and unquote uses ISO-8859-1.
    
    My patch succeeds for all characters.
    >>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
    'Ā'

    Secondly, a bigger example, but I want to demonstrate how this bug
    affects web applications, even very simple ones.

    Consider this simple (beginnings of a) wiki system in Python 2.5, as a
    CGI app:

    #---

    import cgi
    
    fields = cgi.FieldStorage()
    title = fields.getfirst('title')
    
    print("Content-Type: text/html; charset=utf-8")
    print("")
    
    print('<p>Debug: %s</p>' % repr(title))
    if title is None:
        print("No article selected")
    else:
        print('<p>Information about %s.</p>' % cgi.escape(title))
    #

    (Place this in cgi-bin, navigate to it, and add the query string
    "?title=Page Title"). I'll use the page titled "Mátt" as a test case.

    If you navigate to "?title=Mátt", it displays the text "Debug:
    'M\xc3\xa1tt'. Information about Mátt.". The browser (at least Firefox,
    Safari and IE I have tested) encodes this as "?title=M%C3%A1tt". So this
    is trivial, as it's just being unquoted into a raw byte string
    'M\xc3\xa1tt', then written out again as a byte string.

    Now consider that you want to manipulate it as a Unicode string, still
    in Python 2.5. You could augment the program to decode it as UTF-8 and
    then re-encode it. (I wrote a simple UTF-8 printing function which takes
    Unicode strings as input).

    #---

    import sys
    import cgi
    
    def printu8(*args):
        """Prints to stdout encoding as utf-8, rather than the current terminal
        encoding. (Not a fully-featured print function)."""
        sys.stdout.write(' '.join([x.encode('utf-8') for x in args]))
        sys.stdout.write('\n')
    
    fields = cgi.FieldStorage()
    title = fields.getfirst('title')
    if title is not None:
        title = str(title).decode("utf-8", "replace")
    
    print("Content-Type: text/html; charset=utf-8")
    print("")
    
    print('<p>Debug: %s.</p>' % repr(title))
    if title is None:
        print("No article selected.")
    else:
        printu8('<p>Information about %s.</p>' % cgi.escape(title))
    #

    Now given the same input ("?title=Mátt"), it displays "Debug:
    u'M\xe1tt'. Information about Mátt." Still working fine, and I can
    manipulate it as Unicode because in Python 2.x I have direct control
    over encoding/decoding.

    Now let us upgrade this program to Python 3.0. (Note that I still can't
    print Unicode characters directly out, because running through Apache
    the stdout encoding is not UTF-8, so I use my printu8 function).

    #---

    import sys
    import cgi
    
    def printu8(*args):
        """Prints to stdout encoding as utf-8, rather than the current terminal
        encoding. (Not a fully-featured print function)."""
        sys.stdout.buffer.write(b' '.join([x.encode('utf-8') for x in args]))
        sys.stdout.buffer.write(b'\n')
    
    fields = cgi.FieldStorage()
    title = fields.getfirst('title')
    # Note: No call to decode. I have no opportunity to specify the encoding
    # since it comes straight out of FieldStorage as a Unicode string.
    
    print("Content-Type: text/html; charset=utf-8")
    print("")
    
    print('<p>Debug: %s.</p>' % ascii(title))
    if title is None:
        print("No article selected.")
    else:
        printu8('<p>Information about %s.</p>' % cgi.escape(title))
    #

    Now given the same input ("?title=Mátt"), it displays "Debug:
    'M\xc3\xa1tt'. Information about Mátt." Once again, it is erroneously
    (and implicitly) decoded as ISO-8859-1, so I end up with a meaningless
    Unicode string. The only possible thing I can do about this as a web
    developer is call title.encode('latin-1').decode('utf-8') - a dreadful hack.

    With my patch applied, the input ("?title=Mátt") produces the output
    "Debug: 'M\xe1tt'. Information about Mátt."

    Basically, this bug is going to affect all web developers as soon as
    someone types a non-ASCII character. You could argue that supporting
    UTF-8 by default is no better than supporting Latin-1 by default, but it
    is. UTF-8 supports encoding of all characters where Latin-1 does not,
    UTF-8 is the recommended URI encoding by both the URI Syntax RFC[1] and
    the W3C HTML 4.01 specification[2], and all major browsers use it to
    encode non-ASCII characters in URIs.

    My patch may not be the best, or most conservative, solution to this
    problem. I'm happy to see other proposals. But it's clearly an important
    bug to fix, if I can't even write the simplest web app I can think of
    without having to use a kludgey hack to get the string decoded
    correctly. What is the point of having nice clean Unicode strings in the
    language if the library spits out the wrong characters and it requires
    more work to fix them than it used to with byte strings?

    [1] http://tools.ietf.org/html/rfc3986#section-2.5
    [2] http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1


    mgiuca mannequin commented Jul 11, 2008

    Since I got a complaint that my last reply was too long, I'll summarize it.

    It's a bug report, not a feature request.

    I can't get a simple web app to be properly Unicode-aware in Python 3,
    which worked fine in Python 2. This cannot be put off until 3.1, as any
    viable solution will break existing code.


    mgiuca mannequin commented Jul 12, 2008

    OK I spent a while writing test cases for quote and unquote, encoding and
    decoding various Unicode strings with different encodings. As a result,
    I found a bunch of issues in my previous patch, so I've rewritten the
    patches to both quote and unquote. They're both actually more similar to
    the original version now.

    I'd be interested in hearing if anyone disagrees with my expected output
    for these test cases.

    I'm now confident I have good test coverage directly on the quote and
    unquote functions. However, I haven't tested the other library functions
    which depend upon them (though the entire test suite passes). Though as
    I showed in that big post I made yesterday, other modules such as cgi
    seem to be working fine (their behaviour has changed; they use UTF-8
    now; but that's the whole point of this patch).

    I still haven't figured out what the behaviour of "safe" should be in
    quote. Should it only allow ASCII characters (thereby limiting the
    output to an ASCII string, as specified by RFC 3986)? Should it also
    allow Latin-1 characters, or all Unicode characters as well (perhaps
    allowing you to create IRIs -- admittedly I don't know much about IRIs).
    The new implementation of quote makes it rather difficult to allow
    non-Latin-1 characters to be made "safe", as it encodes the string into
    bytes before any processing.
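
    For illustration, with an ASCII-only "safe" set (output assumes the
    UTF-8 default of this patch):

    >>> urllib.parse.quote('/El Niño/', safe='/')
    '/El%20Ni%C3%B1o/'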

    Patch (parse.py.patch4) is for branch /branches/py3k, revision 64891.

    Commit log:

    urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
    allowing the caller to determine the decoding of percent-encoded octets.
    As per RFC 3986, default is "utf-8" (previously implicitly decoded as
    ISO-8859-1).

    urllib.parse.quote: Added "encoding" and "errors" optional arguments,
    allowing the caller to determine the encoding of non-ASCII characters
    before being percent-encoded. Default is "utf-8" (previously characters
    in range(128, 256) were encoded as ISO-8859-1, and characters above that
    as UTF-8). Also characters above 128 are no longer allowed to be "safe".

    Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
    reflect new interface.

    Lib/test/test_urllib.py: Added several new test cases testing encoding
    and decoding Unicode strings with various encodings. This includes
    updating one test case to now expect UTF-8 by default.

    Lib/test/test_http_cookiejar.py: Updated test case which expected output
    in ISO-8859-1, now expects UTF-8.

    Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
    with encoding="latin-1", to preserve existing behaviour (which the whole
    email module is dependent upon).


    mgiuca mannequin commented Jul 12, 2008

    So today I grepped for "urllib" in the entire library in an effort to
    track down every dependency on quote and unquote to see exactly how my
    patch breaks other code. I've now investigated every module in the
    library which uses quote, unquote or urlencode, and my findings are
    documented below in detail.

    So far I have found no code "breakage" except for the original
    email.utils issue I fixed in patch 2. Of course that doesn't mean the
    behaviour hasn't changed. Nearly all modules in the report below have
    changed their behaviour so they used to deal with Latin-1-encoded URLs
    and now deal with UTF-8-encoded URLs. As discussed at length above, I
    see this as a positive change, since nearly everybody encodes URLs in
    UTF-8, and of course it allows for all characters.

    I also point out that the http.server module (unpatched) is internally
    broken when dealing with filenames with characters outside range(0,256);
    my patch fixes it.

    I'm attaching patch 5, which adds a bunch of new test cases to various
    modules which demonstrate those modules correctly handling UTF-8-encoded
    URLs. It also fixes a bug in email.utils which I introduced in patch 2.

    Note that I haven't yet fully investigated urllib.request.

    Aside from that, the only remaining matter is whether or not it's better
    to encode URLs as UTF-8 or Latin-1 by default, and I'm pretty sure that
    question doesn't need debate.

    So basically I think if there's support for it, this patch is just about
    ready to be accepted. I'm hoping it can be included in the 3.0b2 release
    next week.

    I'd be glad to hear any feedback about this proposal.

    Not Yet Investigated
    --------------------

    ./urllib/request.py
    By far the biggest user of quote and unquote.
    username, password, hostname and paths are now all converted
    to/from UTF-8 percent-encodings.
    Other concerns are:
    * Data in the form application/x-www-form-urlencoded
    * FTP access
    I think this needs to be tested further.

    Looks fine, not tested
    ----------------------

    ./xmlrpc/client.py
    Just used to decode URI auth string (user:pass). This will change
    to UTF-8, but is probably OK.
    ./logging/handlers.py
    Just uses it in the HTTP handler to encode a dictionary. Probably
    preferable to use UTF-8 to encode an arbitrary string.
    ./macurl2path.py
    Calls to urllib look broken. Not tested.

    Tested manually, fine
    ---------------------

    ./wsgiref/simple_server.py
    Just used to set PATH_INFO, fine if URLs are UTF-8 encoded.
    ./http/server.py
    All uses are for translating between actual file-system paths to
    URLs. This works fine for UTF-8 URLs. Note that since it uses
    quote to create URLs in a dir listing, and unquote to handle
    them, it breaks when unquote is not the inverse of quote.

    Consider the following simple script:
    
        import http.server
        s = http.server.HTTPServer(('',8000),
                http.server.SimpleHTTPRequestHandler)
        s.serve_forever()
    This will "kind of" work in the unpatched version, using
    Latin-1 URLs, but filenames with characters above 256 will
    break (give a 404 error).
    The patch fixes this.
    

    ./urllib/robotparser.py
    No test cases. Manually tested, URLs properly match when
    percent-encoded in UTF-8.
    ./nturl2path.py
    No test cases available. Manually tested, fine if URLs are
    UTF-8 encoded.

    Test cases either exist or added, fine
    --------------------------------------

    ./test/test_urllib.py
    I wrote a large wad of test cases for all the new functionality.
    ./wsgiref/util.py
    Added test cases expecting UTF-8.
    ./http/cookiejar.py
    I changed a test case to expect UTF-8.
    ./email/utils.py
    I changed this file to behave as it used to, to satisfy its
    existing test cases.
    ./cgi.py
    Added test cases for UTF-8-encoded query strings.

    Commit log:

    urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
    allowing the caller to determine the decoding of percent-encoded octets.
    As per RFC 3986, default is "utf-8" (previously implicitly decoded as
    ISO-8859-1).

    urllib.parse.quote: Added "encoding" and "errors" optional arguments,
    allowing the caller to determine the encoding of non-ASCII characters
    before being percent-encoded. Default is "utf-8" (previously characters
    in range(128, 256) were encoded as ISO-8859-1, and characters above that
    as UTF-8). Also characters above 128 are no longer allowed to be "safe".

    Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
    reflect new interface.

    Lib/test/test_urllib.py: Added several new test cases testing encoding
    and decoding Unicode strings with various encodings. This includes
    updating one test case to now expect UTF-8 by default.

    Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
    Lib/test/test_wsgiref.py: Updated and added test cases to deal with
    UTF-8-encoded URIs.

    Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
    with encoding="latin-1", to preserve existing behaviour (which the whole
    email module is dependent upon).


    mgiuca mannequin commented Jul 31, 2008

    OK after a long discussion on the mailing list, Guido gave this the OK,
    with the proviso that there be str->bytes and bytes->str versions of
    these functions as well. So I've written those.

    http://mail.python.org/pipermail/python-dev/2008-July/081601.html

    quote itself now accepts either a str or a bytes. quote_from_bytes is a
    new function which is just an alias for quote. (Is this acceptable?)

    unquote is still str->str. I've added a totally separate function
    unquote_to_bytes which is str->bytes.

    Note there is a slight issue here: I didn't quite know what to do with
    unescaped non-ASCII characters in the input to unquote_to_bytes - they
    need to somehow be converted to bytes. I chose to encode them using
    UTF-8, on the basis that they technically shouldn't be in a URI anyway.

    Note that my new unquote doesn't have this problem; it's carefully
    written to preserve the Unicode characters, even if they aren't
    expressible in the given encoding (which explains some of the code bloat).

    This makes unquote(s, encoding=e) necessarily more robust than
    unquote_to_bytes(s).decode(e) in terms of unescaped non-ASCII characters
    in the input.
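
    To illustrate (outputs follow the semantics just described; the literal
    '\u0100' is not expressible in Latin-1):

    >>> urllib.parse.unquote('\u0100%E9', encoding='latin-1')
    'Āé'
    >>> urllib.parse.unquote_to_bytes('\u0100%E9').decode('latin-1')
    'Ä\x80é'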

    I've also added new test cases and documentation for these two new
    functions (included in patch6).

    On an entirely personal note, can whoever checks this in please mention
    my name in the commit log - I've put in at least 30 hours researching
    and writing this patch, and I'd like for this not to go uncredited :)

    Commit log for patch6:

    Fix for bpo-3300.

    urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
    allowing the caller to determine the decoding of percent-encoded octets.
    As per RFC 3986, default is "utf-8" (previously implicitly decoded as
    ISO-8859-1).

    urllib.parse.quote: Added "encoding" and "errors" optional arguments,
    allowing the caller to determine the encoding of non-ASCII characters
    before being percent-encoded. Default is "utf-8" (previously characters
    in range(128, 256) were encoded as ISO-8859-1, and characters above that
    as UTF-8). Also characters/bytes above 128 are no longer allowed to be
    "safe". Also now allows either bytes or strings.

    Added functions urllib.parse.quote_from_bytes,
    urllib.parse.unquote_to_bytes.

    Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
    reflect new interface, added quote_from_bytes and unquote_to_bytes.

    Lib/test/test_urllib.py: Added several new test cases testing encoding
    and decoding Unicode strings with various encodings, as well as testing
    the new functions.

    Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
    Lib/test/test_wsgiref.py: Updated and added test cases to deal with
    UTF-8-encoded URIs.

    Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
    with encoding="latin-1", to preserve existing behaviour (which the whole
    email module is dependent upon).


    mgiuca mannequin commented Jul 31, 2008

    Hmm ... seems patch 6 I just checked in fails a test case! Sorry! (It's
    minor, gives a harmless BytesWarning if you run with -b, which "make
    test" does, so I only picked it up after submitting).

    I've slightly changed the code in quote so it doesn't do that any more
    (it normalises all "safe" arguments to bytes).

    Please review patch 7, not 6. Same commit log as above.

    (Also .. someone let me know if I'm not submitting patches properly,
    like perhaps I should be deleting the old ones not keeping them around?)


    janssen mannequin commented Aug 6, 2008

    Here's my version of how quote and unquote should be implemented in
    Python 3.0. I haven't looked at the uses of it in the library, but I'd
    expect improper uses (and there are lots of them) will break, and thus
    can be fixed.

    Basically, percent-quoting is about creating an ASCII string that can be
    safely used in URI from an arbitrary sequence of octets. So, my version
    of quote() takes either a byte sequence or a string, and percent-quotes
    the unsafe ones, and then returns a str. If a str is supplied on input,
    it is first converted to UTF-8, then the octets of that encoding are
    percent-quoted.

    For unquote, there's no way to tell what the octets of the quoted
    sequence may mean, so this takes the percent-quoted ASCII string, and
    returns a byte sequence with the unquoted bytes. For convenience, since
    the unquoted bytes are often a string in some particular character set
    encoding, I've also supplied unquote_as_string(), which takes an
    optional character set, and first unquotes the bytes, then converts them
    to a str, using that character set encoding, and returns the resulting
    string.
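
    A sketch of the proposed usage (illustrative only; outputs follow the
    description above, not the interface of the current patch):

    >>> quote('Mátt')                    # str -> UTF-8 octets -> quoted str
    'M%C3%A1tt'
    >>> unquote('M%C3%A1tt')             # always returns the unquoted octets
    b'M\xc3\xa1tt'
    >>> unquote_as_string('M%C3%A1tt')   # octets decoded, UTF-8 by default
    'Mátt'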


    janssen mannequin commented Aug 6, 2008

    Here's a patch to parse.py (and test/test_urllib.py) that makes the
    various tests (cgi, urllib, httplib) pass. It basically adds
    "unquote_as_string", "unquote_as_bytes", "quote_as_string",
    "quote_as_bytes", and then define the existing "quote" and "unquote" in
    terms of them.


    jimjjewett mannequin commented Aug 6, 2008

    Is there still disagreement over anything except:

    (1) The type signature of quote and unquote (as opposed to the
    explicit "quote_as_bytes" or "quote_as_string").

    (2) The default encoding (latin-1 vs UTF8), and (if UTF-8) what to do
    with invalid byte sequences?

    (3) Would waiting for 3.1 cause too many compatibility problems?


    pitrou commented Aug 6, 2008

    Bill, I haven't studied your patch in detail but a few comments:

    • it would be nice to have more unit tests, especially for the various
      bytes/unicode possibilities, and perhaps also roundtripping (Matt's
      patch has a lot of tests)
    • quote_as_bytes() should return a bytes object, not a bytearray
    • using the "%02X" format looks clearer to me than going through the
      _hextable lookup table...
    • when the argument is of the wrong type, quote_as_bytes() should raise
      a TypeError rather than a ValueError
    • why is quote_as_string() hardwired to utf8 while unquote_as_string()
      provides a charset parameter? wouldn't it be better for them to be
      consistent with each other?

    @gvanrossum

    Bill Janssen's "patch" breaks two unittests: test_email and
    test_http_cookiejar. Details for test_email:

    ======================================================================
    ERROR: test_rfc2231_bad_character_in_filename
    (email.test.test_email.TestRFC2231)
    .
    .
    .
    File "/usr/local/google/home/guido/python/py3k/Lib/urllib/parse.py",
    line 267, in unquote_as_string
    return str(unquote_as_bytes(s, plus=plus), charset, 'strict')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 13:
    unexpected end of data

    ======================================================================
    FAIL: test_rfc2231_bad_character_in_charset
    (email.test.test_email.TestRFC2231)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File
    "/usr/local/google/home/guido/python/py3k/Lib/email/test/test_email.py",
    line 3279, in test_rfc2231_bad_character_in_charset
        self.assertEqual(msg.get_content_charset(), None)
    AssertionError: 'utf-8\\u201d' != None

    Details for test_http_cookiejar:

    ======================================================================
    FAIL: test_url_encoding (main.LWPCookieTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "Lib/test/test_http_cookiejar.py", line 1454, in test_url_encoding
        self.assert_("foo=bar" in cookie and version_re.search(cookie))
    AssertionError: None


    jimjjewett mannequin commented Aug 6, 2008

    Matt pointed out that the email package assumes Latin-1 rather than UTF-8; I
    assume Bill could patch his patch the same way Matt did, and this would
    resolve the email tests. (Unless you pronounce to stick with Latin-1)

    The cookiejar failure probably has the same root cause; that test is
    encoding (non-ASCII) Latin-1 characters, and urllib.parse.py/Quoter assumes
    Latin-1.

    So I see some evidence (probably not enough) for sticking with Latin-1
    instead of UTF-8. But I don't see any evidence that fixing the semantics
    (encoded results should be bytes) at the same time made the conversion any
    more painful.

    On the other hand, Matt shows that some of those extra str->byte code
    changes might never need to be done at all, except for purity.

    @gvanrossum

    Dear GvR,

    New code review comments by GvR have been published.
    Please go to http://codereview.appspot.com/2827 to read them.

    Message:
    Hi Matt,

    Here's a code review of your patch.

    I'm leaning more and more towards wanting this for 3.0, but I have some API
    design issues and also some style nits.

    I'm cross-linking this with the Python tracker issue, through the subject.

    Details:

    http://codereview.appspot.com/2827/diff/1/2
    File Doc/library/urllib.parse.rst (right):

    http://codereview.appspot.com/2827/diff/1/2#newcode198
    Line 198: replaced by a placeholder character.
    I don't think that's a good default. I'd rather see it default to strict --
    that's what encoding translates to everywhere else. I believe that lenient
    error handling by default can cause subtle security problems too, by hiding
    problem characters from validation code.

    http://codereview.appspot.com/2827/diff/1/2#newcode215
    Line 215: An alias for :func:`quote`, intended for use with a
    :class:`bytes` object
    I'd rather see this as a wrapper that raises TypeError if the argument
    isn't a bytes or bytearray instance. Otherwise it's needless redundancy.
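
    A minimal wrapper along those lines (a sketch, not the committed code):

        def quote_from_bytes(bs, safe='/'):
            # Reject non-bytes input instead of silently aliasing quote().
            if not isinstance(bs, (bytes, bytearray)):
                raise TypeError('quote_from_bytes() expected bytes')
            return quote(bs, safe)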

    http://codereview.appspot.com/2827/diff/1/2#newcode223
    Line 223: Replace %xx escapes by their single-character equivalent.
    Should add what the argument type is -- I vote for str or bytes/bytearray.

    http://codereview.appspot.com/2827/diff/1/2#newcode242
    Line 242: .. function:: unquote_to_bytes(string)
    Again, add what the argument type is.

    http://codereview.appspot.com/2827/diff/1/4
    File Lib/email/utils.py (right):

    http://codereview.appspot.com/2827/diff/1/4#newcode224
    Line 224: except:
    An unqualified except clause is unacceptable here. Why do you need this
    anyway?

    http://codereview.appspot.com/2827/diff/1/5
    File Lib/test/test_http_cookiejar.py (right):

    http://codereview.appspot.com/2827/diff/1/5#newcode1450
    Line 1450: "%3c%3c%0Anew%C3%A5/%C3%A5",
    I'm guessing this test broke otherwise? Given that this references an
    RFC, is it correct to just fix it this way?

    http://codereview.appspot.com/2827/diff/1/3
    File Lib/urllib/parse.py (right):

    http://codereview.appspot.com/2827/diff/1/3#newcode10
    Line 10: "urlsplit", "urlunsplit"]
    Please add all the quote/unquote versions here too.
    (They were there in 2.5, but somehow got dropped from 3.0.)

    http://codereview.appspot.com/2827/diff/1/3#newcode265
    Line 265: # Maps lowercase and uppercase variants (but not mixed case).
    That sounds like a disaster. Why would %aa and %AA be correct but not
    %aA and %Aa? (Even though the old code had the same problem.)

    http://codereview.appspot.com/2827/diff/1/3#newcode283
    Line 283: def unquote(s, encoding = "utf-8", errors = "replace"):
    Please no spaces around the '=' when used for an argument default (or for a
    keyword arg).

    Also see my comment about defaulting to 'replace' in the doc file.

    Finally -- let's be consistent about quotes. It seems most of this file
    uses single quotes, so let's stick to that (except docstrings always use
    double quotes).

    And more: what should a None value for encoding or errors mean? IMO it
    should
    mean "use the default".

    http://codereview.appspot.com/2827/diff/1/3#newcode382
    Line 382: safe = safe.encode('ascii', 'ignore')
    Using errors='ignore' seems like a mistake -- it will hide errors.

    I also wonder why safe should be limited to ASCII though.

    http://codereview.appspot.com/2827/diff/1/3#newcode399
    Line 399: if ' ' in s:
    This test means that it won't work if the input is bytes. E.g.

    urllib.parse.quote_plus(b"abc def")

    raises a TypeError.

    Sincerely,

    Your friendly code review daemon (http://codereview.appspot.com/).


    jimjjewett mannequin commented Aug 6, 2008

    http://codereview.appspot.com/2827/diff/1/5#newcode1450
    Line 1450: "%3c%3c%0Anew%C3%A5/%C3%A5",
    I'm guessing this test broke otherwise?

    Yes; that is one of the breakages you found in Bill's patch. (He didn't
    modify the test.)

    Given that this references an RFC,
    is it correct to just fix it this way?

    Probably. Looking at http://www.faqs.org/rfcs/rfc2965.html

    (1) That is not among the exact tests in the RFC.
    (2) The RFC does not specify charset for the cookie in general, but the
    Comment field MUST be in UTF-8, and the only other reference I could find to
    a specific charset was "possibly in a server-selected printable ASCII
    encoding."

    Whether we have to use Latin-1 (or document charset) in practice for
    compatibility reasons, I don't know.


    mgiuca mannequin commented Aug 7, 2008

    Dear GvR,

    New code review comments by mgiuca have been published.
    Please go to http://codereview.appspot.com/2827 to read them.

    Message:
    Hi Guido,

    Thanks very much for this very detailed review. I've replied to the
    comments. I will make the changes as described below and send a new
    patch to the tracker.


    mgiuca mannequin commented Aug 7, 2008

    A reply to a point on GvR's review, I'd like to open for discussion.
    This relates to whether or not quote's "safe" argument should allow
    non-ASCII characters.

    > Using errors='ignore' seems like a mistake -- it will hide errors.
    > I also wonder why safe should be limited to ASCII though.

    The reasoning is this: if we allow non-ASCII characters to be escaped,
    then we allow quote to generate invalid URIs (URIs are only allowed to
    have ASCII characters). It's one thing for unquote to accept such URIs,
    but I think we shouldn't be producing them. Admittedly, it only produces an
    invalid URI if you explicitly request it. So I'm happy to make the
    change to allow any character to be safe, but I'll let it go to
    discussion first.


    pitrou commented Aug 7, 2008

    On Thursday 07 August 2008 at 13:42 +0000, Matt Giuca wrote:

    The reasoning is this: if we allow non-ASCII characters to be escaped,
    then we allow quote to generate invalid URIs (URIs are only allowed to
    have ASCII characters). It's one thing for unquote to accept such URIs,
    but I think we shouldn't be producing them. Admittedly, it only produces an
    invalid URI if you explicitly request it. So I'm happy to make the
    change to allow any character to be safe, but I'll let it go to
    discussion first.

    The important thing is that the defaults are safe. If users want to override
    the defaults and produce potentially invalid URIs, there is no reason to
    discourage them.


    mgiuca mannequin commented Aug 7, 2008

    The important thing is that the defaults are safe. If users want to override
    the defaults and produce potentially invalid URIs, there is no reason to
    discourage them.

    OK I think that's a fairly valid argument. I'm about to head off so I'll
    post the patch I have now, which fixes most of the other concerns. That
    change will cause havoc to quote I think ;)


    mgiuca mannequin commented Aug 7, 2008

    Following Guido and Antoine's reviews, I've written a new patch which
    fixes *most* of the issues raised. The ones I didn't fix I have noted
    below, and commented on the review site
    (http://codereview.appspot.com/2827/). Note: I intend to address all of
    these issues after some discussion.

    Outstanding issues raised by the reviews:

    Doc/library/urllib.parse.rst:
    Should unquote accept a bytes/bytearray as well as a str?

    Lib/email/utils.py:
    Should encode_rfc2231 with charset=None accept strings with non-ASCII
    characters, and just encode them to UTF-8?

    Lib/test/test_http_cookiejar.py:
    Does RFC 2965 let me get away with changing the test case to expect
    UTF-8? (I'm pretty sure it doesn't care what encoding is used).

    Lib/test/test_urllib.py:
    Should quote raise a TypeError if given a bytes with encoding/errors
    arguments? (Motivation: TypeError is what you usually raise if you
    supply too many args to a function).

    Lib/urllib/parse.py:
    (As discussed above) Should quote accept safe characters outside the
    ASCII range (thereby potentially producing invalid URIs)?

    ------

    Commit log for patch8:

    Fix for bpo-3300.

    urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
    allowing the caller to determine the decoding of percent-encoded octets.
    As per RFC 3986, default is "utf-8" (previously implicitly decoded as
    ISO-8859-1). Also fixed a bug in which mixed-case hex digits (such as
    "%aF") weren't being decoded at all.

    urllib.parse.quote: Added "encoding" and "errors" optional arguments,
    allowing the caller to determine the encoding of non-ASCII characters
    before being percent-encoded. Default is "utf-8" (previously characters
    in range(128, 256) were encoded as ISO-8859-1, and characters above that
    as UTF-8). Also characters/bytes above 128 are no longer allowed to be
    "safe". Also now allows either bytes or strings.

    Added functions urllib.parse.quote_from_bytes,
    urllib.parse.unquote_to_bytes. All quote/unquote functions now exported
    from the module.

    Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
    reflect new interface, added quote_from_bytes and unquote_to_bytes.

    Lib/test/test_urllib.py: Added many new test cases testing encoding
    and decoding Unicode strings with various encodings, as well as testing
    the new functions.

    Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
    Lib/test/test_wsgiref.py: Updated and added test cases to deal with
    UTF-8-encoded URIs.

    Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
    with encoding="latin-1", to preserve existing behaviour (which the whole
    email module is dependent upon).


    janssen mannequin commented Aug 12, 2008

    Larry Masinter is off on vacation, but I did get a brief message saying
    that he will dig up similar discussions that he was involved in when he
    gets back.

    Out of curiosity, I sent a note off to the www-international mailing
    list, and received this:

    ``For the authority (server name) portion of a URI, RFC 3986 is pretty
    clear that UTF-8 must be used for non-ASCII values (assuming, for a
    moment, that IDNA addresses are not Punycode encoded already). For the
    path portion of URIs, a large-ish proportion of them are, indeed, UTF-8
    encoded because that has been the de facto standard in Web browsers for
    a number of years now. For the query and fragment parts, however, the
    encoding is determined by context and often depends on the encoding of
    some page that contains the form from which the data is taken. Thus, a
    large number of URIs contain non-UTF-8 percent-encoded octets.''

    http://lists.w3.org/Archives/Public/www-international/2008JulSep/0041.html


    janssen mannequin commented Aug 12, 2008

    For Antoine:

    I think the problem that Barry is facing with the email package is that
    Unicode strings are an ambiguous representation of a sequence of bytes;
    that is, there are a number of different byte sequences a Unicode string
    may have come from. His ingenious use of raw-unicode-escape is an
    attempt to conform to the requirement of having to produce a string, but
    without losing any data, so that an application program can, if it needs
    to, still reprocess that string and retrieve the original data. Naive
    application programs that sort of expected the result to be an ASCII
    string will be unaffected. Not sure it's the best idea; this is all
    about just where to force unexpected runtime failures.


    janssen mannequin commented Aug 12, 2008

    Here's another thought:

    Let's put string_to_bytes and string_from_bytes into the binascii
    module, as a2b_percent and b2a_percent, respectively.

    Then parse.py would import them as

      from binascii import a2b_percent as percent_decode_as_bytes
      from binascii import b2a_percent as percent_encode_from_bytes

    and add two more functions:

      def percent_encode(s, encoding="UTF-8", error="strict", plus=False)
      def percent_decode(s, encoding="UTF-8", error="strict", plus=False)

    and would add backwards-compatible but deprecated functions for quote
    and unquote:

      def quote(s):
          warnings.warn("urllib.parse.quote should be replaced by "
                        "percent_encode or percent_encode_from_bytes",
                        FutureDeprecationWarning)
          if isinstance(s, str):
              return percent_encode(s)
          else:
              return percent_encode_from_bytes(s)

      def unquote(s):
          warnings.warn("urllib.parse.unquote should be replaced by "
                        "percent_decode or percent_decode_to_bytes",
                        FutureDeprecationWarning)
          if isinstance(s, str):
              return percent_decode(s)
          else:
              return percent_decode(str(s, "ASCII", "strict"))


    pitrou commented Aug 12, 2008

    On Tuesday 12 August 2008 at 19:37 +0000, Bill Janssen wrote:

    Let's put string_to_bytes and string_from_bytes into the binascii
    module, as a2b_percent and b2a_percent, respectively.

    Well, it's my personal opinion, but I think we should focus on a simple
    and straightforward solution for the present issue before beta3 is
    released (which is in 8 days now). It has already been difficult to find
    a (quasi-)consensus for a simple patch to adapt quote()/unquote() to the
    realities of bytes/unicode separation in py3k: witness the length of the
    present discussion.

    (perhaps a sophisticated solution could still be adopted for 3.1,
    especially if it has backwards compatibility in mind)

    @gvanrossum

    Matt Giuca <matt.giuca@gmail.com> added the comment:
    By the way, what is the current status of this bug? Is anybody waiting
    on me to do anything? (Re: Patch 9)

    I'll be reviewing it today or tomorrow. From looking at it briefly I
    worry that the implementation is pretty slow -- a method call for each
    character and a map() call sounds pretty bad.

    To recap my previous list of outstanding issues raised by the review:

    > Should unquote accept a bytes/bytearray as well as a str?
    Currently, does not. I think it's meaningless to do so (and how to
    handle >127 bytes, if so?)

    The bytes > 127 would be translated as themselves; this follows
    logically from how stuff is parsed -- %% and %FF are translated,
    everything else is not. But I don't really care, I doubt there's a
    need.

    > Lib/email/utils.py:
    > Should encode_rfc2231 with charset=None accept strings with non-ASCII
    > characters, and just encode them to UTF-8?
    Currently does. Suggestion to restrict to ASCII on the review tracker;
    simple fix.

    I think I agree with that comment; it seems wrong to return UTF8
    without setting that in the header. The alternative would be to
    default charset to utf8 if there are any non-ASCII chars in the input.
    I'd be okay with that too.

    > > Should quote raise a TypeError if given a bytes with encoding/errors
    > > arguments? (Motivation: TypeError is what you usually raise if you
    > > supply too many args to a function).
    > Resolved. Raises TypeError.

    > > Lib/urllib/parse.py:
    > > (As discussed above) Should quote accept safe characters outside the
    > > ASCII range (thereby potentially producing invalid URIs)?
    > Resolved? Implemented, but too messy and not worth it just to produce
    > invalid URIs, so NOT in patch.

    Agreed, safe should be ASCII chars only.

    > That's only two very minor yes/no issues remaining. Please comment.

    I believe patch 9 still has errors defaulting to strict for quote().
    Weren't you going to change that?

    Regarding using UTF-8 as the default encoding, I still think this the
    right thing to do -- while the tables shown by Bill indicate that
    there's still a lot of Latin-1 out there, UTF-8 is definitely gaining
    on it, and I expect that Python apps, especially Py3k apps, are much
    more likely to follow (and hopefully reinforce! :-) this trend than to
    lag behind.

    @mgiuca
    Mannequin Author

    mgiuca mannequin commented Aug 13, 2008

    > I have no strong opinion on the few remaining points you listed,
    > except that IMHO encode_rfc2231 with charset=None should not try to
    > use UTF8 by default. But someone with more mail protocol skills
    > should comment :)

    OK I've come to the realization that DEMANDING ascii (and erroring on
    non-ASCII chars) is better for the short term anyway, because we can
    always decide later to relax the restrictions, but it's a lot worse to
    add restrictions later. So I agree now, should be ASCII. And no, I don't
    have mail protocol skills.

    The same goes for unquote accepting bytes. We can decide to make it
    accept bytes later, but can't remove that feature later, so it's best
    (IMHO) to let it NOT accept bytes (which is the current behaviour).

    > The bytes > 127 would be translated as themselves; this follows
    > logically from how stuff is parsed -- %% and %FF are translated,
    > everything else is not. But I don't really care, I doubt there's a
    > need.

    Ah but what about unquote (to string)? If it accepted bytes then it
    would be a bytes->str operation, and then you need a policy on DEcoding
    those bytes. It makes things too complex I think.

    > I believe patch 9 still has errors defaulting to strict for quote().
    > Weren't you going to change that?

    I raised it as a concern, but I thought you overruled on that, so I left
    it as errors='strict'. What do you want it to be? 'replace'? Now that
    this issue has been fully discussed, I'm happy with whatever you decide.

    > From looking at it briefly I
    > worry that the implementation is pretty slow -- a method call for each
    > character and a map() call sounds pretty bad.

    Yes, it does sound pretty bad. However, that's the current way of doing
    things in both 2.x and 3.x; I didn't change it (though it looks like I
    changed a LOT, I really did try to change as little as possible!)
    Assuming it wasn't made _slower_ than before, can we ignore existing
    performance issues and treat them as a separate matter (and can be dealt
    with after 3.0)?

    I'm not putting up a new patch now. The only fix I'd make is to add
    Antoine's "or 'ascii'" to email/utils.py, as suggested on the review
    tracker. I'll make this change along with any other recommendations
    after your review.

    (That is Lib/email/utils.py line 222 becomes:
    s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii')
    )

    btw this Rietveld is amazing. I'm assuming I don't have permission to
    upload patches there (can't find any button to do so) which is why I
    keep posting them here and letting you upload to Rietveld ...

    @gvanrossum
    Member

    On Wed, Aug 13, 2008 at 7:25 AM, Matt Giuca <report@bugs.python.org> wrote:

    > > I have no strong opinion on the few remaining points you listed,
    > > except that IMHO encode_rfc2231 with charset=None should not try to
    > > use UTF8 by default. But someone with more mail protocol skills
    > > should comment :)

    > OK I've come to the realization that DEMANDING ascii (and erroring on
    > non-ASCII chars) is better for the short term anyway, because we can
    > always decide later to relax the restrictions, but it's a lot worse to
    > add restrictions later. So I agree now, should be ASCII. And no, I don't
    > have mail protocol skills.

    OK.

    > The same goes for unquote accepting bytes. We can decide to make it
    > accept bytes later, but can't remove that feature later, so it's best
    > (IMHO) to let it NOT accept bytes (which is the current behaviour).

    OK.

    > > The bytes > 127 would be translated as themselves; this follows
    > > logically from how stuff is parsed -- %% and %FF are translated,
    > > everything else is not. But I don't really care, I doubt there's a
    > > need.

    > Ah but what about unquote (to string)? If it accepted bytes then it
    > would be a bytes->str operation, and then you need a policy on DEcoding
    > those bytes. It makes things too complex I think.

    OK.

    > > I believe patch 9 still has errors defaulting to strict for quote().
    > > Weren't you going to change that?

    > I raised it as a concern, but I thought you overruled on that, so I left
    > it as errors='strict'. What do you want it to be? 'replace'? Now that
    > this issue has been fully discussed, I'm happy with whatever you decide.

    I'm OK with replace for unquote(), your point that bogus data is
    better than an exception is well taken, especially since there are
    calls that the app can't control (like in cgi.py).

    For quote() I think strict is better -- it can't fail anyway with
    UTF8, and if an app passes an explicit conversion it'd be pretty
    stupid to pass a string that can't be converted with that encoding
    (since it's presumably the app that generates both the string and the
    encoding) so it's better to complain there, just like if they made the
    encode() call themselves with only an encoding specified. This means
    we have a useful analogy: quote(s, e) == quote(s.encode(e)).
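
    For concreteness, that analogy can be checked directly with the patched
    interface (the encoding argument is the one this patch adds):

      from urllib.parse import quote

      s = 'caf\u00e9'
      # Both spellings yield 'caf%E9': encode the str first, or let
      # quote() encode it with an explicit encoding.
      assert quote(s, encoding='latin-1') == quote(s.encode('latin-1'))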

    > > From looking at it briefly I
    > > worry that the implementation is pretty slow -- a method call for each
    > > character and a map() call sounds pretty bad.

    > Yes, it does sound pretty bad. However, that's the current way of doing
    > things in both 2.x and 3.x; I didn't change it (though it looks like I
    > changed a LOT, I really did try to change as little as possible!)

    Actually, while the Quoter class (the immediate subject of my scorn)
    was there before your patch in 3.0, it isn't there in 2.x; somebody
    must have added it in 3.0 as part of the conversion to Unicode or
    perhaps as part of the restructuring of urllib.py. The 2.x code maps
    the __getitem__ of a dict over the string, which is much faster. I
    think we can do much better than mapping a method call.

    > Assuming it wasn't made _slower_ than before, can we ignore existing
    > performance issues and treat them as a separate matter (and can be dealt
    > with after 3.0)?

    Now that you've spent so much time with this patch, can't you think
    of a faster way of doing this? I wonder if mapping a defaultdict
    wouldn't work.

    > I'm not putting up a new patch now. The only fix I'd make is to add
    > Antoine's "or 'ascii'" to email/utils.py, as suggested on the review
    > tracker. I'll make this change along with any other recommendations
    > after your review.

    > (That is Lib/email/utils.py line 222 becomes:
    > s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii')
    > )

    > btw this Rietveld is amazing. I'm assuming I don't have permission to
    > upload patches there (can't find any button to do so) which is why I
    > keep posting them here and letting you upload to Rietveld ...

    Thanks! You can't upload patches to the issue that *I* created, but a
    better way would be to create a new issue and assign it to me for
    review. That will work as long as you have a gmail account or a Google
    Account. I highly recommend using the upload.py script, which you can
    download from codereview.appspot.com/static/upload.py. (There's also a
    link to it on the Create Issue page, at the bottom.)

    I am hoping that in general we will be able to use Rietveld to review
    patches instead of the bug tracker.

    @mgiuca
    Mannequin Author

    mgiuca mannequin commented Aug 13, 2008

    > I'm OK with replace for unquote() ...
    > For quote() I think strict is better

    There's just an odd inconsistency there, but it's only a tiny "gotcha";
    and I agree with all your other arguments. I'll change unquote back to
    errors='replace'.

    > This means we have a useful analogy:
    > quote(s, e) == quote(s.encode(e)).

    That's exactly true, yes.

    > Now that you've spent so much time with this patch, can't you think
    > of a faster way of doing this?

    Well firstly, you could replace Quoter (the class) with a "quoter"
    function, which is nested inside quote. Would calling a nested function
    be faster than a method call?

    > I wonder if mapping a defaultdict wouldn't work.

    That is a good idea. Then, the "function" (as I describe above) would be
    just the inside of what currently is the except block, and that would be
    the default_factory of the defaultdict. I think that should speed things up.
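
    A minimal sketch of that idea (hypothetical code, not the patch itself;
    note it overrides __missing__ rather than passing a default_factory,
    because default_factory is called with no arguments and so can't see
    which key was missing):

      from collections import defaultdict

      class Quoter(defaultdict):
          """Cache mapping a byte value to '%XX' or its literal character."""
          def __init__(self, safe):
              super().__init__()
              self.safe = safe  # byte values that pass through unescaped
          def __missing__(self, b):
              res = chr(b) if b in self.safe else '%{:02X}'.format(b)
              self[b] = res  # cached; later lookups stay in C code
              return res

      # Usage sketch: percent-encode the UTF-8 bytes of a string.
      quoter = Quoter(b'abcdefghijklmnopqrstuvwxyz'
                      b'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_.-~')
      print(''.join(quoter[b] for b in 'caf\u00e9'.encode('utf-8')))
      # -> caf%C3%A9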

    I'm very hazy about what is faster in the bytecode world of Python, and
    wary of making a change and proclaiming "this is faster!" without doing
    proper speed tests (which is why I think this optimisation could be
    delayed until at least after the core interface changes are made). But
    I'll have a go at that change tomorrow.

    (I won't be able to work on this for up to 24 hours).

    @pitrou
    Member

    pitrou commented Aug 13, 2008

    Quoting Matt Giuca <report@bugs.python.org>:

    > > Now that you've spent so much time with this patch, can't you think
    > > of a faster way of doing this?

    > Well firstly, you could replace Quoter (the class) with a "quoter"
    > function, which is nested inside quote. Would calling a nested function
    > be faster than a method call?

    The obvious speedup is to remove the map() call and do the loop inside
    Quoter.__call__ instead. That way you don't have any function or method call in
    the critical path.

    (also, defining a class with a single __call__ method is not a common Python
    idiom; usually you'd just have a function returning another (nested) function)

    As for the defaultdict, here is how it can look (this is on 2.5):

    ...  def __missing__(self, key):
    ...   print "__missing__", key
    ...   value = "%%%02X" % key
    ...   self[key] = value
    ...   return value
    ...
    >>> d = D()
    >>> d[66] = 'B'
    >>> d[66]
    'B'
    >>> d[67]
    __missing__ 67
    '%43'
    >>> d[67]
    '%43'

    @pitrou
    Member

    pitrou commented Aug 13, 2008

    Quoting Antoine Pitrou <report@bugs.python.org>:

    > As for the defaultdict, here is how it can look (this is on 2.5):

    (there should be a line here saying "class D(defaultdict)" :-))

    > ...  def __missing__(self, key):
    > ...   print "__missing__", key
    > ...   value = "%%%02X" % key
    > ...   self[key] = value
    > ...   return value

    cheers

    Antoine.

    @gvanrossum
    Member

    > > Now that you've spent so much time with this patch, can't you think
    > > of a faster way of doing this?

    > Well firstly, you could replace Quoter (the class) with a "quoter"
    > function, which is nested inside quote. Would calling a nested function
    > be faster than a method call?

    Yes, but barely.

    > > I wonder if mapping a defaultdict wouldn't work.

    > That is a good idea. Then, the "function" (as I describe above) would be
    > just the inside of what currently is the except block, and that would be
    > the default_factory of the defaultdict. I think that should speed things up.

    Yes, it would be tremendously faster, since the method would be called
    only once per byte value (for each value of 'safe'), and if that byte
    is repeated in the input, further occurrences will use the __getitem__
    function of the defaultdict, which is implemented in C.

    > I'm very hazy about what is faster in the bytecode world of Python, and
    > wary of making a change and proclaiming "this is faster!" without doing
    > proper speed tests (which is why I think this optimisation could be
    > delayed until at least after the core interface changes are made).

    That's very wise. But a first-order approximation of the speed of
    something is often "how many functions/methods implemented in Python
    (i.e. with def or lambda) does it call?"

    > But I'll have a go at that change tomorrow.

    > (I won't be able to work on this for up to 24 hours).

    That's fine, as long as we have closure before beta3, which is next Wednesday.

    @janssen
    Mannequin

    janssen mannequin commented Aug 13, 2008

    Feel free to take the function implementation from my patch, if it speeds
    things up (and it should).

    Bill

    @janssen
    Mannequin

    janssen mannequin commented Aug 13, 2008

    Erik van der Poel at Google has now chimed in with stats on current URL
    usage:

    ``...the bottom line is that escaped non-utf-8 is still quite prevalent,
    enough (in my opinion) to require an implementation in Python, possibly
    even allowing for different encodings in the path and query parts (e.g.
    utf-8 path and gb2312 query).''

    http://lists.w3.org/Archives/Public/www-international/2008JulSep/0042.html

    I think it's worth remembering that a very large proportion of the use
    of Python's urllib.unquote() is in implementations of Web server
    frameworks of one sort or another. We can't control what the browsers
    that talk to such frameworks produce; the IETF doesn't control that,
    either. In this case, "practicality beats purity" is the clarion call
    of the browser designers, and we'd better be able to support them.

    @gvanrossum
    Member

    I think we're supporting these sufficiently by allowing developers to
    override the encoding and errors value. I see no argument here against
    having a default encoding of UTF-8.
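
    A quick sketch of what that looks like with the new arguments (UTF-8
    and errors='replace' are the defaults being discussed here):

      from urllib.parse import unquote

      unquote('caf%C3%A9')                   # 'café'  -- UTF-8 default
      unquote('caf%E9', encoding='latin-1')  # 'café'  -- explicit override
      unquote('caf%E9')                      # 'caf\ufffd' -- invalid UTF-8 is
                                             #   replaced, not an exception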

    @pitrou
    Member

    pitrou commented Aug 13, 2008

    On Wednesday 13 August 2008 at 17:05 +0000, Bill Janssen wrote:

    > I think it's worth remembering that a very large proportion of the use
    > of Python's urllib.unquote() is in implementations of Web server
    > frameworks of one sort or another. We can't control what the browsers
    > that talk to such frameworks produce;

    Yes, we do. Browsers will use whatever charset is specified in the HTML
    for the query part; and, as for the path part, they shouldn't produce it
    themselves, they just follow a link which should already be
    percent-quoted in the HTML.

    (URL rewriting at the HTTP server level can make this more complicated,
    since it can turn a query fragment into a path fragment or vice-versa;
    however, most modern frameworks alleviate the need for such rewriting,
    since they allow you to specify flexible mapping rules at the framework
    level)

    The situation in which we can't control the encoding is when getting the
    URLs from third-party content (e.g. some Web page which we didn't produce
    ourselves, or some link in an e-mail). But in those cases there are fewer
    use cases for unquoting the URL rather than using it as-is. The only time
    I've wanted to unquote such a URL was to do some processing of HTTP
    referrers in order to extract which search queries had led people to
    visit a Web site.

    @janssen
    Mannequin

    janssen mannequin commented Aug 13, 2008

    Sure. What I meant was that we don't control what the browsers do, we just
    go along with what they do, that is, we try to play with the default
    understanding that's developed between the "consenting pairs" of
    Apache/Firefox or ASP/IE.

    @mgiuca
    Mannequin Author

    mgiuca mannequin commented Aug 14, 2008

    Ah, cheers Antoine, for the tip on using defaultdict (I was confused as
    to how I could access the key just by passing default_factory, as the
    manual suggests).

    @mgiuca
    Mannequin Author

    mgiuca mannequin commented Aug 14, 2008

    OK I implemented the defaultdict solution. I got curious so ran some
    rough speed tests, using the following code.

    import random, urllib.parse
    for i in range(0, 100000):
        str = ''.join(chr(random.randint(0, 0x10ffff)) for _ in range(50))
        quoted = urllib.parse.quote(str)

    Time to quote 100,000 random strings of 50 characters.
    (Ran each test twice, worst case printed)

    HEAD, chars in range(0,0x110000): 1m44.80
    HEAD, chars in range(0,256): 25.0s
    patch9, chars in range(0,0x110000): 35.3s
    patch9, chars in range(0,256): 27.4s
    New, chars in range(0,0x110000): 31.4s
    New, chars in range(0,256): 25.3s

    Head is the current Py3k head. Patch 9 is my previous patch (before
    implementing defaultdict), and New is after implementing defaultdict.

    Interesting. Defaultdict didn't really make much of an improvement. You
    can see the big help the cache itself makes, though (my code caches all
    chars, whereas the HEAD just caches ASCII chars, which is why HEAD is so
    slow on the full repertoire test). Other than that, differences are
    fairly negligible.

    However, I'll keep the defaultdict code, I quite like it, speedy or not
    (it is slightly faster).

    @pitrou
    Member

    pitrou commented Aug 14, 2008

    Hello Matt,

    > OK I implemented the defaultdict solution. I got curious so ran some
    > rough speed tests, using the following code.

    > import random, urllib.parse
    > for i in range(0, 100000):
    >     str = ''.join(chr(random.randint(0, 0x10ffff)) for _ in range(50))
    >     quoted = urllib.parse.quote(str)

    I think if you move the line defining "str" out of the loop, relative timings
    should change quite a bit. Chances are that the random functions are not very
    fast, since they are written in pure Python.
    Or you can create an inner loop around the call to quote(), for example to
    repeat it 100 times.

    cheers

    Antoine.
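
    For illustration, a sketch of that restructured measurement (the
    timeit-based harness and the 0xD7FF cap are additions here; the cap
    avoids surrogate code points, which can't be UTF-8-encoded):

      import random, timeit, urllib.parse

      # Build the inputs once, outside the timed region, so we measure
      # quote() itself rather than the pure-Python random machinery.
      samples = [''.join(chr(random.randint(0, 0xD7FF)) for _ in range(50))
                 for _ in range(1000)]

      def bench():
          for s in samples:
              urllib.parse.quote(s)

      print(timeit.timeit(bench, number=100))  # seconds for 100 x 1000 calls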

    @mgiuca
    Mannequin Author

    mgiuca mannequin commented Aug 14, 2008

    New patch (patch10). Details on Rietveld review tracker
    (http://codereview.appspot.com/2827).

    Another update on the remaining "outstanding issues":

    Resolved issues since last time:

    > Should unquote accept a bytes/bytearray as well as a str?
    No. But see below.

    > Lib/email/utils.py:
    > Should encode_rfc2231 with charset=None accept strings with non-ASCII
    > characters, and just encode them to UTF-8?
    Implemented Antoine's fix ("or 'ascii'").

    > Should quote accept safe characters outside the
    > ASCII range (thereby potentially producing invalid URIs)?
    No.

    New issues:

    unquote_to_bytes doesn't cope well with non-ASCII characters (currently
    encodes as UTF-8 - not a lot we can do since this is a str->bytes
    operation). However, we can allow it to accept a bytes as input (while
    unquote does not), and it preserves the bytes precisely.
    Discussion at http://codereview.appspot.com/2827/diff/82/84, line 265.

    I have *implemented* that suggestion - so unquote_to_bytes now accepts
    either a bytes or str, while unquote accepts only a str. No changes need
    to be made unless there is disagreement on that decision.
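
    For reference, the resulting behaviour of the two bytes-oriented
    functions added by this patch:

      from urllib.parse import quote_from_bytes, unquote_to_bytes

      quote_from_bytes(b'caf\xe9')   # 'caf%E9'  -- bytes in, str out
      unquote_to_bytes('caf%E9')     # b'caf\xe9'
      unquote_to_bytes(b'caf%E9')    # b'caf\xe9' -- bytes input accepted;
                                     #   non-ASCII bytes preserved exactly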

    I also emailed Barry Warsaw about the email/utils.py patch (because we
    weren't sure exactly what that code was doing). However, I'm sure that
    this patch isn't breaking anything there, because I call unquote with
    encoding="latin-1", which is the same behaviour as the current head.

    That's all the issues I have left over in this patch.

    Attaching patch 10 (for revision 65675).

    Commit log for patch 10:

    Fix for bpo-3300.

    urllib.parse.unquote:
    Added "encoding" and "errors" optional arguments, allowing the caller
    to determine the decoding of percent-encoded octets.
    As per RFC 3986, default is "utf-8" (previously implicitly decoded
    as ISO-8859-1).
    Fixed a bug in which mixed-case hex digits (such as "%aF") weren't
    being decoded at all.

    urllib.parse.quote:
    Added "encoding" and "errors" optional arguments, allowing the
    caller to determine the encoding of non-ASCII characters
    before being percent-encoded.
    Default is "utf-8" (previously characters in range(128, 256)
    were encoded as ISO-8859-1, and characters above that as UTF-8).
    Characters/bytes above 128 are no longer allowed to be "safe".
    Now allows either bytes or strings.
    Optimised "Quoter"; now inherits defaultdict.

    Added functions urllib.parse.quote_from_bytes,
    urllib.parse.unquote_to_bytes.
    All quote/unquote functions now exported from the module.

    Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
    reflect new interface, added quote_from_bytes and unquote_to_bytes.

    Lib/test/test_urllib.py: Added many new test cases testing encoding
    and decoding Unicode strings with various encodings, as well as testing
    the new functions.

    Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
    Lib/test/test_wsgiref.py: Updated and added test cases to deal with
    UTF-8-encoded URIs.

    Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
    with encoding="latin-1", to preserve existing behaviour (which the email
    module is dependent upon).

    @mgiuca
    Mannequin Author

    mgiuca mannequin commented Aug 14, 2008

    Antoine:

    > I think if you move the line defining "str" out of the loop, relative
    > timings should change quite a bit. Chances are that the random
    > functions are not very fast, since they are written in pure Python.

    Well I wanted to test throwing lots of different URIs to test the
    caching behaviour. You're right though, probably a small % of the time
    is spent on calling quote.

    Oh well, the defaultdict implementation is in patch10 anyway :) It
    cleans Quoter up somewhat, so it's a good thing anyway. Thanks for your
    help.

    @mgiuca
    Mannequin Author

    mgiuca mannequin commented Aug 18, 2008

    Hi,

    Sorry to bump this, but you (Guido) said you wanted this closed by
    Wednesday. Is this patch committable yet? (There are no more unresolved
    issues that I am aware of).

    @gvanrossum
    Member

    Looking into this now. Will make sure it's included in beta3.

    @gvanrossum
    Member

    Checked in patch 10 with minor style changes as r65838.

    Thanks Matt for persevering! Thanks everyone else for contributing;
    this has been quite educational.

    @pitrou
    Member

    pitrou commented Aug 20, 2008

    There's an unquote()-related failure in bpo-3613.

    @mgiuca
    Mannequin Author

    mgiuca mannequin commented Aug 20, 2008

    Thanks for pointing that out, Antoine. I just commented on that bug.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022