Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urlencode does not handle "bytes" and could easily handle alternate encodings #49718

Closed
dmahn mannequin opened this issue Mar 10, 2009 · 10 comments
Closed

urlencode does not handle "bytes" and could easily handle alternate encodings #49718

dmahn mannequin opened this issue Mar 10, 2009 · 10 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@dmahn
Copy link
Mannequin

dmahn mannequin commented Mar 10, 2009

BPO 5468
Nosy @terryjreedy, @orsenthil, @ezio-melotti
Files
  • new_urlencode.py: Updated urlencode() function with doctest
  • new_urlencode_tests.py: Additional urlencode() tests for use in test_urllib.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2010-07-03.17:59:30.311>
    created_at = <Date 2009-03-10.14:45:14.864>
    labels = ['type-bug', 'library']
    title = 'urlencode does not handle "bytes" and could easily handle alternate encodings'
    updated_at = <Date 2010-07-03.17:59:30.310>
    user = 'https://bugs.python.org/dmahn'

    bugs.python.org fields:

    activity = <Date 2010-07-03.17:59:30.310>
    actor = 'orsenthil'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2010-07-03.17:59:30.311>
    closer = 'orsenthil'
    components = ['Library (Lib)']
    creation = <Date 2009-03-10.14:45:14.864>
    creator = 'dmahn'
    dependencies = []
    files = ['13294', '13298']
    hgrepos = []
    issue_num = 5468
    keywords = ['patch']
    message_count = 10.0
    messages = ['83434', '83448', '84216', '84228', '84260', '89416', '92029', '108290', '109101', '109187']
    nosy_count = 6.0
    nosy_names = ['jhylton', 'terry.reedy', 'orsenthil', 'ezio.melotti', 'dmahn', 'milesck']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue5468'
    versions = ['Python 3.2']

    @dmahn
    Copy link
    Mannequin Author

    dmahn mannequin commented Mar 10, 2009

    urllib.parse.urlencode() uses quote_plus() extensively to create a
    complete query string, but doesn't effectively/properly take advantage
    of the flexibility built into quote_plus(). Namely:

    1. Instances of type "bytes" are not properly encoded, as str() is used
      prior to passing to quote_plus(). This creates a nonsensical string
      such as b'1234', while quote_plus() can handle these types properly if
      passed intact. The ability to encode this type is particularly useful
      for putting binary data into the query string, or for pre-encoded text
      which you may want to encode in a non-standard character encoding.

    2. Sometimes it would be desirable to encode query strings entirely in
      "latin-1" or possibly "ascii" instead of "utf-8". Adding the extra
      parameters now present on quote_plus() can easily give that extra
      functionality.

    I have attached a new version of urlencode() that provides both of the
    above fixes/enhancements. Additionally, an unused codepath in the
    existing function has been eliminated/cleaned up. Some doctests are
    included as well.

    @dmahn dmahn mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Mar 10, 2009
    @dmahn
    Copy link
    Mannequin Author

    dmahn mannequin commented Mar 10, 2009

    I also made some tests for the new code that could be added to the unit
    tests in test_urllib.py

    @jhylton
    Copy link
    Mannequin

    jhylton mannequin commented Mar 26, 2009

    I'm not sure I understand the part of the code that deals with binary
    strings. I agree the current behavior is odd. RFC 2396 says that
    non-ascii characters must be encoded as utf-8 and then percent escaped.
    In the test case you started with, you encoded b'\xa0\x24'. It doesn't
    seem like this should be allowed, since it is not valid utf-8.

    @dmahn
    Copy link
    Mannequin Author

    dmahn mannequin commented Mar 26, 2009

    Hello. Thanks for the feedback.

    With regards to RFC 2396, I see this:

    http://www.ietf.org/rfc/rfc2396.txt

    ====
    There is a second translation for some resources: the sequence of
    octets defined by a component of the URI is subsequently used to
    represent a sequence of characters. A 'charset' defines this mapping.
    There are many charsets in use in Internet protocols. For example,
    UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
    of characters in the repertoire of ISO 10646.
    ====

    To me, that text does not indicate that URLs are always encoded in
    UTF-8. It indicates that URL information may be encoded in character
    sets ('charset') other than ASCII, and when it is, the values must be
    sent as escaped values. Here, I note the specific words "many charsets
    in use" and "For example", before the reference to UTF-8.

    I have also done a few tests, and have found that in practice, browsers
    do not always encode URLs as UTF-8. This actually seems to differ as to
    what part of the URL is being encoded. For instance, my Firefox will
    encode the path portion of a URL as UTF-8, but encode the query string
    as Latin-1.

    I think that the general idea is ... URL data must be encoded into
    ASCII, but as to what the data is that is being encoded ... That may be
    of some "charset" which may be application-defined. And in the most
    general sense, I would argue that the data could simply be binary data.
    (Actually, Latin-1 pretty much uses all the codes from 0 to 255, so
    it's very much like plain binary data anyway.)

    I hope that clarifies what I am reading in RFC 2396.

    In addition, quote_plus() already handles all the cases I placed into
    urlencode(). I suppose the actual test cases may be debatable, but I
    did specifically choose tests with data which would be recognized as
    something other then UTF-8.

    Jeremy Hylton wrote:

    Jeremy Hylton <jeremy@alum.mit.edu> added the comment:

    I'm not sure I understand the part of the code that deals with binary
    strings. I agree the current behavior is odd. RFC 2396 says that
    non-ascii characters must be encoded as utf-8 and then percent escaped.
    In the test case you started with, you encoded b'\xa0\x24'. It doesn't
    seem like this should be allowed, since it is not valid utf-8.

    ----------
    nosy: +jhylton


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue5468\>


    @dmahn dmahn mannequin changed the title urlencode does not handle "bytes", and could easily handle alternate encodings urlencode does not handle "bytes", and could easily handle alternate encodings Mar 26, 2009
    @jhylton
    Copy link
    Mannequin

    jhylton mannequin commented Mar 27, 2009

    Indeed, I think I confused some other character encoding issues related
    to HTTP with the URI issue. The discussion in RFC 3986 is length and
    only occasionally clarifying for this issue. That is, it doesn't say
    anything definitive like applications are free to use any character
    encoding when decoding a URI. But I think it agrees with your
    assessment that an application is free to interpret the binary data
    however it wants, e.g. http://tools.ietf.org/html/rfc3986#section-2.1

    @jhylton jhylton mannequin self-assigned this Mar 27, 2009
    @milesck
    Copy link
    Mannequin

    milesck mannequin commented Jun 15, 2009

    parse_qs and parse_qsl should also grow encoding and errors parameters to
    pass to the underlying unquote().

    @milesck
    Copy link
    Mannequin

    milesck mannequin commented Aug 28, 2009

    I've attached a patch that provides similar functionality to Dan Mahn's
    urlencode(), as well as providing encoding and errors parameters to
    parse_qs and parse_qsl, updating the documentation to reflect the added
    parameters, and adding test cases. The implementation of urlencode() is
    not the same as dmahn's, and has a more straightforward control flow and
    less code duplication than the current implementation.

    (For the tests, I tried to match the style of the file I was adding to
    with regard to (expect, result) order, which is why it's inconsistent.)

    @terryjreedy
    Copy link
    Member

    The question of whether % escape should be limited to utf-8 or not was discussed and decided in favor of 'not' in bpo-3300, quote and unquote.

    Last December, a websig post (referenced yesterday on pydev) reported a 'problem' that would be solved by Miles' suggestion to include parse_qs and parse_qsl.

    @orsenthil
    Copy link
    Member

    I see no problem in going ahead with the suggestion proposed and the patch.

    Relevant line:
    When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.

    • This is done already in quote and quote_plus.
    • It just boils down to urlencode also providing the same facility for query strings and that was the point of this bug report.

    Jeremy, I shall go ahead with this and do the modifications, if required.

    @orsenthil orsenthil assigned orsenthil and unassigned jhylton Jul 2, 2010
    @orsenthil
    Copy link
    Member

    Fixed and Committed revision 82510 (py3k) and revision 82511 (release31-maint).

    This fixes urlencode issue. parse_qs and parse_qsl can have the same capabilities. It will be done subsequently (in another commit or issue)

    Thanks Dan for the bug report and patch.

    @orsenthil orsenthil changed the title urlencode does not handle "bytes", and could easily handle alternate encodings urlencode does not handle "bytes" and could easily handle alternate encodings Jul 3, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants