http.client.HTTPConnection.putrequest encode error #61416

Closed
MiZou mannequin opened this issue Feb 16, 2013 · 17 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@MiZou
Mannequin

MiZou mannequin commented Feb 16, 2013

BPO 17214
Nosy @orsenthil, @tiran, @ezio-melotti, @berkerpeksag, @vadmium, @vajrasky
Files
  • patch_to_urllib_handle_non_ascii_char_in_url.txt
  • issue17214.patch
  • issue17214.redirect.v2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2016-05-16.09:44:51.646>
    created_at = <Date 2013-02-16.10:52:39.782>
    labels = ['type-bug', 'library']
    title = 'http.client.HTTPConnection.putrequest encode  error'
    updated_at = <Date 2016-05-16.09:44:51.644>
    user = 'https://bugs.python.org/MiZou'

    bugs.python.org fields:

    activity = <Date 2016-05-16.09:44:51.644>
    actor = 'martin.panter'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-05-16.09:44:51.646>
    closer = 'martin.panter'
    components = ['Library (Lib)']
    creation = <Date 2013-02-16.10:52:39.782>
    creator = 'Mi.Zou'
    dependencies = []
    files = ['30964', '30976', '40907']
    hgrepos = []
    issue_num = 17214
    keywords = ['patch']
    message_count = 17.0
    messages = ['182216', '182218', '182679', '193183', '193279', '193286', '193306', '193312', '193346', '193352', '239648', '253409', '253410', '253771', '265521', '265682', '265690']
    nosy_count = 11.0
    nosy_names = ['orsenthil', 'christian.heimes', 'ezio.melotti', 'python-dev', 'berker.peksag', 'martin.panter', 'Mi.Zou', 'vajrasky', 'LDTech', 'Uche Ogbuji', 'Strecke']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue17214'
    versions = ['Python 3.5', 'Python 3.6']

    @MiZou
    Mannequin Author

    MiZou mannequin commented Feb 16, 2013

    While urllib was following a redirect (302),
    http.client.HTTPConnection.putrequest raised an error:
    #

    File "D:\Program Files\Python32\lib\http\client.py", line 1004, in _send_request
    self.putrequest(method, url, **skips)
    File "D:\Program Files\Python32\lib\http\client.py", line 868, in putrequest
    self._output(request.encode('ascii'))
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 108-111: ordinal not in range(128)
    #----------------------------------------------------------
    In the source code I found, at line 811:
    def putrequest(self, method, url, skip_host=0,skip_accept_encoding=0)...
    The url argument may be a Unicode string, and it has not been percent-quoted.

    I think we should replace:
    request = '%s %s %s' % (method,url,self._http_vsn_str)
    with:
    import urllib.parse
    request = '%s %s %s' % (method,urllib.parse.quote(url),self._http_vsn_str)
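
The failure and the proposed quoting can be reproduced without a network connection. This is only a minimal sketch: the path and the variable names are made up for illustration and are not taken from http.client.

```python
from urllib.parse import quote

method = "GET"
http_version = "HTTP/1.1"
url = "/p/K\u00e4fer.rar"  # hypothetical path containing a non-ASCII character

# Current behaviour: the request line cannot be encoded as ASCII.
request = "%s %s %s" % (method, url, http_version)
try:
    request.encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Proposed behaviour: percent-quote the URL first (quote() uses UTF-8 by default).
quoted_request = "%s %s %s" % (method, quote(url), http_version)
encoded = quoted_request.encode("ascii")
print(encode_failed, encoded)
```

Note that quote() with its default arguments also escapes characters such as "?" and "%", so applying it to a selector that has a query string or is already percent-encoded would change the request.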

    @MiZou MiZou mannequin added the topic-unicode label Feb 16, 2013
    @MiZou MiZou mannequin closed this as completed Feb 16, 2013
    @MiZou MiZou mannequin added the invalid label Feb 16, 2013
    @MiZou MiZou mannequin changed the title urllib.client.HTTPConnection.putrequest encode error http.client.HTTPConnection.putrequest encode error Feb 16, 2013
    @MiZou
    Mannequin Author

    MiZou mannequin commented Feb 16, 2013

    While urllib was following a redirect (302),
    http.client.HTTPConnection.putrequest raised an error:
    #

    ...
    File "D:\Program Files\Python32\lib\http\client.py", line 1004, in _send_request
    self.putrequest(method, url, **skips)
    File "D:\Program Files\Python32\lib\http\client.py", line 868, in putrequest
    self._output(request.encode('ascii'))
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 108-111: ordinal not in range(128)
    #----------------------------------------------------------
    In the source code I found, at line 811:

    def putrequest(self, method, url, skip_host=0,skip_accept_en...)
    ...

    The url argument may be a Unicode string, and it has not been percent-quoted.
    ----------------------------note----------------------------------------
    in my case:
    ...
    purl="http://bbs.dospy.com/1111258attachdown.php?aid=14361277&bbsid=349"
    req=urllib.request.Request(purl,headers=headers)
    response=urllib.request.urlopen(req)
    ...

    Then the HTTP server redirected me to a file download URL,
    and that URL contains some Chinese characters.
    I printed out the url argument:

    /f/1ba1f70606223af2aa5c3aeff6c6a46a/511f7b4c/day_111015/20111015_5949e996881b2e28403d26Ch6dOfj6LZ.rar/p/ÒâÁÖ03-08.part1.rar

    @MiZou MiZou mannequin reopened this Feb 16, 2013
    @MiZou MiZou mannequin removed the invalid label Feb 16, 2013
    @terryjreedy
    Member

    Please give us

    1. the exact Python version used. 3.2.3? or something earlier?
    2. A minimal but complete example that we can run. What is 'headers'?
    3. The complete traceback, not just the last two entries.
    4. The result of running with the newer 3.3.0, if you possibly can. Perhaps the problem has already been fixed.

    While line numbers have changed, even in 3.2.4 in repository, 3.2-3.4 all have

        request = '%s %s %s' % (method, url, self._http_vsn_str)
        # Non-ASCII characters should have been eliminated earlier
        self._output(request.encode('ascii'))
    

    Since there is nothing earlier in the function that would eliminate non-ascii, there must be an assumption about what happens earlier in the call chain. That might have already been fixed, which is why we need an example to test.

    @terryjreedy terryjreedy added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error and removed topic-unicode labels Feb 22, 2013
    @LDTech
    Mannequin

    LDTech mannequin commented Jul 16, 2013

    This problem still exists in Python 3.3.2. The following code gives an example:

    import urllib.request
    url = "http://www.libon.it/libon/search/isbn/3499155443"
    req = urllib.request.Request(url)
    response = urllib.request.urlopen(req, timeout=30)
    the_page = response.read().decode('utf-8')
    print(the_page)

    Traceback (most recent call last):
      File "C:\X\webpy.py", line 4, in <module>
        response = urllib.request.urlopen(req, timeout=30)
      File "C:\Python33\lib\urllib\request.py", line 156, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Python33\lib\urllib\request.py", line 475, in open
        response = meth(req, response)
      File "C:\Python33\lib\urllib\request.py", line 587, in http_response
        'http', request, response, code, msg, hdrs)
      File "C:\Python33\lib\urllib\request.py", line 507, in error
        result = self._call_chain(*args)
      File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
        result = func(*args)
      File "C:\Python33\lib\urllib\request.py", line 692, in http_error_302
        return self.parent.open(new, timeout=req.timeout)
      File "C:\Python33\lib\urllib\request.py", line 469, in open
        response = self._open(req, data)
      File "C:\Python33\lib\urllib\request.py", line 487, in _open
        '_open', req)
      File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
        result = func(*args)
      File "C:\Python33\lib\urllib\request.py", line 1268, in http_open
        return self.do_open(http.client.HTTPConnection, req)
      File "C:\Python33\lib\urllib\request.py", line 1248, in do_open
        h.request(req.get_method(), req.selector, req.data, headers)
      File "C:\Python33\lib\http\client.py", line 1061, in request
        self._send_request(method, url, body, headers)
      File "C:\Python33\lib\http\client.py", line 1089, in _send_request
        self.putrequest(method, url, **skips)
      File "C:\Python33\lib\http\client.py", line 953, in putrequest
        self._output(request.encode('ascii'))
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 78-79: ordinal not in range(128)

    @vajrasky
    Mannequin

    vajrasky mannequin commented Jul 18, 2013

    The script for demonstrating the bug can be simplified to:

    -----------------------------------------------------------------------

    import urllib.request
    url = "http://www.libon.it/ricerca/7817940/3499155443/dettaglio/3102314/Onkel-Oswald-und-der-Sudan-Käfer/order/date_desc"
    
    req = urllib.request.Request(url)
    response = urllib.request.urlopen(req, timeout=30)
    the_page = response.read().decode('utf-8')
    print(the_page)

    I have attached a simple patch that solves this problem.

    The question is whether we should fix this in urllib at all, because strictly speaking a URL should contain only ASCII characters. But if Firefox can open this URL, why not urllib?

    I will think about this, and if I (or other people) conclude that urllib should handle URLs containing non-ASCII characters, I will add a unit test.

    Until then, people can use a third-party package, namely the
    requests package from http://docs.python-requests.org/en/latest/

    ----------------------------------------------------------------

    import requests

    r = requests.get("http://www.libon.it/ricerca/7817940/3499155443/dettaglio/3102314/Onkel-Oswald-und-der-Sudan-Käfer/order/date_desc")
    print(r.text)

    @tiran
    Member

    tiran commented Jul 18, 2013

    The problem may not be a bug but a deliberate design choice. urllib is rather low level and doesn't implement some browser magic. Browsers handle things like 'ä' -> '%C3%A4', ' ' -> '%20', or IDNA, but urllib doesn't. I have always seen it as my responsibility to quote and encode everything myself. Higher-level APIs such as requests are free to implement browser magic.

    Contrary to common belief, a URL with an umlaut or space is *not* a valid URI. From http://docs.python.org/3/library/urllib.request.html#urllib.request.Request

    url should be a string containing a valid URL.

    I suggest that this ticket be closed as "won't fix".
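
For readers who do want to perform those conversions by hand, the three cases mentioned above can be sketched with the standard library alone. The path is adapted from the example URL in this thread, and the host name is a made-up placeholder.

```python
from urllib.parse import quote

# Path with an umlaut and a space (adapted from the URL in this thread).
path = "/Onkel-Oswald-und-der-Sudan-K\u00e4fer/order date"

# quote() percent-encodes using UTF-8 and leaves "/" alone by default.
quoted_path = quote(path)

# Hypothetical internationalized host name, converted with the built-in idna codec.
host = "b\u00fccher.example"
ascii_host = host.encode("idna").decode("ascii")

print(quoted_path)
print(ascii_host)
```

Quoting the path and converting the host are separate steps because the two parts of a URL follow different encoding rules.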

    @vajrasky
    Mannequin

    vajrasky mannequin commented Jul 18, 2013

    I have no problem if this ticket is classified as "won't fix".

    I am writing this for the confused souls who want to use urllib to access a URL containing non-ASCII characters:

    import urllib.request
    from urllib.parse import quote
    url = "http://www.libon.it/ricerca/7817940/3499155443/dettaglio/3102314/Onkel-Oswald-und-der-Sudan-Käfer/order/date_desc"
    
    req = urllib.request.Request(url)
    try:
        req.selector.encode('ascii')
    except UnicodeEncodeError:
        req.selector = quote(req.selector)
    response = urllib.request.urlopen(req, timeout=30)
    the_page = response.read().decode('utf-8')
    print(the_page)

    @LDTech
    Mannequin

    LDTech mannequin commented Jul 18, 2013

    The problem isn't the originally requested URL, as it is legitimate. The problem appears after the 302 redirect, when a new (malformed) URL is received from the server. There needs to be some kind of check of the validity of that second URL and, preferably, a URLError raised if something is wrong.
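
One possible shape for such a check is sketched below. The helper name check_redirect_target is hypothetical and is not part of urllib; it only illustrates turning the low-level UnicodeEncodeError into a URLError that names the offending URL.

```python
import urllib.error


def check_redirect_target(newurl):
    """Hypothetical helper: reject redirect targets that are not pure ASCII."""
    try:
        newurl.encode("ascii")
    except UnicodeEncodeError as exc:
        # Report the problem in terms of the URL rather than the codec.
        raise urllib.error.URLError(
            "redirect target contains non-ASCII characters: %r" % newurl
        ) from exc
    return newurl


# An ASCII-only target passes through unchanged.
ok = check_redirect_target("http://www.example.com/K%C3%A4fer")

# A target with a raw non-ASCII character is rejected.
try:
    check_redirect_target("http://www.example.com/K\u00e4fer")
    rejected = False
except urllib.error.URLError:
    rejected = True
```

A redirect handler could call a helper like this before opening the new URL, so that callers see an error mentioning the redirect target rather than an encoding failure deep inside http.client.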

    @vajrasky
    Mannequin

    vajrasky mannequin commented Jul 19, 2013

    Lars, I see.

    For the uninitiated, the issue is that the original URL (containing only ASCII characters) redirects to a URL containing non-ASCII characters, which upsets urllib.

    To handle that situation, you can do something like this:
    ---------------------

    import urllib.request
    from urllib.parse import quote
    url = "http://www.libon.it/libon/search/isbn/3499155443"
    req = urllib.request.Request(url)
    req.selector = quote(req.selector)
    response = urllib.request.urlopen(req, timeout=30)
    the_page = response.read().decode('utf-8')
    print(the_page)

    I admit that this code is clunky and not Pythonic.

    I also believe that the Python standard library should have a module for accessing URLs containing non-ASCII characters in an easy manner.

    At the very least, maybe we can give a proper error message. Something like this would be nice:

    "The url is not valid and contains non-ascii character: http://www.libon.it/ricerca/7817940/3499155443/dettaglio/3102314/Onkel-Oswald-und-der-Sudan-Käfer/order/date_desc. This url is redirected from this url: http://www.libon.it/libon/search/isbn/3499155443"

    Otherwise users can be confused: they gave urllib an ASCII-only URL (http://www.libon.it/libon/search/isbn/3499155443), so why did they get an encoding error?

    What do you say, Christian?

    @tiran
    Member

    tiran commented Jul 19, 2013

    Something else is going on here. A valid server never returns a URL with non-ASCII chars. Your test server does the right thing, too:

    $ LC_ALL=C wget http://www.libon.it/libon/search/isbn/3499155443
    --2013-07-19 11:01:54--  http://www.libon.it/libon/search/isbn/3499155443
    Resolving www.libon.it (www.libon.it)... 83.103.59.131
    Connecting to www.libon.it (www.libon.it)|83.103.59.131|:80... connected.
    HTTP request sent, awaiting response... 302 Moved Temporarily
    Location: http://www.libon.it/ricerca/7818684/3499155443/dettaglio/3102314/Onkel-Oswald-und-der-Sudan-K%C3%A4fer/order/date_desc [following]
    Incomplete or invalid multibyte sequence encountered
    --2013-07-19 11:01:54--  http://www.libon.it/ricerca/7818684/3499155443/dettaglio/3102314/Onkel-Oswald-und-der-Sudan-K%C3%A4fer/order/date_desc
    Reusing existing connection to www.libon.it:80.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]

    I have dug through the code, and now I think I know what's going on here. The header parsing code unquotes and converts the Location header. The code in the 302 handler doesn't compensate and therefore fails.

    Here is a patch that corrects the code in the 302 function.
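
The effect on a non-ASCII Location value can be illustrated offline. The sketch below assumes that the header bytes are decoded as ISO-8859-1 before the redirect handler sees them; the path is shortened from the example in this thread.

```python
# Bytes a server might send in a Location header (UTF-8, not percent-encoded).
raw_location = "/dettaglio/Onkel-Oswald-und-der-Sudan-K\u00e4fer".encode("utf-8")

# Assumption: the header value is decoded as ISO-8859-1, one character per byte.
decoded = raw_location.decode("iso-8859-1")

# The two UTF-8 bytes of the umlaut are now two separate non-ASCII characters,
# so building an ASCII request line from this string fails.
try:
    decoded.encode("ascii")
    error_span = None
except UnicodeEncodeError as exc:
    error_span = (exc.start, exc.end - 1)

print(repr(decoded), error_span)
```

Under that assumption, a single umlaut yields a two-character error range, which is consistent with the "position 78-79" reported in the traceback earlier in this thread.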

    @vadmium
    Member

    vadmium commented Mar 31, 2015

    I think this patch needs a test. I left some comments on Rietveld as well. Perhaps there should also be a test to prove that redirects to URLs like /spaced%20path/ do not get mangled.

    Have a look at the HTTPRedirectHandler.redirect_request() method. Perhaps the code translating spaces to %20 could be merged with the fix for this issue.

    @Strecke
    Mannequin

    Strecke mannequin commented Oct 24, 2015

    The bpo-17214 patch did fix this issue in my 3.4.2 install on Ubuntu LTS.

    However, it triggered another bug:

    File "/usr/local/lib/python3.4/urllib/request.py", line 646, in http_error_302
    path = urlparts.path if urlpaths.path else "/"
    NameError: name 'urlpaths' is not defined

    This is obviously a typo.

    I'm not sure if that one has been reported yet (a short google search didn't find anything) and I don't know how to provoke it independently.

    @Strecke
    Mannequin

    Strecke mannequin commented Oct 24, 2015

    I should have looked more closely.

    The typo is part of the patch. It should be corrected there.

    @vadmium
    Member

    vadmium commented Oct 30, 2015

    This bug only applies to Python 3. In Python 2, the non-ASCII bytes are sent through to the redirect target verbatim. I think this would also be the ideal way to handle the problem in 3, but percent-encoding them as proposed also seems good enough, and does not require hacking the HTTPConnection.putrequest() internals.

    My patch updates Christian’s patch:

    • Tested, so hopefully no typos :)
    • Add test cases based on bpo-22248, as well as a URL already including a percent sign
    • Process entire URL, not just the path component. A non-ASCII byte could just as easily be in the query component, for example.
    • Remove redundant encoding of space character from redirect_request() method.
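
The approach in these bullet points can be sketched as follows. This is an illustration, not a copy of the patch: the helper name encode_redirect_target and the example URL are hypothetical, and the sketch assumes the redirect URL was decoded from the header as ISO-8859-1.

```python
import string
from urllib.parse import quote


def encode_redirect_target(newurl):
    """Hypothetical helper: percent-encode non-ASCII bytes in a whole redirect URL."""
    # Encoding as ISO-8859-1 recovers the original header bytes. Treating all
    # ASCII punctuation as safe keeps "%", "/", "?", "&" and "=" untouched, so
    # existing escapes such as %20 are not encoded a second time.
    return quote(newurl, encoding="iso-8859-1", safe=string.punctuation)


# Redirect target as it might look after ISO-8859-1 decoding of UTF-8 bytes.
location = "http://www.example.com/spaced%20path/K\u00c3\u00a4fer?q=\u00c3\u00a4"
fixed = encode_redirect_target(location)
print(fixed)
```

Because the whole URL is processed, non-ASCII bytes in the query component are handled the same way as those in the path, and a literal space is also converted to %20.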

    @vadmium
    Member

    vadmium commented May 14, 2016

    I will look at committing this soon.

    @python-dev
    Mannequin

    python-dev mannequin commented May 16, 2016

    New changeset cb09fdef19f5 by Martin Panter in branch '3.5':
    Issue bpo-17214: Percent-encode non-ASCII bytes in redirect targets
    https://hg.python.org/cpython/rev/cb09fdef19f5

    New changeset 841a9a3f3cf6 by Martin Panter in branch 'default':
    Issue bpo-14132, Issue bpo-17214: Merge two redirect handling fixes from 3.5
    https://hg.python.org/cpython/rev/841a9a3f3cf6

    @vadmium
    Member

    vadmium commented May 16, 2016

    I restored the “redundant” encoding of space, in case someone’s code was relying on this behaviour, and because redirect_request() is a publicly documented method.

    @vadmium vadmium closed this as completed May 16, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022