Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urlopen URL with unescaped space #59031

Closed
wichert mannequin opened this issue May 16, 2012 · 25 comments
Closed

urlopen URL with unescaped space #59031

wichert mannequin opened this issue May 16, 2012 · 25 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@wichert
Copy link
Mannequin

wichert mannequin commented May 16, 2012

BPO 14826
Nosy @gpshead, @orsenthil, @pitrou, @tiran, @merwok, @vadmium, @miss-islington
PRs
  • bpo-30458: Disallow control chars in http URLs. #12755
  • bpo-14826: document that URLopener quotes fullurl. #12758
  • [3.7] bpo-14826: document that URLopener quotes fullurl. (GH-12758) #12759
  • [3.6] bpo-14826: document that URLopener quotes fullurl. (GH-12758) #12760
  • Files
  • urllib-quote-14826.patch
  • urllib-request.patch: Followup patch for urllib-quote-14826.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = None
    created_at = <Date 2012-05-16.08:54:33.063>
    labels = ['type-feature', 'library']
    title = 'urlopen URL with unescaped space'
    updated_at = <Date 2019-04-10.09:30:36.137>
    user = 'https://bugs.python.org/wichert'

    bugs.python.org fields:

    activity = <Date 2019-04-10.09:30:36.137>
    actor = 'miss-islington'
    assignee = 'orsenthil'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2012-05-16.08:54:33.063>
    creator = 'wichert'
    dependencies = []
    files = ['26311', '26314']
    hgrepos = []
    issue_num = 14826
    keywords = ['patch']
    message_count = 24.0
    messages = ['160811', '160930', '164945', '164955', '164957', '164958', '164966', '164969', '164973', '164975', '164976', '164980', '164981', '164982', '165023', '165046', '165047', '165049', '165050', '255525', '268985', '339836', '339839', '339841']
    nosy_count = 12.0
    nosy_names = ['gregory.p.smith', 'orsenthil', 'pitrou', 'jerub', 'christian.heimes', 'wichert', 'eric.araujo', 'antlong', 'rosslagerwall', 'python-dev', 'martin.panter', 'miss-islington']
    pr_nums = ['12755', '12758', '12759', '12760']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue14826'
    versions = ['Python 3.6']

    @wichert
    Copy link
    Mannequin Author

    wichert mannequin commented May 16, 2012

    There appears to be an odd networking issue with how urllib2 sends HTTP requests. Downloading an image from maw.liquifire.com gives an error:

    $ python -c 'import urllib2 ; urllib2.urlopen("http://maw.liquifire.com/maw?set=image[2302.000.13314 a]&call=url[file:325x445]")'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.7/urllib2.py", line 400, in open
        response = self._open(req, data)
      File "/usr/lib/python2.7/urllib2.py", line 418, in _open
        '_open', req)
      File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
        r = h.getresponse(buffering=True)
      File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
        response.begin()
      File "/usr/lib/python2.7/httplib.py", line 407, in begin
        version, status, reason = self._read_status()
      File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
        line = self.fp.readline()
      File "/usr/lib/python2.7/socket.py", line 447, in readline
        data = self._sock.recv(self._rbufsize)
    socket.error: [Errno 104] Connection reset by peer

    Downloading the same image using wget works fine:

    $ wget 'http://maw.liquifire.com/maw?set=image[2302.000.13314 a]&call=url[file:325x445]'             
    --2012-05-16 10:53:27--  http://maw.liquifire.com/maw?set=image[2302.000.13314%20a]&call=url[file:325x445]
    Resolving maw.liquifire.com (maw.liquifire.com)... 184.169.78.6
    Connecting to maw.liquifire.com (maw.liquifire.com)|184.169.78.6|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 11393 (11K) [image/jpeg]
    Saving to: `maw?set=image[2302.000.13314 a]&call=url[file:325x445]'

    100%[======================================>] 11,393 --.-K/s in 0.003s

    2012-05-16 10:53:27 (3.49 MB/s) - `maw?set=image[2302.000.13314 a]&call=url[file:325x445]' saved [11393/11393]

    @wichert wichert mannequin added the stdlib Python modules in the Lib dir label May 16, 2012
    @antlong
    Copy link
    Mannequin

    antlong mannequin commented May 16, 2012

    http://maw.liquifire.com/maw?set=image[2302.000.13314%20a]&call=url[file:325x445] works properly. Notice the %20 instead of ' '

    @jerub
    Copy link
    Mannequin

    jerub mannequin commented Jul 7, 2012

    Here is a patch that uses the same quoting logic in urllib.request.Request.__init__ as is used by urllib.request.URLopener.open()

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jul 8, 2012

    New changeset 01c8d800efd2 by Senthil Kumaran in branch '3.2':
    Fix bpo-14826 - make urllib.request.Request quoted url consistent with URLOpener open method.
    http://hg.python.org/cpython/rev/01c8d800efd2

    New changeset e6bb919b2623 by Senthil Kumaran in branch 'default':
    Fix bpo-14826 - make urllib.request.Request quoted url consistent with URLOpener open method.
    http://hg.python.org/cpython/rev/e6bb919b2623

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jul 8, 2012

    New changeset d931a3b64fd6 by Senthil Kumaran in branch '2.7':
    Fix bpo-14826 - make urllib.request.Request quoted url consistent with URLOpener open method.
    http://hg.python.org/cpython/rev/d931a3b64fd6

    @orsenthil
    Copy link
    Member

    Thanks for the patch, Stephen.

    @rosslagerwall
    Copy link
    Mannequin

    rosslagerwall mannequin commented Jul 8, 2012

    @rosslagerwall rosslagerwall mannequin reopened this Jul 8, 2012
    @rosslagerwall rosslagerwall mannequin assigned orsenthil Jul 8, 2012
    @jerub
    Copy link
    Mannequin

    jerub mannequin commented Jul 8, 2012

    Here's a followup patch that fixes the trunk build for me.

    This will unbreak the builds as well as fixing this bug, but it should be investigated why URLopener calls to_bytes() and Request does not. Ideally this interface should be consistent.

    @orsenthil
    Copy link
    Member

    It seems to me that toBytes in urllib was introduce to restrict the
    allowance of urls which were sent as unicode strings. We wanted urls
    to be ascii strings in Python2.

    http://mail.python.org/pipermail/python-bugs-list/2000-November/002779.html

    And quoting to toBytes / to_bytes is actually the problem here, as
    cookielib test cases is sending a unicode character which ascii
    encoding fails to operate on. I am thinking that we should arrive at a
    solution which brings consistency and fixes any previous mistakes. In
    3.3, I think, rework of to_bytes may also be a good solution, in 2.7
    and 3.2, I think stephen's attached patch is in good lines.
    Practically, the quote is more important than the failure at toBytes
    by sending an unicode url.

    @merwok
    Copy link
    Member

    merwok commented Jul 8, 2012

    I’m not sure urllib should accept invalid (non-escaped) URLs; a higher-level application can do so, but for the low-level stdlib module it is more debatable.

    @orsenthil
    Copy link
    Member

    Yeah, I am thinking so as well in that case, the test_cookielib.py
    test case may need a change.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jul 8, 2012

    New changeset ee1828dc3bf6 by Senthil Kumaran in branch '3.2':
    bpo-14826 - Address the buildbot failure ( explanation msg164973)
    http://hg.python.org/cpython/rev/ee1828dc3bf6

    New changeset dc30111a5d7e by Senthil Kumaran in branch 'default':
    bpo-14826 - Address the buildbot failure quote of url is the required change ( explanation msg164973)
    http://hg.python.org/cpython/rev/dc30111a5d7e

    New changeset 224b27a8d9be by Senthil Kumaran in branch '2.7':
    revert the changes done in d931a3b64fd6 - buildbot failure.
    http://hg.python.org/cpython/rev/224b27a8d9be

    @orsenthil
    Copy link
    Member

    The last change should settle the buildbots, But I would like to come
    back to this issue again tomorrow with focus - 3.3 to see if we can
    deal with removing to_bytes and then in 2.7 to see if something can
    done to test_cookielib.py test case.

    @pitrou
    Copy link
    Member

    pitrou commented Jul 8, 2012

    Senthil, do you read python-dev? I think this change was prematurate from the start (nevermind the fact that you didn't run the test suite before committing).

    For example, if you have an URL with a non-ASCII domain name such as "http://وزارة-الأتصالات.مصر/", the domain name should IDNA-encoded, not %-encoded like the rest.

    Furthermore, some people are certainly already quoting their URLs to workaround this issue, so "fixing" it will break their code by double-escaping the URLs. You've got to be more careful.

    @tiran
    Copy link
    Member

    tiran commented Jul 8, 2012

    The docs [1] state that url should be a string containing a valid URL. An URL with a space ' ' is not a valid URL as the space must be quoted as %20. The brackets may also cause problems as they are not valid xs:anyURI chars.

    I vote for reverting the chances as they break the API. You could improve the docs and emphasize that URLs must be quoted correctly as the module doesn't implement browser magic.

    [1] http://docs.python.org/py3k/library/urllib.request.html#urllib.request.Request

    @orsenthil
    Copy link
    Member

    On Sun, Jul 8, 2012 at 2:30 AM, Antoine Pitrou <report@bugs.python.org> wrote:

    Senthil, do you read python-dev? I think this change was prematurate from the start (nevermind the fact that you didn't run the test suite before committing).

    I thought that the other legacy URLOpen was quoting it correct and
    then I wanted to see it can be made consistent.
    It did get me thinking that why it was different for so long. I
    realize that committing soon was a mistake.

    For example, if you have an URL with a non-ASCII domain name such as "http://وزارة-الأتصالات.مصر/", the domain name should IDNA-encoded, not %-encoded like the rest.

    Agreed and understood.

    Furthermore, some people are certainly already quoting their URLs to workaround this issue, so "fixing" it will break their code by double-escaping the URLs. You've got to be more careful.

    Oh. yes, the change may break an already quoted URL. I think, I shall
    revert this back.

    @orsenthil
    Copy link
    Member

    On Sun, Jul 8, 2012 at 9:42 AM, Christian Heimes <report@bugs.python.org> wrote:

    I vote for reverting the chances as they break the API. You could improve the docs and emphasize that URLs must be quoted correctly as the module doesn't implement browser magic.

    Okay. But I do realize that in 3.3, we may have a FancyURLOpener /
    URLOpener 's open method, which is not directly called by the apis,
    but they seem to have quote behavior. I guess, I approached this
    change as to making them consistent, but realize it is mistake, for
    the reasons that you state and Antoine state.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jul 9, 2012

    New changeset ebd37273e0fe by Senthil Kumaran in branch '3.2':
    revert the changes done for bpo-14826 - quoting witin Request is not desirable.
    http://hg.python.org/cpython/rev/ebd37273e0fe

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jul 9, 2012

    New changeset a4bdb637d818 by Senthil Kumaran in branch 'default':
    revert the changes done for bpo-14826 - quoting witin Request is not desirable.
    http://hg.python.org/cpython/rev/a4bdb637d818

    @vadmium
    Copy link
    Member

    vadmium commented Nov 28, 2015

    FWIW urlopen() already handles space characters in the Location target of redirects; see HTTPRedirectHandler.redirect_request(). So I think it is reasonable to handle space characters in user-supplied URLs also, if it is done properly.

    @vadmium
    Copy link
    Member

    vadmium commented Jun 21, 2016

    I think this should be treated as a feature, not a bug, since as Christian said, the documentation currently does not support this case.

    @vadmium vadmium changed the title urllib2.urlopen fails to load URL urlopen URL with unescaped space Jun 21, 2016
    @vadmium vadmium added the type-feature A feature request or enhancement label Jun 21, 2016
    @gpshead
    Copy link
    Member

    gpshead commented Apr 10, 2019

    urllib.request.URLopener() and FancyURLopener() automatically quote() URLs for the user. Those APIs are marked deprecated since 3.3 but have no timeline for removal.

    urllib.request.urlopen() does not use those, so URLs passed in are not auto-quoted.

    i'll clarify the docs for URLopener.

    @gpshead
    Copy link
    Member

    gpshead commented Apr 10, 2019

    New changeset 2fb2bc8 by Gregory P. Smith in branch 'master':
    bpo-14826: document that URLopener quotes fullurl. (GH-12758)
    2fb2bc8

    @miss-islington
    Copy link
    Contributor

    New changeset 9d2ccf1 by Miss Islington (bot) in branch '3.7':
    bpo-14826: document that URLopener quotes fullurl. (GH-12758)
    9d2ccf1

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @hauntsaninja
    Copy link
    Contributor

    It seems there's nothing left to be done here? We've documented the behaviour of the (deprecated) URLopener

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    8 participants