msg253173 - (view) |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2015-10-19 09:31 |
Currently urllib.request.Request seems to accept invalid types silently, only to fail later on with unhelpful errors when the request is passed to urlopen.
This might cause users to go through something similar:
>>> r = Request(b'https://www.python.org')
>>> # so far, so good
>>> urlopen(r)
Traceback (most recent call last):
...
urllib.error.URLError: <urlopen error unknown url type: b'https>
This unhelpful error might lead users to think https is not supported, whereas the problem is that the url should have been str, not bytes.
The same problem applies to post data:
>>> r = Request('https://www.python.org', {'post': 'data'})
>>> # so far, so good
>>> urlopen(r)
Traceback (most recent call last):
...
TypeError: memoryview: dict object does not have the buffer interface
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
...
ValueError: Content-Length should be specified for iterable data of type <class 'dict'> {'post': 'data'}
This error seems to indicate that Content-Length should be specified somewhere, but AFAICT from the docs, the data should be bytes or None -- so let's try to urlencode them:
>>> r = Request('https://www.python.org', urlencode({'post': 'data'}))
>>> # so far, so good
>>> urlopen(r)
Traceback (most recent call last):
...
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.
OK, maybe I should use bytes in the dict:
>>> r = Request('https://www.python.org', urlencode({b'post': b'data'}))
>>> # so far, so good
>>> urlopen(r)
Traceback (most recent call last):
...
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.
That didn't work, I also needed to encode the output of urlencode().
Most of these problems could be prevented if Request() raised for non-str URLs, and non-bytes (and non-None) POST data. Unless there some valid reason to accept invalid types, I think they should be rejected early.
|
msg253175 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2015-10-19 10:34 |
Type checking on the URL should be reasonably straightforward.
For the request data, the question is what types do we currently support? This may be a can of worms. We would have to be careful not to break existing use cases, even undocumented ones.
Looking at Issue 23740, the underlying “http.client” accepts a variety of data types (bytes, Latin-1 str, iterables, files), and this support varies depending if the Content-Length header is supplied. I don’t think all combinations work with urlopen(), but judging by Issue 5038, there is at least partial support for bytes-like, iterables, and file objects.
|
msg253176 - (view) |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2015-10-19 10:44 |
> For the request data, the question is what types do we currently support?
This is what I was wondering as well. The doc says "data must be a bytes object specifying additional data to send to the server, or None if no such data is needed."[0]
However one of the error messages says "POST data should be bytes or an iterable of bytes." and "Content-Length should be specified for iterable data" so iterables of bytes also seem to be supported.
One option is to simply reject str and dict (or mappings in general), since these seem to me the two most likely errors (the first in case the user forgot to encode the output of urlencode(), the second in case the user forgot urlencode() altogether and passed a dict).
[0]: https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
|
msg253231 - (view) |
Author: Nan Wu (Nan Wu) * |
Date: 2015-10-20 16:27 |
Uploaded a patch. Also update the test. Seems there are at least two old test cases using empty dict as data param. has dict been supported before?
|
msg253252 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2015-10-20 21:04 |
The iterable option is documented under the urlopen(data=...) parameter, albeit not under the Request.data attribute. IMO this isn’t right, because you need to use Request to add Content-Length, but that is a separate documentation issue.
Technically, an empty dict() and a dictionary of bytes keys are both iterables of bytes. I admit it is a strange case though; this is one of those canned worms I mentioned. Example:
headers = {"Content-Length": 6}
data = OrderedDict(((b"abc", 1), (b"def", 2))))
urlopen(Request("http://localhost/", headers=headers, data=data))
# Sends request data b"abcdef"
Using the annotate function <https://hg.python.org/cpython/annotate/0a0aafaa9bf5/Lib/test/test_urllib.py#l1167> reveals that your empty dict() tests were added by revision 0a0aafaa9bf5. I suspect it was an accident that happened to work, which is really an argument for diagnosing passing in a dict, although as I mentioned this is technically breaking compatibility.
|
msg253267 - (view) |
Author: Nan Wu (Nan Wu) * |
Date: 2015-10-20 22:44 |
I see. Empty dict is considered as iterable of bytes makes good sense. I just left the old tests there. And explicitly give warnings when data field is of type str or type dict with non-bytes key.
|
msg253519 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2015-10-27 05:18 |
There is already a check for request.data being str in do_request_(). Maybe it is okay to remove this check; I think it would be equivalent to the new check.
Regarding the dict check, here is a strange class that currently works, but would be broken by the patch. Maybe it is okay to break it though, since it certainly goes against the urlopen() documentation.
class ReaderMapping(dict):
def __init__(self):
super().__init__({"abc": "xyz"})
self.mode = "r"
self.data = "123"
def read(self, size):
data = self.data
self.data = ""
return data
urlopen(Request("http://localhost/", headers={"Content-Length": 3}, data=ReaderMapping()))
If we wanted to continue to support this kind of usage, perhaps the safest option would be to change the test to “if type(data) is dict” rather than use isinstance().
I also left a comment about eliminating urlencode() in the test.
|
msg253585 - (view) |
Author: Nan Wu (Nan Wu) * |
Date: 2015-10-28 03:32 |
The do_request_() method which is defined in AbstractHTTPHandler seems not cover the check at least for the first case Ezio brought up. `unknown_open` has been called and gives out a relatively confusing message.
|
msg253586 - (view) |
Author: Nan Wu (Nan Wu) * |
Date: 2015-10-28 04:26 |
Will fix the other two issues. Thanks.
|
msg253587 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2015-10-28 05:07 |
I meant removing one of the checks in AbstractHTTPHandler.do_request_(). I left a note in the review at the relevant line.
One more thing I forgot to mention, although I am afraid it is 90% nit picking the dict iterable check. For the “http:” scheme, an iterable of “bytes-like” objects is supported (again not documented though). For HTTPS (over SSL), not every bytes-like object is supported, but common ones like bytearray() still work I think.
|
msg253802 - (view) |
Author: Nan Wu (Nan Wu) * |
Date: 2015-10-31 19:01 |
Martin: Sorry for missing that line.
Under https, byte iterable seems has not been supported:
>>> r = Request('https://www.python.org', {b'post': 'data'}, {'Content-Length':10})
>>> urlopen(r)
... hanging here...
Meanwhile, I assumed bytearray works as expected.
>>> r = Request('https://www.python.org', bytearray('post=data', 'utf-8'))
>>> urlopen(r)
....
....
urllib.error.HTTPError: HTTP Error 403: FORBIDDEN
In *4.patch, I updated Request class doc to add these observations and fixed issues appeared in last patch.
|
msg253814 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2015-10-31 22:44 |
The bytes-like thing applies equally well to both the direct and iterable cases. Your first attempt hangs because the client only sends the four bytes in b"post", but the server is waiting for a total of 10 bytes.
The SSL problem doesn’t show up unless you use more obscure types like ctypes or array.array, where len() does not work as expected. Here is a demo:
>>> import ctypes
>>> byteslike = ctypes.c_char(b"x")
>>> urlopen("http://python.org/", data=byteslike) # Succeeds
<http.client.HTTPResponse object at 0xb6bc876c>
>>> urlopen("https://python.org/", data=byteslike)
File "/usr/lib/python3.4/ssl.py", line 720, in sendall
amount = len(data)
TypeError: object of type 'c_char' has no len()
During handling of the above exception, another exception occurred:
TypeError: data should be a bytes-like object or an iterable, got <class 'ctypes.c_char'>
|
msg253815 - (view) |
Author: Martin Panter (martin.panter) * |
Date: 2015-10-31 23:15 |
This illustrates my concern with the dict-of-bytes check:
>>> headers = {"Content-Length": "3"}
>>> byteslike_iterable = {memoryview(b"123"): None}
>>> urlopen(Request("http://python.org/", headers=headers, data=byteslike_iterable))
With the current patch I think this would become an error. But maybe I am being too nitpicky and paranoid. Ezio?
|
msg257266 - (view) |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2016-01-01 02:49 |
> But maybe I am being too nitpicky and paranoid. Ezio?
IIUC your last example used to work because technically it is an iterable of bytes, but the patch would give an error instead. Even in the unlikely case someone is relying on this behavior, simply adding .keys() not only should make the code work again, but also make it more explicit and understandable.
|
msg330136 - (view) |
Author: Sanyam Khurana (CuriousLearner) * |
Date: 2018-11-20 17:19 |
I'm working on applying the patch cleanly to master branch for this.
Also adding 3.8 and 3.7 as candidates for the bug.
|
msg330192 - (view) |
Author: Sanyam Khurana (CuriousLearner) * |
Date: 2018-11-21 11:51 |
PR is up for a review: https://github.com/python/cpython/pull/10616
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:22 | admin | set | github: 69625 |
2018-11-21 11:51:33 | CuriousLearner | set | messages:
+ msg330192 |
2018-11-20 18:51:43 | CuriousLearner | set | pull_requests:
+ pull_request9862 |
2018-11-20 17:19:13 | CuriousLearner | set | messages:
+ msg330136 versions:
+ Python 3.7, Python 3.8 |
2018-11-20 17:18:04 | CuriousLearner | set | assignee: CuriousLearner
nosy:
+ CuriousLearner |
2016-01-01 02:49:47 | ezio.melotti | set | messages:
+ msg257266 |
2015-10-31 23:15:51 | martin.panter | set | messages:
+ msg253815 |
2015-10-31 22:44:01 | martin.panter | set | messages:
+ msg253814 |
2015-10-31 19:01:15 | Nan Wu | set | files:
+ urllib_request_param_type_check_4.patch
messages:
+ msg253802 |
2015-10-28 05:07:07 | martin.panter | set | messages:
+ msg253587 |
2015-10-28 04:26:56 | Nan Wu | set | messages:
+ msg253586 |
2015-10-28 03:32:23 | Nan Wu | set | messages:
+ msg253585 |
2015-10-27 05:18:42 | martin.panter | set | messages:
+ msg253519 stage: needs patch -> patch review |
2015-10-23 16:50:39 | Nan Wu | set | files:
+ urllib_request_param_type_check_3.patch |
2015-10-20 22:44:18 | Nan Wu | set | files:
+ urllib_request_param_type_check_2.patch
messages:
+ msg253267 |
2015-10-20 21:04:58 | martin.panter | set | messages:
+ msg253252 |
2015-10-20 16:27:39 | Nan Wu | set | files:
+ urllib_request_param_type_check.patch
nosy:
+ Nan Wu messages:
+ msg253231
keywords:
+ patch |
2015-10-19 10:44:49 | ezio.melotti | set | messages:
+ msg253176 |
2015-10-19 10:34:15 | martin.panter | set | nosy:
+ martin.panter messages:
+ msg253175
|
2015-10-19 09:31:23 | ezio.melotti | create | |