classification
Title: Add type checks to urllib.request.Request
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: CuriousLearner Nosy List: CuriousLearner, Nan Wu, ezio.melotti, martin.panter
Priority: normal Keywords: easy, patch

Created on 2015-10-19 09:31 by ezio.melotti, last changed 2018-11-21 11:51 by CuriousLearner.

Files
File name Uploaded Description Edit
urllib_request_param_type_check.patch Nan Wu, 2015-10-20 16:27 review
urllib_request_param_type_check_2.patch Nan Wu, 2015-10-20 22:44 review
urllib_request_param_type_check_3.patch Nan Wu, 2015-10-23 16:50
urllib_request_param_type_check_4.patch Nan Wu, 2015-10-31 19:01 review
Pull Requests
URL Status Linked Edit
PR 10616 open CuriousLearner, 2018-11-20 18:51
Messages (16)
msg253173 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2015-10-19 09:31
Currently urllib.request.Request seems to accept invalid types silently, only to fail later on with unhelpful errors when the request is passed to urlopen.

This might cause users to go through something similar:

>>> r = Request(b'https://www.python.org')
>>> # so far, so good
>>> urlopen(r)
Traceback (most recent call last):
  ...
urllib.error.URLError: <urlopen error unknown url type: b'https>

This unhelpful error might lead users to think https is not supported, whereas the problem is that the url should have been str, not bytes.

The same problem applies to post data:

>>> r = Request('https://www.python.org', {'post': 'data'})
>>> # so far, so good
>>> urlopen(r)
Traceback (most recent call last):
  ...
TypeError: memoryview: dict object does not have the buffer interface
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  ...
ValueError: Content-Length should be specified for iterable data of type <class 'dict'> {'post': 'data'}

This error seems to indicate that Content-Length should be specified somewhere, but AFAICT from the docs, the data should be bytes or None -- so let's try to urlencode them:

>>> r = Request('https://www.python.org', urlencode({'post': 'data'}))
>>> # so far, so good
>>> urlopen(r)
Traceback (most recent call last):
  ...
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.

OK, maybe I should use bytes in the dict:

>>> r = Request('https://www.python.org', urlencode({b'post': b'data'}))
>>> # so far, so good
>>> urlopen(r)
Traceback (most recent call last):
  ...
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.

That didn't work, I also needed to encode the output of urlencode().


Most of these problems could be prevented if Request() raised for non-str URLs, and non-bytes (and non-None) POST data.  Unless there some valid reason to accept invalid types, I think they should be rejected early.
msg253175 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-19 10:34
Type checking on the URL should be reasonably straightforward.

For the request data, the question is what types do we currently support? This may be a can of worms. We would have to be careful not to break existing use cases, even undocumented ones.

Looking at Issue 23740, the underlying “http.client” accepts a variety of data types (bytes, Latin-1 str, iterables, files), and this support varies depending if the Content-Length header is supplied. I don’t think all combinations work with urlopen(), but judging by Issue 5038, there is at least partial support for bytes-like, iterables, and file objects.
msg253176 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2015-10-19 10:44
> For the request data, the question is what types do we currently support?

This is what I was wondering as well.  The doc says "data must be a bytes object specifying additional data to send to the server, or None if no such data is needed."[0]
However one of the error messages says "POST data should be bytes or an iterable of bytes." and "Content-Length should be specified for iterable data" so iterables of bytes also seem to be supported.
One option is to simply reject str and dict (or mappings in general), since these seem to me the two most likely errors (the first in case the user forgot to encode the output of urlencode(), the second in case the user forgot urlencode() altogether and passed a dict).

[0]: https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
msg253231 - (view) Author: Nan Wu (Nan Wu) * Date: 2015-10-20 16:27
Uploaded a patch. Also update the test. Seems there are at least two old test cases using empty dict as data param. has dict been supported before?
msg253252 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-20 21:04
The iterable option is documented under the urlopen(data=...) parameter, albeit not under the Request.data attribute. IMO this isn’t right, because you need to use Request to add Content-Length, but that is a separate documentation issue.

Technically, an empty dict() and a dictionary of bytes keys are both iterables of bytes. I admit it is a strange case though; this is one of those canned worms I mentioned. Example:

headers = {"Content-Length": 6}
data = OrderedDict(((b"abc", 1), (b"def", 2))))
urlopen(Request("http://localhost/", headers=headers, data=data))
# Sends request data b"abcdef"

Using the annotate function <https://hg.python.org/cpython/annotate/0a0aafaa9bf5/Lib/test/test_urllib.py#l1167> reveals that your empty dict() tests were added by revision 0a0aafaa9bf5. I suspect it was an accident that happened to work, which is really an argument for diagnosing passing in a dict, although as I mentioned this is technically breaking compatibility.
msg253267 - (view) Author: Nan Wu (Nan Wu) * Date: 2015-10-20 22:44
I see. Empty dict is considered as iterable of bytes makes good sense. I just left the old tests there. And explicitly give warnings when data field is of type str or type dict with non-bytes key.
msg253519 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-27 05:18
There is already a check for request.data being str in do_request_(). Maybe it is okay to remove this check; I think it would be equivalent to the new check.

Regarding the dict check, here is a strange class that currently works, but would be broken by the patch. Maybe it is okay to break it though, since it certainly goes against the urlopen() documentation.

class ReaderMapping(dict):
    def __init__(self):
        super().__init__({"abc": "xyz"})
        self.mode = "r"
        self.data = "123"
    def read(self, size):
        data = self.data
        self.data = ""
        return data

urlopen(Request("http://localhost/", headers={"Content-Length": 3}, data=ReaderMapping()))

If we wanted to continue to support this kind of usage, perhaps the safest option would be to change the test to “if type(data) is dict” rather than use isinstance().

I also left a comment about eliminating urlencode() in the test.
msg253585 - (view) Author: Nan Wu (Nan Wu) * Date: 2015-10-28 03:32
The do_request_() method which is defined in AbstractHTTPHandler seems not cover the check at least for the first case Ezio brought up. `unknown_open` has been called and gives out a relatively confusing message.
msg253586 - (view) Author: Nan Wu (Nan Wu) * Date: 2015-10-28 04:26
Will fix the other two issues. Thanks.
msg253587 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-28 05:07
I meant removing one of the checks in AbstractHTTPHandler.do_request_(). I left a note in the review at the relevant line.

One more thing I forgot to mention, although I am afraid it is 90% nit picking the dict iterable check. For the “http:” scheme, an iterable of “bytes-like” objects is supported (again not documented though). For HTTPS (over SSL), not every bytes-like object is supported, but common ones like bytearray() still work I think.
msg253802 - (view) Author: Nan Wu (Nan Wu) * Date: 2015-10-31 19:01
Martin: Sorry for missing that line.

Under https, byte iterable seems has not been supported:
>>> r = Request('https://www.python.org', {b'post': 'data'}, {'Content-Length':10})
>>> urlopen(r)

... hanging here...

Meanwhile, I assumed bytearray works as expected.
>>> r = Request('https://www.python.org', bytearray('post=data', 'utf-8'))
>>> urlopen(r)
....
....
urllib.error.HTTPError: HTTP Error 403: FORBIDDEN

In *4.patch, I updated Request class doc to add these observations and fixed issues appeared in last patch.
msg253814 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-31 22:44
The bytes-like thing applies equally well to both the direct and iterable cases. Your first attempt hangs because the client only sends the four bytes in b"post", but the server is waiting for a total of 10 bytes.

The SSL problem doesn’t show up unless you use more obscure types like ctypes or array.array, where len() does not work as expected. Here is a demo:

>>> import ctypes
>>> byteslike = ctypes.c_char(b"x")
>>> urlopen("http://python.org/", data=byteslike)  # Succeeds
<http.client.HTTPResponse object at 0xb6bc876c>
>>> urlopen("https://python.org/", data=byteslike)
  File "/usr/lib/python3.4/ssl.py", line 720, in sendall
    amount = len(data)
TypeError: object of type 'c_char' has no len()

During handling of the above exception, another exception occurred:

TypeError: data should be a bytes-like object or an iterable, got <class 'ctypes.c_char'>
msg253815 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-31 23:15
This illustrates my concern with the dict-of-bytes check:

>>> headers = {"Content-Length": "3"}
>>> byteslike_iterable = {memoryview(b"123"): None}
>>> urlopen(Request("http://python.org/", headers=headers, data=byteslike_iterable))

With the current patch I think this would become an error. But maybe I am being too nitpicky and paranoid. Ezio?
msg257266 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2016-01-01 02:49
> But maybe I am being too nitpicky and paranoid. Ezio?

IIUC your last example used to work because technically it is an iterable of bytes, but the patch would give an error instead.  Even in the unlikely case someone is relying on this behavior, simply adding .keys() not only should make the code work again, but also make it more explicit and understandable.
msg330136 - (view) Author: Sanyam Khurana (CuriousLearner) * (Python triager) Date: 2018-11-20 17:19
I'm working on applying the patch cleanly to master branch for this.

Also adding 3.8 and 3.7 as candidates for the bug.
msg330192 - (view) Author: Sanyam Khurana (CuriousLearner) * (Python triager) Date: 2018-11-21 11:51
PR is up for a review: https://github.com/python/cpython/pull/10616
History
Date User Action Args
2018-11-21 11:51:33CuriousLearnersetmessages: + msg330192
2018-11-20 18:51:43CuriousLearnersetpull_requests: + pull_request9862
2018-11-20 17:19:13CuriousLearnersetmessages: + msg330136
versions: + Python 3.7, Python 3.8
2018-11-20 17:18:04CuriousLearnersetassignee: CuriousLearner

nosy: + CuriousLearner
2016-01-01 02:49:47ezio.melottisetmessages: + msg257266
2015-10-31 23:15:51martin.pantersetmessages: + msg253815
2015-10-31 22:44:01martin.pantersetmessages: + msg253814
2015-10-31 19:01:15Nan Wusetfiles: + urllib_request_param_type_check_4.patch

messages: + msg253802
2015-10-28 05:07:07martin.pantersetmessages: + msg253587
2015-10-28 04:26:56Nan Wusetmessages: + msg253586
2015-10-28 03:32:23Nan Wusetmessages: + msg253585
2015-10-27 05:18:42martin.pantersetmessages: + msg253519
stage: needs patch -> patch review
2015-10-23 16:50:39Nan Wusetfiles: + urllib_request_param_type_check_3.patch
2015-10-20 22:44:18Nan Wusetfiles: + urllib_request_param_type_check_2.patch

messages: + msg253267
2015-10-20 21:04:58martin.pantersetmessages: + msg253252
2015-10-20 16:27:39Nan Wusetfiles: + urllib_request_param_type_check.patch

nosy: + Nan Wu
messages: + msg253231

keywords: + patch
2015-10-19 10:44:49ezio.melottisetmessages: + msg253176
2015-10-19 10:34:15martin.pantersetnosy: + martin.panter
messages: + msg253175
2015-10-19 09:31:23ezio.melotticreate