classification
Title: urllib.parse.urlparse accepts any falsy value as an url
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: demian.brecht, luiz.poleto, martin.panter, orsenthil, r.david.murray, serhiy.storchaka, ztane
Priority: normal Keywords: patch

Created on 2014-08-20 12:39 by ztane, last changed 2016-05-06 20:52 by serhiy.storchaka.

Files
File name Uploaded Description Edit
issue22234_36.patch luiz.poleto, 2016-04-26 04:24 Patch with the deprecation warning to be applied on Python 3.6. review
issue22234_37.patch luiz.poleto, 2016-04-26 04:27 Changes to urlparse to handle non-str and non-byte arguments review
urlparse_empty_bad_arg_deprecation.patch serhiy.storchaka, 2016-04-26 11:50 review
urlparse_empty_bad_arg_disallow.patch serhiy.storchaka, 2016-04-26 12:30 review
urlparse_empty_bad_arg_deprecation2.patch serhiy.storchaka, 2016-04-26 13:22 review
Messages (17)
msg225566 - (view) Author: Antti Haapala (ztane) * Date: 2014-08-20 12:39
Because of "if x else ''" in _decode_args (http://hg.python.org/cpython/file/3.4/Lib/urllib/parse.py#l96), urllib.parse.urlparse accepts any falsy value as an url, returning a ParseResultBytes with all members set to empty bytestrings.

Thus you get:

    >>> urllib.parse.urlparse({})
    ParseResultBytes(scheme=b'', netloc=b'', path=b'', params=b'', query=b'', fragment=b'')

which may result in some very confusing exceptions later on: I had a list of URLs that accidentally contained some Nones and got very confusing TypeErrors while processing the results expecting them to be strings.

If the `if x else ''` part were removed, such invalid falsy values would fail with `AttributeError: 'foo' object has no attribute 'decode'`, as happens with any truthy invalid value.
msg225832 - (view) Author: Antti Haapala (ztane) * Date: 2014-08-24 18:10
On Python 2.7 urlparse.urlparse, parsing None, () or 0 will throw AttributeError because these classes do not have any 'find' method. [] has the find method, but will fail with TypeError, because the built-in caching requires that the input be hashable.
msg260671 - (view) Author: Antti Haapala (ztane) * Date: 2016-02-22 11:59
I believe `urlparse` should throw a `TypeError` if not isinstance(url, (str, bytes))
msg264023 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-04-22 15:43
I just posted about this on the mentors list, where someone brought this issue up with a question about our policy on type checking.  The short version is the better (preserves duck-typing) and more backward compatibile fix is to change the test to be

   x != ''

That will result in an attribute error on 'decode' for values of the incorrect type.

But even that should go through a deperecation cycle, since there may be working programs depending on the current behavior.  It's worth fixing, though, because of the error propogation you report.  I also suggested a rewrite of _check_args to get a better error message that would indeed be a type error, and I'm anticipating someone from the mentors list will turn that into a patch here.
msg264210 - (view) Author: Luiz Poleto (luiz.poleto) * Date: 2016-04-26 04:24
As discussed on the Mentors list, the attached patch (issue22234_36.patch) includes the deprecation warning (and related test) on the urlparse function.
msg264211 - (view) Author: Luiz Poleto (luiz.poleto) * Date: 2016-04-26 04:27
As discussed on the Mentors list, the attached patch (issue22234_37.patch) changes the urlparse function to handle non-str and non-bytes arguments and adds a new test case for it.
msg264217 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-26 05:55
`x != ''` emits BytesWarning if x is bytes. 'encode' attribute is not needed for URL parsing. any() is slower that a `for` loop.

I would suggest to look at efficient os.path implementations.
msg264219 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-04-26 06:39
Luiz: Your _36 patch looks like it adds an unconditional warning whenever urlparse() is called, but I would have expected it to depend on the type of the “url” parameter.

There are related functions that seem to accept false values like None in Python 3, but not in Python 2. Perhaps they should also be considered with any changes:

urlsplit(None)
parse_qs(None)
parse_qsl(None)
urldefrag(None)

Also, I wonder if we should continue to accept bytearray as well as bytes. Bytearray has a decode() method.
msg264261 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-26 11:50
Here is a patch that deprecates empty non-str and non-decodable arguments for urlparse, urlsplit, urlunparse, urlunsplit, urldefrag, and parse_qsl.
msg264269 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-26 12:30
And here is simpler patch that just disallows bad arguments without deprecation.
msg264278 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-26 13:22
Updated patch addresses Martin's comment.
msg264344 - (view) Author: Luiz Poleto (luiz.poleto) * Date: 2016-04-27 03:50
I am seeing some results when running urlparse with patch urlparse_empty_bad_arg_deprecation2.patch applied:

>>> urllib.parse.urlparse({})
__main__:1: DeprecationWarning: Use of {} is deprecated
__main__:1: DeprecationWarning: Use of '' is deprecated
ParseResultBytes(scheme=b'', netloc=b'', path=b'', params=b'', query=b'', fragment=b'')

>>> urllib.parse.urlparse('', b'')
__main__:1: DeprecationWarning: Use of b'' is deprecated
/home/poleto/SCMws/python/latest/cpython/Lib/urllib/parse.py:378: DeprecationWarning: Use of b'' is deprecated
  splitresult = urlsplit(url, scheme, allow_fragments)
ParseResult(scheme=b'', netloc='', path='', params='', query='', fragment='')
Will bytes be deprecated if used as a default_schema?

>>> urllib.parse.urlparse(b'', '')
ParseResultBytes(scheme=b'', netloc=b'', path=b'', params=b'', query=b'', fragment=b'')
Shouldn't it complain that the types are different? In fact it does, if you don't provide empty strings:

>>> urllib.parse.urlparse(b'www.python.org', 'http')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "(...)/cpython/Lib/urllib/parse.py", line 377, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "(...)/cpython/Lib/urllib/parse.py", line 120, in _coerce_args
    raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments

>>> urllib.parse.urlparse({'a' : 1})
__main__:1: DeprecationWarning: Use of '' is deprecated
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "(...)/cpython/Lib/urllib/parse.py", line 377, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "(...)/cpython/Lib/urllib/parse.py", line 128, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "(...)/cpython/Lib/urllib/parse.py", line 98, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "(...)/cpython/Lib/urllib/parse.py", line 98, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'dict' object has no attribute 'decode'

>>> urllib.parse.urlparse(['a', 'b', 'c'])
__main__:1: DeprecationWarning: Use of [] is deprecated
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "(...)/cpython/Lib/urllib/parse.py", line 377, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "(...)/cpython/Lib/urllib/parse.py", line 128, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "(...)/cpython/Lib/urllib/parse.py", line 98, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "(...)/cpython/Lib/urllib/parse.py", line 98, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'list' object has no attribute 'decode'

I thought about writing test cases but I wasn't a 100% sure if the above is working as expected so I thought I should ask first.
msg264346 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-04-27 03:59
Regarding urlparse(b'', ''). Currently the second parameter is “scheme”, which is documented as being an empty text string by default. If we deprecate this, we should update the documentation.
msg264347 - (view) Author: Luiz Poleto (luiz.poleto) * Date: 2016-04-27 04:00
As for urlparse_empty_bad_arg_disallow.patch, I didn't go too deep into testing it but I found that calling urlparse with different non-str args are producing different results:

urlparse({})
TypeError: unhashable type: 'slice'

urlparse([])
AttributeError: 'list' object has no attribute 'decode'

urlparse(())
AttributeError: 'tuple' object has no attribute 'decode'

I thought they should all raise a TypeError but again, I am not sure it is working as expected by the patch's author.
msg264357 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2016-04-27 06:39
Serhiy:

I left review comments on the patch too. 
I agree to "tightening" of the input arg type in these urlparse functions. 

Before we for the next version, I think, it will be helpful to enumerate the behavior for wrong arg types for these functions that you would like to see.

1. Invalid formats like {}, [], None could just be TypeError.

2. The mix of bytes and str should be deprecated and we could give the suggestion for the encouraged single type in the deprecation warning.

3. Any thing else w.r.t to special rules for various parts of url.

In general, if we are going with the deprecation cycle, we would as well go with deciding on what to allow and present it in a simple way.
msg264427 - (view) Author: Antti Haapala (ztane) * Date: 2016-04-28 11:02
I do not believe there is code that would depend on `urlparse(urlstring={})` *not* throwing an error, since `{}` obviously is neither a URL, nor a string.

Further down the documentation explicitly states that

> The URL parsing functions were originally designed to operate on 
> character strings only. In practice, it is useful to be able to 
> manipulate properly quoted and encoded URLs as sequences of ASCII 
> bytes. Accordingly, the URL parsing functions in this module all 
> operate on bytes and bytearray objects in addition to str objects.

As the documentation does not state that it should work on any other objects, there shouldn't be any code that should be deprecated. Furthermore even in 3.5, the `bool(datetime.time(0, 0)) == False` was removed without any deprecation cycle, despite it having been a documented feature for more than a decade (unlike this one).

And IMHO not giving an object of expected type should result in a TypeError.
msg265030 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-05-06 20:52
Thank you Luiz for your testing.

> __main__:1: DeprecationWarning: Use of '' is deprecated

It is bad that a warning is emitted for default value.

> Will bytes be deprecated if used as a default_schema?

No, only using empty bytes schema with string url is deprecated (because it works now). Using non-empty bytes schema with string url just causes an error.

> Shouldn't it complain that the types are different?

This special case is left for compatibility with wrappers.

> __main__:1: DeprecationWarning: Use of [] is deprecated

The warning should not be emitted for the value that the user did not provide.

If go by the way of strong deprecation, the patch needs reworking. But this is a way of overcomplication.

Since as pointed Antti, only using str, bytes and bytearray is documented, I think we can ignore the breakage for other types.
History
Date User Action Args
2016-05-06 20:52:39serhiy.storchakasetmessages: + msg265030
2016-04-28 11:02:54ztanesetmessages: + msg264427
2016-04-27 06:39:03orsenthilsetmessages: + msg264357
2016-04-27 04:00:07luiz.poletosetmessages: + msg264347
2016-04-27 03:59:23martin.pantersetmessages: + msg264346
2016-04-27 03:50:06luiz.poletosetmessages: + msg264344
2016-04-26 13:23:00serhiy.storchakasetfiles: + urlparse_empty_bad_arg_deprecation2.patch

messages: + msg264278
2016-04-26 12:57:17anish.shahsetnosy: - anish.shah
2016-04-26 12:30:05serhiy.storchakasetfiles: + urlparse_empty_bad_arg_disallow.patch

messages: + msg264269
2016-04-26 11:50:10serhiy.storchakasetfiles: + urlparse_empty_bad_arg_deprecation.patch

messages: + msg264261
2016-04-26 06:39:02martin.pantersetmessages: + msg264219
stage: patch review
2016-04-26 05:55:15serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg264217
2016-04-26 04:27:05luiz.poletosetfiles: + issue22234_37.patch

messages: + msg264211
2016-04-26 04:24:42luiz.poletosetfiles: + issue22234_36.patch
keywords: + patch
messages: + msg264210

versions: + Python 3.6, - Python 3.4
2016-04-22 15:43:50r.david.murraysetnosy: + r.david.murray
messages: + msg264023
2016-04-22 03:29:52luiz.poletosetnosy: + luiz.poleto
2016-02-22 16:43:07anish.shahsetnosy: + anish.shah
2016-02-22 11:59:17ztanesetmessages: + msg260671
2014-08-24 18:10:36ztanesetmessages: + msg225832
2014-08-21 22:36:14martin.pantersetnosy: + martin.panter
2014-08-20 21:51:05demian.brechtsetnosy: + demian.brecht
2014-08-20 15:17:17berker.peksagsetnosy: + orsenthil
2014-08-20 12:39:51ztanecreate