classification
Title: Inconsistent/undocumented urlsplit/urlparse behavior on invalid inputs
Type: behavior
Stage: patch review
Components: Documentation, Library (Lib)
Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Howie Benefiel, docs@python, orsenthil, rhettinger, vfaronov
Priority: normal
Keywords:

Created on 2017-02-25 19:45 by vfaronov, last changed 2017-04-15 00:23 by berker.peksag.

Pull Requests
URL      Status  Linked
PR 1128  open    python-dev, 2017-04-14 05:38
Messages (3)
msg288577 - Author: Vasiliy Faronov (vfaronov) Date: 2017-02-25 19:45
There is a problem with the standard library's urlsplit and urlparse functions, in Python 2.7 (module urlparse) and 3.2+ (module urllib.parse).

The documentation for these functions [1] does not explain how they behave when given an invalid URL.

One could try invoking them manually and conclude that they tolerate anything thrown at them:

>>> urlparse('http:////::\\\\!!::!!++///')
ParseResult(scheme='http', netloc='', path='//::\\\\!!::!!++///',
params='', query='', fragment='')

>>> urlparse(os.urandom(32).decode('latin-1'))
ParseResult(scheme='', netloc='', path='\x7f¼â1gdä»6\x82', params='',
query='', fragment='\n\xadJ\x18+fli\x9cÛ\x9ak*ÄÅ\x02³F\x85Ç\x18')

Without studying the source code, it is impossible to know that there is a very narrow class of inputs on which they raise ValueError [2]:

>>> urlparse('http://[')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/urllib/parse.py", line 295, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python3.5/urllib/parse.py", line 345, in urlsplit
    raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL
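
In practice this means that callers who want predictable behavior have to guard against both outcomes themselves. A minimal sketch of such a guard (safe_urlsplit is purely illustrative here, not an existing or proposed API):

from urllib.parse import urlsplit

def safe_urlsplit(url):
    # urlsplit raises ValueError only for a narrow class of inputs,
    # such as 'http://[' ("Invalid IPv6 URL"); everything else is
    # split on a best-effort basis.
    try:
        return urlsplit(url)
    except ValueError:
        return None

With a wrapper like this, 'http://[' is handled the same way as any other garbage input instead of propagating an exception.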

This could be viewed as a documentation issue. But it could also be viewed as an implementation issue. Instead of raising ValueError on those square brackets, urlsplit could simply consider them *invalid* parts of an RFC 3986 reg-name, and lump them into netloc, as it already does with other *invalid* characters:

>>> urlparse('http://\0\0æí\n/')
ParseResult(scheme='http', netloc='\x00\x00æí\n', path='/', params='',
query='', fragment='')
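
Under that alternative, the square-bracket input above would presumably come out along these lines (hypothetical output, not what the current implementation produces):

>>> urlparse('http://[')
ParseResult(scheme='http', netloc='[', path='', params='', query='',
fragment='')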

Note that the raising behavior was introduced in Python 2.7/3.2.

See also issue 8721 [3].


[1] https://docs.python.org/3/library/urllib.parse.html
[2] https://github.com/python/cpython/blob/e32ec93/Lib/urllib/parse.py#L406-L408
[3] http://bugs.python.org/issue8721
msg288959 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-03-04 05:15
A note in the docs would be useful.  This API is far too well established to make any behavioral changes at this point.
msg291640 - Author: Howie Benefiel (Howie Benefiel) * Date: 2017-04-14 05:07
I'm going to make a note in the documentation. I should have a PR for it in about 1 day.
History
Date User Action Args
2017-04-15 00:23:13  berker.peksag   set     stage: needs patch -> patch review; versions: + Python 3.5
2017-04-14 05:38:02  python-dev      set     pull_requests: + pull_request1263
2017-04-14 05:07:23  Howie Benefiel  set     nosy: + Howie Benefiel; messages: + msg291640
2017-03-04 05:15:17  rhettinger      set     nosy: + rhettinger; messages: + msg288959
2017-03-03 21:21:09  terry.reedy     set     nosy: + orsenthil; stage: needs patch; versions: - Python 3.3, Python 3.4, Python 3.5
2017-02-25 19:45:27  vfaronov        create