classification
Title: Inconsistent/undocumented urlsplit/urlparse behavior on invalid inputs
Type: behavior Stage: resolved
Components: Documentation, Library (Lib) Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Howie Benefiel, docs@python, orsenthil, rhettinger, vfaronov
Priority: normal Keywords:

Created on 2017-02-25 19:45 by vfaronov, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL Status Linked
PR 1128 merged python-dev, 2017-04-14 05:38
PR 1596 merged orsenthil, 2017-05-16 05:01
PR 1597 merged orsenthil, 2017-05-16 05:09
Messages (6)
msg288577 - Author: Vasiliy Faronov (vfaronov) Date: 2017-02-25 19:45
There is a problem with the standard library's urlsplit and urlparse functions, in Python 2.7 (module urlparse) and 3.2+ (module urllib.parse).

The documentation for these functions [1] does not explain how they behave when given an invalid URL.

One could try invoking them manually and conclude that they tolerate anything thrown at them:

>>> from urllib.parse import urlparse  # Python 2.7: from urlparse import urlparse
>>> urlparse('http:////::\\\\!!::!!++///')
ParseResult(scheme='http', netloc='', path='//::\\\\!!::!!++///',
params='', query='', fragment='')

>>> import os
>>> urlparse(os.urandom(32).decode('latin-1'))
ParseResult(scheme='', netloc='', path='\x7f¼â1gdä»6\x82', params='',
query='', fragment='\n\xadJ\x18+fli\x9cÛ\x9ak*ÄÅ\x02³F\x85Ç\x18')

Without studying the source code, it is impossible to know that there is a very narrow class of inputs on which they raise ValueError [2]:

>>> urlparse('http://[')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/urllib/parse.py", line 295, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python3.5/urllib/parse.py", line 345, in urlsplit
    raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL
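
A caller that must accept arbitrary input can guard against this case by catching the exception. The following is only a minimal sketch, assuming Python 3; `try_urlsplit` is a hypothetical helper name for illustration and is not part of urllib.parse:

```python
from urllib.parse import urlsplit


def try_urlsplit(url):
    """Return urlsplit(url), or None if urlsplit raises ValueError.

    Hypothetical helper for illustration; not part of the standard library.
    """
    try:
        return urlsplit(url)
    except ValueError:
        # Raised for the narrow class of inputs described above,
        # e.g. an unmatched '[' in the netloc.
        return None


print(try_urlsplit('http://['))  # None
print(try_urlsplit('http://example.com/a?b=c'))
```

Returning None is just one possible design; a caller could equally re-raise a domain-specific error or fall back to treating the whole string as a path.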

This could be viewed as a documentation issue. But it could also be viewed as an implementation issue. Instead of raising ValueError on those square brackets, urlsplit could simply consider them *invalid* parts of an RFC 3986 reg-name, and lump them into netloc, as it already does with other *invalid* characters:

>>> urlparse('http://\0\0æí\n/')
ParseResult(scheme='http', netloc='\x00\x00æí\n', path='/', params='',
query='', fragment='')
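
For reference, the condition that triggers the ValueError appears to come down to unbalanced square brackets in the netloc. Below is a rough paraphrase of that check as I read the source linked as [2], not a copy of it; `has_unbalanced_brackets` is a hypothetical name used only for this sketch:

```python
def has_unbalanced_brackets(netloc):
    """Hypothetical helper: True if exactly one of '[' or ']' occurs in netloc.

    This mirrors, in simplified form, the condition under which urlsplit
    is understood to raise ValueError("Invalid IPv6 URL").
    """
    return ('[' in netloc) != (']' in netloc)


for candidate in ('[', ']', '[::1]', 'example.com', '\x00\x00\n'):
    print(repr(candidate), has_unbalanced_brackets(candidate))
```

Under this reading, other invalid characters such as NUL or newline never reach the raising branch, which is consistent with the examples above.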

Note that the raising behavior was introduced in Python 2.7/3.2.

See also issue 8721 [3].


[1] https://docs.python.org/3/library/urllib.parse.html
[2] https://github.com/python/cpython/blob/e32ec93/Lib/urllib/parse.py#L406-L408
[3] http://bugs.python.org/issue8721
msg288959 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-03-04 05:15
A note in the docs would be useful.  This API is far too well established to make any behavioral changes at this point.
msg291640 - Author: Howie Benefiel (Howie Benefiel) * Date: 2017-04-14 05:07
I'm going to make a note in the documentation. I should have a PR for it in about 1 day.
msg293748 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2017-05-16 04:48
New changeset f6e863d868a621594df2a8abe072b5d4766e7137 by Senthil Kumaran (Howie Benefiel) in branch 'master':
 bpo-29651 - Cover edge case of square brackets in urllib docs (#1128)
https://github.com/python/cpython/commit/f6e863d868a621594df2a8abe072b5d4766e7137
msg293750 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2017-05-16 05:41
New changeset 72e5aa1ef812358b3b113e784e7365fec13dfd69 by Senthil Kumaran in branch '3.5':
 bpo-29651 - Cover edge case of square brackets in urllib docs (#1128) (#1597)
https://github.com/python/cpython/commit/72e5aa1ef812358b3b113e784e7365fec13dfd69
msg293751 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2017-05-16 05:41
New changeset 75b8a54bcad70806d9dcbbe20786f4d9092ab39c by Senthil Kumaran in branch '3.6':
 bpo-29651 - Cover edge case of square brackets in urllib docs (#1128) (#1596)
https://github.com/python/cpython/commit/75b8a54bcad70806d9dcbbe20786f4d9092ab39c
History
Date User Action Args
2022-04-11 14:58:43 admin set github: 73837
2017-05-16 05:50:14 orsenthil set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2017-05-16 05:41:10 orsenthil set messages: + msg293751
2017-05-16 05:41:05 orsenthil set messages: + msg293750
2017-05-16 05:09:39 orsenthil set pull_requests: + pull_request1691
2017-05-16 05:01:30 orsenthil set pull_requests: + pull_request1690
2017-05-16 04:48:18 orsenthil set messages: + msg293748
2017-04-15 00:23:13 berker.peksag set stage: needs patch -> patch review; versions: + Python 3.5
2017-04-14 05:38:02 python-dev set pull_requests: + pull_request1263
2017-04-14 05:07:23 Howie Benefiel set nosy: + Howie Benefiel; messages: + msg291640
2017-03-04 05:15:17 rhettinger set nosy: + rhettinger; messages: + msg288959
2017-03-03 21:21:09 terry.reedy set nosy: + orsenthil; stage: needs patch; versions: - Python 3.3, Python 3.4, Python 3.5
2017-02-25 19:45:27 vfaronov create