classification
Title: Inconsistent/undocumented urlsplit/urlparse behavior on invalid inputs
Type: behavior
Stage: patch review
Components: Documentation, Library (Lib)
Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Howie Benefiel, docs@python, orsenthil, rhettinger, vfaronov
Priority: normal
Keywords:

Created on 2017-02-25 19:45 by vfaronov, last changed 2017-04-15 00:23 by berker.peksag.

Pull Requests
URL      Status  Linked
PR 1128  open    python-dev, 2017-04-14 05:38
Messages (3)
msg288577 - Author: Vasiliy Faronov (vfaronov) Date: 2017-02-25 19:45
There is a problem with the standard library's urlsplit and urlparse functions, in Python 2.7 (module urlparse) and 3.2+ (module urllib.parse).

The documentation for these functions [1] does not explain how they behave when given an invalid URL.

One could try invoking them manually and conclude that they tolerate anything thrown at them:

>>> urlparse('http:////::\\\\!!::!!++///')
ParseResult(scheme='http', netloc='', path='//::\\\\!!::!!++///',
params='', query='', fragment='')

>>> urlparse(os.urandom(32).decode('latin-1'))
ParseResult(scheme='', netloc='', path='\x7f¼â1gdä»6\x82', params='',
query='', fragment='\n\xadJ\x18+fli\x9cÛ\x9ak*ÄÅ\x02³F\x85Ç\x18')

Without studying the source code, it is impossible to know that there is a very narrow class of inputs on which they raise ValueError [2]:

>>> urlparse('http://[')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/urllib/parse.py", line 295, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python3.5/urllib/parse.py", line 345, in urlsplit
    raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL
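
In practice this means that callers who want predictable behavior have to guard against both outcomes themselves. A minimal sketch of such a guard (safe_urlsplit is purely illustrative here, not an existing or proposed API):

from urllib.parse import urlsplit

def safe_urlsplit(url):
    # urlsplit raises ValueError only for a narrow class of inputs,
    # such as 'http://[' ("Invalid IPv6 URL"); everything else is
    # split on a best-effort basis.
    try:
        return urlsplit(url)
    except ValueError:
        return None

With a wrapper like this, 'http://[' is handled the same way as any other garbage input instead of propagating an exception.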

This could be viewed as a documentation issue. But it could also be viewed as an implementation issue. Instead of raising ValueError on those square brackets, urlsplit could simply consider them *invalid* parts of an RFC 3986 reg-name, and lump them into netloc, as it already does with other *invalid* characters:

>>> urlparse('http://\0\0æí\n/')
ParseResult(scheme='http', netloc='\x00\x00æí\n', path='/', params='',
query='', fragment='')
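
Under that alternative, the square-bracket input above would presumably come out along these lines (hypothetical output, not what the current implementation produces):

>>> urlparse('http://[')
ParseResult(scheme='http', netloc='[', path='', params='', query='',
fragment='')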

Note that the raising behavior was introduced in Python 2.7/3.2.

See also issue 8721 [3].


[1] https://docs.python.org/3/library/urllib.parse.html
[2] https://github.com/python/cpython/blob/e32ec93/Lib/urllib/parse.py#L406-L408
[3] http://bugs.python.org/issue8721
msg288959 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-03-04 05:15
A note in the docs would be useful.  This API is far too well established to make any behavioral changes at this point.
msg291640 - Author: Howie Benefiel (Howie Benefiel) * Date: 2017-04-14 05:07
I'm going to make a note in the documentation. I should have a PR for it in about 1 day.
History
Date User Action Args
2017-04-15 00:23:13  berker.peksag   set     stage: needs patch -> patch review; versions: + Python 3.5
2017-04-14 05:38:02  python-dev      set     pull_requests: + pull_request1263
2017-04-14 05:07:23  Howie Benefiel  set     nosy: + Howie Benefiel; messages: + msg291640
2017-03-04 05:15:17  rhettinger      set     nosy: + rhettinger; messages: + msg288959
2017-03-03 21:21:09  terry.reedy     set     nosy: + orsenthil; stage: needs patch; versions: - Python 3.3, Python 3.4, Python 3.5
2017-02-25 19:45:27  vfaronov        create