Message 288577 - Python tracker


Author	vfaronov
Recipients	docs@python, vfaronov
Date	2017-02-25.19:45:26
Message-id	<1488051927.22.0.367579438822.issue29651@psf.upfronthosting.co.za>
In-reply-to

Content
There is a problem with the standard library's urlsplit and urlparse functions, in Python 2.7 (module urlparse) and 3.2+ (module urllib.parse).

The documentation for these functions [1] does not explain how they behave when given an invalid URL.

One could try invoking them manually and conclude that they tolerate anything thrown at them:

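>>> import os
>>> from urllib.parse import urlparse  # Python 3; in Python 2.7: from urlparse import urlparse
>>> urlparse('http:////::\\\\!!::!!++///')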
ParseResult(scheme='http', netloc='', path='//::\\\\!!::!!++///',
params='', query='', fragment='')

>>> urlparse(os.urandom(32).decode('latin-1'))
ParseResult(scheme='', netloc='', path='\x7f¼â1gdä»6\x82', params='',
query='', fragment='\n\xadJ\x18+fli\x9cÛ\x9ak*ÄÅ\x02³F\x85Ç\x18')

Without studying the source code, one cannot know that there is a very narrow class of inputs on which they raise ValueError [2]:

>>> urlparse('http://[')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/urllib/parse.py", line 295, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python3.5/urllib/parse.py", line 345, in urlsplit
    raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL
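
For callers who need to handle arbitrary input today, one possible workaround is to treat ValueError as "not parseable". The following is only a minimal sketch: the helper name safe_urlsplit is hypothetical and not part of the standard library, and it assumes that returning None is an acceptable way to signal a rejected URL.

```python
from typing import Optional
from urllib.parse import SplitResult, urlsplit


def safe_urlsplit(url: str) -> Optional[SplitResult]:
    """Hypothetical helper: return the split URL, or None if urlsplit rejects it."""
    try:
        return urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError for a small set of inputs, such as an
        # unmatched '[' in the netloc ("Invalid IPv6 URL").
        return None


if __name__ == "__main__":
    print(safe_urlsplit('http://['))              # rejected input
    print(safe_urlsplit('http://example.com/a'))  # ordinary input
```

The same try/except applies to urlparse, since it calls urlsplit internally, as the traceback above shows.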

This could be viewed as a documentation issue. But it could also be viewed as an implementation issue. Instead of raising ValueError on unmatched square brackets, urlsplit could simply consider them *invalid* parts of an RFC 3986 reg-name, and lump them into netloc, as it already does with other *invalid* characters:

>>> urlparse('http://\0\0æí\n/')
ParseResult(scheme='http', netloc='\x00\x00æí\n', path='/', params='',
query='', fragment='')

Note that the raising behavior was introduced in Python 2.7/3.2.

See also issue 8721 [3].


[1] https://docs.python.org/3/library/urllib.parse.html
[2] https://github.com/python/cpython/blob/e32ec93/Lib/urllib/parse.py#L406-L408
[3] http://bugs.python.org/issue8721
