Issue 37969: Correct urllib.parse functions dropping the delimiters of empty URI components

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/82150

classification

Title:	Correct urllib.parse functions dropping the delimiters of empty URI components
Type:	behavior	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Jeremy.Hylton, maggyero, nicktimko, op368, orsenthil
Priority:	normal	Keywords:	patch

Created on 2019-08-28 14:54 by maggyero, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 15642	open	maggyero, 2019-09-02 12:15

Messages (4)
msg350663 - (view)	Author: Géry (maggyero) *	Date: 2019-08-28 14:54
The Python library documentation of the `urllib.parse.urlunparse <https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlunparse>`_ and `urllib.parse.urlunsplit <https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlunsplit>`_ functions states:

    This may result in a slightly different, but equivalent URL, if the URL that was parsed originally had unnecessary delimiters (for example, a ? with an empty query; the RFC states that these are equivalent).

So with the <http://example.com/?> URI::

    >>> import urllib.parse
    >>> urllib.parse.urlunparse(urllib.parse.urlparse("http://example.com/?"))
    'http://example.com/'
    >>> urllib.parse.urlunsplit(urllib.parse.urlsplit("http://example.com/?"))
    'http://example.com/'

But `RFC 3986 <https://tools.ietf.org/html/rfc3986?#section-6.2.3>`_ states the exact opposite:

    Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification. For example, the URI "http://example.com/?" cannot be assumed to be equivalent to any of the examples above. Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation. The fragment component is not subject to any scheme-based normalization; thus, two URIs that differ only by the suffix "#" are considered different regardless of the scheme.

So maybe `urllib.parse.urlunparse` ∘ `urllib.parse.urlparse` and `urllib.parse.urlunsplit` ∘ `urllib.parse.urlsplit` are not supposed to be used for `syntax-based normalization <https://tools.ietf.org/html/rfc3986?#section-6>`_ of URIs.

But still, both `urllib.parse.urlparse` and `urllib.parse.urlsplit` lose the "delimiter + empty component" information of the URI string, so they falsely report the URIs as equivalent::

    >>> import urllib.parse
    >>> urllib.parse.urlparse("http://example.com/?") == urllib.parse.urlparse("http://example.com/")
    True
    >>> urllib.parse.urlsplit("http://example.com/?") == urllib.parse.urlsplit("http://example.com/")
    True

P.S.: Is there a syntax-based normalization function of URIs in the Python library?
msg350687 - (view)	Author: Nick Timkovich (nicktimko) *	Date: 2019-08-28 18:59
Looking at the history, the line in the docs used to say

> ... (for example, an empty query (the draft states that these are equivalent).

which was changed to "the RFC" in April 2006: https://github.com/python/cpython/commit/ad5177cf8da#diff-5b4cef771c997754f9e2feeae11d3b1eL68-R95

The original language was added in February 1995: https://github.com/python/cpython/commit/a12ef9433baf#diff-5b4cef771c997754f9e2feeae11d3b1eR48-R51

So "the draft" probably meant the draft of RFC 1738 https://tools.ietf.org/html/rfc1738#section-3.3 which is kinda vague on it. Rewording it as "the RFC" later didn't help, since there are 3+ RFCs referenced in the lib docs, one of which obsoleted another and definitely changed the meaning of the loose "?".

The draft revision of RFC 2396 (rfc2396bis, later published as RFC 3986) seems to have had the wording you point out, the opposite of the docs, at least as far back as draft 07 (September 2004): https://tools.ietf.org/html/draft-fielding-uri-rfc2396bis-07#section-6.2.3

Draft 06 (April 2004) was silent on the matter: https://tools.ietf.org/html/draft-fielding-uri-rfc2396bis-06#section-6.2.3
msg351043 - (view)	Author: Géry (maggyero) *	Date: 2019-09-02 22:43
@nicktimko Thanks for the historical track.

Here is a patch that solves this issue by updating the `urlsplit` and `urlunsplit` functions of the `urllib.parse` module to keep the '?' and '#' delimiters in URIs if present, even if their associated component is empty, as required by RFC 3986: https://github.com/python/cpython/pull/15642

That way we get the correct behavior::

    >>> import urllib.parse
    >>> urllib.parse.urlunsplit(urllib.parse.urlsplit("http://example.com/?"))
    'http://example.com/?'
    >>> urllib.parse.urlunsplit(urllib.parse.urlsplit("http://example.com/#"))
    'http://example.com/#'

Any feedback welcome.
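For readers on an unpatched interpreter, the round trip described above can be approximated outside the library. The following is only a minimal sketch under stated assumptions: `roundtrip_keeping_delimiters` is a hypothetical helper name, it is not the code from PR 15642, and it handles only the '?' and '#' delimiters (not an empty '//' authority).

```python
from urllib.parse import urlsplit, urlunsplit


def roundtrip_keeping_delimiters(uri):
    """Split and rebuild `uri`, keeping '?' and '#' even when empty.

    Hypothetical helper for illustration; not part of urllib.parse.
    """
    # Per RFC 3986 section 3, the fragment starts at the first '#',
    # and the query starts at the first '?' that precedes it.
    before_fragment, hash_sep, _ = uri.partition("#")
    _, query_sep, _ = before_fragment.partition("?")

    parts = urlsplit(uri)
    # Rebuild everything up to the path, then re-attach the query and
    # fragment by hand so that an empty component keeps its delimiter.
    rebuilt = urlunsplit(parts._replace(query="", fragment=""))
    if query_sep:
        rebuilt += "?" + parts.query
    if hash_sep:
        rebuilt += "#" + parts.fragment
    return rebuilt
```

The delimiter presence is read from the raw string because the standard result objects store an empty string both for an absent component and for an empty one.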
msg371180 - (view)	Author: Open Close (op368) *	Date: 2020-06-10 11:21
This is a duplicate of issue22852 ('urllib.parse wrongly strips empty #fragment, ?query, //netloc').

Also note that three alternative solutions have already been proposed.

(1) Add a 'None' value to the members of Result objects, like this one. That proposal, however, covers not only query and fragment but also netloc, which may solve many other issues.
(2) Add 'has_netloc', 'has_query' and 'has_fragment' attributes.
(3) Like (1), but conditional on an 'allow_none' argument (similar to 'allow_fragments').
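To make alternative (2) more concrete, here is a rough sketch of what presence flags could look like if layered on top of the existing `urlsplit`. Everything named here (`SplitWithFlags`, `urlsplit_with_flags`) is hypothetical and not taken from issue22852 or any patch; `has_netloc` is left out to keep the sketch short.

```python
from typing import NamedTuple
from urllib.parse import SplitResult, urlsplit


class SplitWithFlags(NamedTuple):
    """Hypothetical wrapper: a standard SplitResult plus presence flags."""
    parts: SplitResult
    has_query: bool
    has_fragment: bool


def urlsplit_with_flags(uri):
    # The flags are derived from the raw string, because SplitResult
    # stores '' for both a missing and an empty component.
    before_fragment, hash_sep, _ = uri.partition("#")
    _, query_sep, _ = before_fragment.partition("?")
    return SplitWithFlags(urlsplit(uri), bool(query_sep), bool(hash_sep))
```

With such a wrapper, `urlsplit_with_flags("http://example.com/?")` and `urlsplit_with_flags("http://example.com/")` no longer compare equal, since the `has_query` flags differ even though the wrapped `SplitResult` values are identical.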

History
Date	User	Action	Args
2022-04-11 14:59:19	admin	set	github: 82150
2020-06-10 11:21:47	op368	set	nosy: + op368 messages: + msg371180
2019-09-11 06:13:07	maggyero	set	title: urllib.parse functions reporting false equivalent URIs -> Correct urllib.parse functions dropping the delimiters of empty URI components
2019-09-02 22:43:24	maggyero	set	messages: + msg351043
2019-09-02 12:15:07	maggyero	set	keywords: + patch stage: patch review pull_requests: + pull_request15308
2019-08-28 18:59:13	nicktimko	set	nosy: + nicktimko messages: + msg350687
2019-08-28 14:54:51	maggyero	create