Message 350663 - Python tracker



Author	maggyero
Recipients	Jeremy.Hylton, maggyero, orsenthil
Date	2019-08-28.14:54:51
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1567004091.58.0.0475239472757.issue37969@roundup.psfhosted.org>
In-reply-to

Content

The Python library documentation of the `urllib.parse.urlunparse <https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlunparse>`_ and `urllib.parse.urlunsplit <https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlunsplit>`_ functions states:

    This may result in a slightly different, but equivalent URL, if the URL that was parsed originally had unnecessary delimiters (for example, a ? with an empty query; the RFC states that these are equivalent).

So with the <http://example.com/?> URI::

    >>> import urllib.parse
    >>> urllib.parse.urlunparse(urllib.parse.urlparse("http://example.com/?"))
    'http://example.com/'
    >>> urllib.parse.urlunsplit(urllib.parse.urlsplit("http://example.com/?"))
    'http://example.com/'

But `RFC 3986 <https://tools.ietf.org/html/rfc3986?#section-6.2.3>`_ states the exact opposite:

    Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification.  For example, the URI "http://example.com/?" cannot be assumed to be equivalent to any of the examples above.  Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation.  The fragment component is not subject to any scheme-based normalization; thus, two URIs that differ only by the suffix "#" are considered different regardless of the scheme.

So maybe `urllib.parse.urlunparse` ∘ `urllib.parse.urlparse` and `urllib.parse.urlunsplit` ∘ `urllib.parse.urlsplit` are not supposed to be used for `syntax-based normalization <https://tools.ietf.org/html/rfc3986?#section-6>`_ of URIs. But still, both `urllib.parse.urlparse` and `urllib.parse.urlsplit` lose the "delimiter + empty component" information of the URI string, so they wrongly report the two URIs as equivalent::

    >>> import urllib.parse
    >>> urllib.parse.urlparse("http://example.com/?") == urllib.parse.urlparse("http://example.com/")
    True
    >>> urllib.parse.urlsplit("http://example.com/?") == urllib.parse.urlsplit("http://example.com/")
    True
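
For comparison, here is a minimal sketch of a split that keeps the "delimiter + empty component" information. It assumes the regular expression given in RFC 3986, Appendix B, and represents an absent component as ``None`` and a present but empty one as ``""``. The names ``split_keeping_delimiters`` and ``unsplit_keeping_delimiters`` are hypothetical and are not part of `urllib.parse`:

```python
import re

# Regular expression from RFC 3986, Appendix B. An optional group that does
# not participate in the match is reported by `re` as None, which lets us
# tell "no query" (None) apart from "empty query" ("").
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_keeping_delimiters(uri):
    """Hypothetical helper: return (scheme, authority, path, query, fragment).

    Absent components are None; present-but-empty components are "".
    The path is always a string (possibly empty).
    """
    match = URI_RE.match(uri)
    return (
        match.group(2),  # scheme
        match.group(4),  # authority
        match.group(5),  # path
        match.group(7),  # query
        match.group(9),  # fragment
    )


def unsplit_keeping_delimiters(parts):
    """Hypothetical helper: recompose the components, emitting a delimiter
    whenever its component is present, even if it is empty."""
    scheme, authority, path, query, fragment = parts
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

With such a representation, ``split_keeping_delimiters("http://example.com/?")`` and ``split_keeping_delimiters("http://example.com/")`` compare unequal, and recomposing the former gives back the trailing ``?``.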

P.-S. — Is there a syntax-based normalization function of URIs in the Python library?

History
Date	User	Action	Args
2019-08-28 14:54:51	maggyero	set	recipients: + maggyero, orsenthil, Jeremy.Hylton
2019-08-28 14:54:51	maggyero	set	messageid: <1567004091.58.0.0475239472757.issue37969@roundup.psfhosted.org>
2019-08-28 14:54:51	maggyero	link	issue37969 messages
2019-08-28 14:54:51	maggyero	create