classification
Title: urlparse.urlparse() parses invalid URI without generating an error (examples provided)
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: amrith, eric.smith, martin.panter, orsenthil, r.david.murray, serhiy.storchaka
Priority: normal Keywords:

Created on 2016-11-30 15:57 by amrith, last changed 2016-12-01 04:40 by r.david.murray.

Messages (8)
msg282086 - (view) Author: Amrith Kumar (amrith) Date: 2016-11-30 15:57
The urlparse library incorrectly accepts a URI which specifies an invalid host identifier.

Example:

http://www.example.com:/abc

Looking at the URI specifications, this is an invalid URI.

By definition, you are supposed to specify a "hostport" and a hostport is defined as:

https://www.w3.org/Addressing/URL/uri-spec.html

hostport
    host [ : port ]

The BNF indicates that : is only valid if a port is also specified.

See current behavior; I submit to you that this should generate an exception.

https://gist.github.com/anonymous/8504f160ff90649890b5a2a82f8028b0
msg282116 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-11-30 23:37
Can you please paste your code into a comment on this issue? Gist content has a habit of vanishing. Thanks!
msg282117 - (view) Author: Amrith Kumar (amrith) Date: 2016-11-30 23:47
As requested ...

>>> urlparse.urlparse('http://www.google.com:/abc')
ParseResult(scheme='http', netloc='www.google.com:', path='/abc', params='', query='', fragment='')

I submit to you that this should generate an error.
msg282119 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-12-01 00:31
The Python documentation refers to RFC 3986, which allows an empty port string introduced by a colon, although it recommends against it: <https://tools.ietf.org/html/rfc3986#section-3.2.3>

The “port” subcomponent of “authority” is designated by an optional port number in decimal following the host and delimited from it by a single colon. . . . URI producers and normalizers should omit the port component and its “:” delimiter if “port” is empty . . .

What problem are you trying to solve by raising an error?

See also Issue 20059, where accessing the “port” field now raises ValueError for out-of-range values, and Issue 20271, discussing invalid URL handling more generally.
msg282120 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-12-01 00:33
Also, this is in direct contradiction to Issue 20270.
msg282122 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-12-01 00:41
And note that there are tests that explicitly check that the colon with no port works (via issue 20270).

Given that this behavior has been around for a while, and is explicitly allowed, I would recommend against changing this to an error.
msg282129 - (view) Author: Amrith Kumar (amrith) Date: 2016-12-01 02:33
Eric, Martin,

I'm sure this is an interpretation (and I'll be fair and give you both) that a URL of the form:

http://hostname.domain.tld:/path

should be held invalid. The rationale is per the section of the spec you quoted.

The other side of the argument is that the hostname and port are defined as:

hostname [ : port ] 

where port is defined as

*DIGIT

This implies that 0 digits is also valid.

I submit to you that the ambiguity would be removed if they:

- removed the paragraph telling people not to emit a : if there was no port, or
- changing the port definition to 1*DIGIT

Absent that, I believe that the paragraph implies that the intent was 1*DIGIT.

And therefore a : with no following number should generate an error.

I raise this because the behavior of urlparse() (the way it does things now) is being cited as the reason why other routines in OpenStack should handle this when the reason we were getting URL's with a : and no port was in fact an oversight.

So, in hearing the rationalization as being that "urlparse() does it, so ..." I figured I'd come to the source and see if we could make urlparse() do things unambiguously.

Now, if the reason it does this is because someone who came before me made the argument that a : with no port should be accepted, I guess I'm out of luck.
msg282135 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-12-01 04:40
Yes, I'm afraid so.  But the rationale is presumably the same as it is in many RFCs: you should accept broken stuff if it can be interpreted unambiguously, but you should not *produce* the broken stuff.  So I'd say the RFC is correct as quoted.  It is then up to openstack whether or not it wants to accept such URLs in a given interface, and it if were me I'd base that decision on whether it was an 'internal' interface or an 'external' one.  But you can also argue that even an external interface could be strict, depending on the application domain of the interface.  urllib, on the other hand, needs to accept it, and we can't change it at this point for backward compatibility reasons if nothing else.

Based on the RFC, one could argue that urlunsplit should omit the colon.  That could break backward compatibility, too, though will a lot less likelyhood.  So we could still fix it in 3.7 if we decide we should.
History
Date User Action Args
2016-12-01 04:40:56r.david.murraysetnosy: + r.david.murray
messages: + msg282135
2016-12-01 02:33:50amrithsetmessages: + msg282129
2016-12-01 00:41:28eric.smithsetmessages: + msg282122
2016-12-01 00:33:37martin.pantersetnosy: + serhiy.storchaka
messages: + msg282120
2016-12-01 00:31:11martin.pantersetnosy: + martin.panter
messages: + msg282119
2016-11-30 23:47:12amrithsetmessages: + msg282117
2016-11-30 23:37:26eric.smithsetnosy: + eric.smith
messages: + msg282116
2016-11-30 17:31:20SilentGhostsetnosy: + orsenthil

versions: + Python 3.5, Python 3.6, Python 3.7, - Python 3.4
2016-11-30 15:57:05amrithcreate