Title: urlparse accepts invalid hostnames
Type: enhancement
Stage: resolved
Components: Library (Lib)
Versions: Python 3.4
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: daenney, orsenthil, r.david.murray, terry.reedy
Priority: normal
Keywords:

Created on 2013-10-30 13:16 by daenney, last changed 2014-12-07 23:34 by terry.reedy. This issue is now closed.

Messages (5)
msg201730 - Author: Daniele Sluijters (daenney) Date: 2013-10-30 13:16
Python 2's urlparse.urlparse() and Python 3's urllib.parse.urlparse() accept URIs/URLs with underscores in the host/domain/subdomain. I believe this behaviour is incorrect.
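
A minimal sketch of the behaviour described above, assuming Python 3; the host name my_host.example.com is made up for illustration:

```python
from urllib.parse import urlparse

# Hypothetical URL with an underscore in the first host label.
parts = urlparse("http://my_host.example.com/index.html")

# urlparse splits the URL without validating the host name,
# so the underscore is passed through untouched.
print(parts.netloc)    # my_host.example.com
print(parts.hostname)  # my_host.example.com
```

No exception is raised; any host name validation is left to the caller.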

A distinction needs to be made between DNS names on the one hand and Uniform Resource Locators and Identifiers on the other; urlparse is supposed to deal with the latter (correct me if I'm wrong).

According to RFC 2181 section 11 on the syntax of DNS names, the use of the underscore is allowed, and it is in use around the internet, especially in TXT and SRV records.

However, RFC 1738 on Uniform Resource Locators, section 3.1 (and its updates), always defines the 'hostname' part of the URL as:
Such a name consists of a sequence of domain labels separated by ".",
each domain label starting and ending with an alphanumeric character
and possibly also containing "-" characters.

On top of that, RFC 2396 on URIs, section 3.2.2, says:
Hostnames take the form described in Section 3 of [RFC1034] and
Section 2.1 of [RFC1123]: a sequence of domain labels separated by
".", each domain label starting and ending with an alphanumeric
character and possibly also containing "-" characters.  

The underscore is never mentioned as a valid character, neither in these RFCs nor in any of the references they cite, as far as I've been able to see.
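
For illustration only, the quoted rule (labels separated by ".", each starting and ending with an alphanumeric character and possibly containing "-") could be checked on top of urlparse roughly as follows. This is a sketch, not something urllib.parse provides: the helper name has_rfc_hostname and the example URLs are hypothetical, and IPv6 literals, internationalized names and the further restrictions the RFCs place on the top-level label are not handled.

```python
import re
from urllib.parse import urlparse

# One domain label as described in the quoted text: starts and ends
# with an alphanumeric character, may contain "-" in between.
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
_HOSTNAME_RE = re.compile(r"(?:%s)(?:\.(?:%s))*\Z" % (_LABEL, _LABEL))


def has_rfc_hostname(url):
    """Hypothetical helper: True if the host part of url follows the quoted rule."""
    host = urlparse(url).hostname
    return host is not None and _HOSTNAME_RE.match(host) is not None


print(has_rfc_hostname("http://example.com/"))           # True
print(has_rfc_hostname("http://my_host.example.com/"))   # False
print(has_rfc_hostname("http://my-host.example.com/a_b"))  # True
```

Only the host is checked, so an underscore elsewhere in the URL (the path in the last example) does not affect the result.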

Language implementations vary:
 * Ruby URI.parse does not allow for underscores in domain labels.
 * Perl URI and URI::URL allow for underscores.
 * treats the underscore as an illegal character in the domain part.
 * org.apache.http.httphost since 4.2.3 treats the underscore as an illegal character in the domain part.

 * Apache: Seems to tolerate underscores but there's been a whole discussion about this on the mailing lists.
 * nginx: Matches a server_name of '_' to 'any invalid domain name'. It seems to accept server_names with underscores in them but the behaviour is currently unknown to me.

 * IE cannot write cookies since IE 5.5 if host or subdomain part includes an underscore.
 * Just about every other browser is fine with it.

Please note that I'm only talking about the host/domain/subdomain part of URI's and URL's, something like is perfectly valid and should parse.
msg201736 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-30 14:14
Python often defaults to the practical over the strictly-conforming (unless there is a 'strict' flag :). We generally follow the lead of the browsers in implementing our web-related modules.

The situation here appears to be a real mess.  Here's an interesting overview of just the DNS question:

Given that changing this would be a backward incompatible change, I recommend closing this as won't fix.  I suspect the long term trend will be that everyone will eventually accept underscores, regardless of what the RFCs say.
msg201753 - Author: Daniele Sluijters (daenney) Date: 2013-10-30 17:35
The link you mention only deals with the DNS side of things. This issue is specifically not about that; it's about the URI/URL side of things, which is a very important distinction in this case.

I'm also not entirely sure I agree with the sentiment of "it's a mess anyway" so let's ignore the RFC. There's an RFC for a reason, and if more implementations started to behave accordingly the mess would clear itself up instead of becoming even more of a nightmare.

I can agree with the practical over strict approach though.
msg201754 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-30 17:42
Yes, I said that link only dealt with the DNS side of things...where there are also incompatibilities.

I don't think that strictly adhering to the URI RFCs would clear things up.  What about those domains that have _s and want to run web services on them?  It appears that the current RFCs have no provision for handling that case.
msg201950 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-11-01 23:39
The 3.4 urllib.parse.urlparse doc says "The module has been designed to match the Internet RFC on Relative Uniform Resource Locators. It supports the following URL schemes: <list of 24, including 'file:'>".

To me, 'support' means 'accept every valid URL for the particular scheme' but not necessarily 'reject every URL that is invalid for the particular scheme'.

The other RFC references are these:
"Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’." and
" The fragment is now parsed for all URL schemes (unless allow_fragment is false), in accordance with RFC 3986."
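
The two documented behaviours quoted above can be seen in a short sketch, assuming Python 3; the example.org URLs are made up for illustration:

```python
from urllib.parse import urlparse

# A netloc is only recognized when it is introduced by '//'.
with_slashes = urlparse("//www.example.org/index.html")
without_slashes = urlparse("www.example.org/index.html")
print(with_slashes.netloc)     # www.example.org
print(without_slashes.netloc)  # (empty string)
print(without_slashes.path)    # www.example.org/index.html

# The fragment is split off by default; with allow_fragments=False
# it stays attached to the preceding component (here, the path).
frag = urlparse("http://example.org/a#section")
nofrag = urlparse("http://example.org/a#section", allow_fragments=False)
print(frag.fragment)    # section
print(nofrag.fragment)  # (empty string)
print(nofrag.path)      # /a#section
```

Note that the keyword argument in the function signature is spelled allow_fragments.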

I currently see this, at best, as a request to deprecate 'over-acceptance', to be removed in the future. But if there are URLs in the wild that use _s, then practicality says that this should be closed as invalid.
Date                 User            Action  Args
2014-12-07 23:34:06  terry.reedy     set     status: open -> closed; resolution: wont fix; stage: resolved
2013-11-01 23:39:40  terry.reedy     set     versions: - Python 2.6, Python 3.1, Python 2.7, Python 3.2, Python 3.3, Python 3.5; nosy: + terry.reedy; messages: + msg201950; type: behavior -> enhancement
2013-10-30 17:42:18  r.david.murray  set     messages: + msg201754
2013-10-30 17:35:39  daenney         set     messages: + msg201753
2013-10-30 14:14:46  r.david.murray  set     nosy: + r.david.murray; messages: + msg201736
2013-10-30 13:16:02  daenney         create