
Author daenney
Recipients daenney, orsenthil
Date 2013-10-30.13:16:01
Message-id <1383138962.68.0.105445419369.issue19451@psf.upfronthosting.co.za>
In-reply-to
Content
Python 2's urlparse.urlparse() and Python 3's urllib.parse.urlparse() accept URIs/URLs with underscores in the host/domain/subdomain. I believe this behaviour to be incorrect.
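A minimal sketch of the behaviour being reported, using Python 3's urllib.parse. The URL is a made-up example and is not taken from any real report:

```python
from urllib.parse import urlparse

# Hypothetical URL with an underscore in one of the host labels.
parsed = urlparse("http://my_host.example.com/index.html")

# urlparse() performs no validation of the host; it simply splits it out.
print(parsed.netloc)
print(parsed.hostname)
```

Both attributes come back as the host with the underscore intact; no exception is raised.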

A distinction needs to be made between DNS names and Uniform Resource Locators and Identifiers. urlparse is supposed to deal with the latter (correct me if I'm wrong).

According to RFC 2181 section 11, on the syntax of DNS names, the underscore is allowed, and it is in use around the internet, especially in TXT and SRV records.

However, RFC 1738 on Uniform Resource Locators, section 3.1 (and its updates), always defines the 'hostname' part of the URL as follows:
Such a name consists of a sequence of domain labels separated by ".",
each domain label starting and ending with an alphanumeric character
and possibly also containing "-" characters.

On top of that, RFC 2396 on URIs, section 3.2.2, states:
Hostnames take the form described in Section 3 of [RFC1034] and
Section 2.1 of [RFC1123]: a sequence of domain labels separated by
".", each domain label starting and ending with an alphanumeric
character and possibly also containing "-" characters.  

The underscore is never mentioned as a valid character, neither in these RFCs nor in any of the references they cite, as far as I've been able to see.
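For illustration, the wording quoted above can be turned into a small check. This is only a sketch of the quoted rule, not something urlparse does or a proposed patch. The helper name is_rfc_hostname is hypothetical, and the check ignores other constraints such as label length limits and IP address forms:

```python
import re

# One domain label per the quoted wording: it starts and ends with an
# alphanumeric character and may contain "-" characters in between.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_rfc_hostname(host):
    """Hypothetical helper: True if every "."-separated label fits the rule."""
    return all(LABEL.fullmatch(label) for label in host.split("."))


for candidate in ("en.wikipedia.org", "my_host.example.com", "-bad.example.com"):
    print(candidate, is_rfc_hostname(candidate))
```

Under this reading, a label containing an underscore is rejected, as is a label that starts or ends with a hyphen.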

Language implementations vary:
 * Ruby URI.parse does not allow for underscores in domain labels.
 * Perl URI and URI::URL allow for underscores.
 * java.net.URI treats the underscore as an illegal character in the domain part.
 * org.apache.http.HttpHost since 4.2.3 treats the underscore as an illegal character in the domain part.

HTTP servers:
 * Apache: Seems to tolerate underscores, but there has been a whole discussion about this on the mailing lists.
 * nginx: Matches a server_name of '_' to 'any invalid domain name'. It seems to accept server_names with underscores in them, but the exact behaviour is currently unknown to me.

Browsers:
 * IE has been unable to write cookies since IE 5.5 if the host or subdomain part includes an underscore.
 * Just about every other browser is fine with it.

Please note that I'm only talking about the host/domain/subdomain part of URIs and URLs. Something like http://en.wikipedia.org/wiki/12-hour_clock is perfectly valid and should parse.
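To make the scope concrete, here is a small sketch that looks for underscores in the host component only and leaves the path alone. The function name host_contains_underscore and the second URL are hypothetical:

```python
from urllib.parse import urlparse


def host_contains_underscore(url):
    """Hypothetical helper: inspect only the host, never the path or query."""
    host = urlparse(url).hostname or ""
    return "_" in host


# Underscore in the path only: the host itself is fine.
print(host_contains_underscore("http://en.wikipedia.org/wiki/12-hour_clock"))

# Underscore in a host label (made-up example URL).
print(host_contains_underscore("http://my_host.example.com/wiki/12-hour_clock"))
```

Only the second URL is flagged, since the underscore in the first one sits in the path.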
History
Date User Action Args
2013-10-30 13:16:02 daenney set recipients: + daenney, orsenthil
2013-10-30 13:16:02 daenney set messageid: <1383138962.68.0.105445419369.issue19451@psf.upfronthosting.co.za>
2013-10-30 13:16:02 daenney link issue19451 messages
2013-10-30 13:16:01 daenney create