Message 179712 - Python tracker



Author	georg.brandl
Recipients	georg.brandl, orsenthil, sandro.tosi
Date	2013-01-11.17:54:18
Message-id	<1357926859.42.0.981504662198.issue16932@psf.upfronthosting.co.za>

Content

Hmm, you're right.  The behavior has been like this at least since Python 2.5:

Python 2.5.4 (r254:67916, Dec 16 2012, 20:33:12) 
[GCC 4.6.3] on linux3
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
('www.cwi.nl', '', '80/%7Eguido/Python.html', '', '', '')

The docs refer to RFC 1808.  From a quick glance at the BNF in section 2.2, RFC 1808 allows dots in the scheme, but also allows ":" in the path.  So there seems to be a parsing ambiguity, but see section 2.4.2:

   If the parse string contains a colon ":" after the first character
   and before any characters not allowed as part of a scheme name (i.e.,
   any not an alphanumeric, plus "+", period ".", or hyphen "-"), the
   <scheme> of the URL is the substring of characters up to but not
   including the first colon.  These characters and the colon are then
   removed from the parse string before continuing.

That would indicate that the implementation is correct and the documentation should be fixed. Senthil?
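
For illustration, the scheme rule quoted from section 2.4.2 can be sketched in a few lines of Python. This is only a reading of the quoted paragraph, not the actual urlparse code; the names SCHEME_CHARS and split_scheme are made up for this example.

```python
import string

# Characters the quoted rule allows in a scheme name:
# alphanumerics plus "+", "." and "-".
SCHEME_CHARS = frozenset(string.ascii_letters + string.digits + "+.-")


def split_scheme(parse_string):
    """Return (scheme, rest) per the rule quoted from RFC 1808, section 2.4.2.

    Hypothetical helper for illustration only; it is not part of urlparse.
    """
    colon = parse_string.find(":")
    # The colon has to come after the first character ...
    if colon < 1:
        return "", parse_string
    candidate = parse_string[:colon]
    # ... and before any character that is not allowed in a scheme name.
    if not all(ch in SCHEME_CHARS for ch in candidate):
        return "", parse_string
    # The scheme and the colon are removed before parsing continues.
    return candidate, parse_string[colon + 1:]


print(split_scheme("www.cwi.nl:80/%7Eguido/Python.html"))
print(split_scheme("//www.cwi.nl:80/%7Eguido/Python.html"))
```

Under this reading, the first call takes 'www.cwi.nl' as the scheme, matching the interpreter session above, while the second finds no scheme because "/" appears before the first colon.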

History
Date	User	Action	Args
2013-01-11 17:54:19	georg.brandl	set	recipients: + georg.brandl, orsenthil, sandro.tosi
2013-01-11 17:54:19	georg.brandl	set	messageid: <1357926859.42.0.981504662198.issue16932@psf.upfronthosting.co.za>
2013-01-11 17:54:19	georg.brandl	link	issue16932 messages
2013-01-11 17:54:18	georg.brandl	create