urlparse fails at parsing "www.python.org:80/" #61136
Comments
Hello, the failure can be reproduced with: ./python -c "from urlparse import urlparse ; print(urlparse('python.org:80/'))" (that is for 2.7, but the same happens on all the active 3.x branches). I'm attaching a test to expose this failure.
Adding Senthil as per expert list.
This is not a bug: urlparse is there to parse URLs, and URLs start with a URL scheme such as "http:". There is no way for a generic URL parser to know that "python.org:80/" is supposed to be "http://python.org:80/".
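To illustrate the point above, here is a minimal sketch in which the caller, not the parser, supplies the missing scheme. It uses the Python 3 module name `urllib.parse` (the thread uses the 2.7 `urlparse` module), and `parse_with_default_scheme` is a hypothetical helper, not part of the standard library. The `"://"` check is a simplifying assumption that only suits input known to be a bare `host[:port]/path`.

```python
from urllib.parse import urlparse


def parse_with_default_scheme(url, default_scheme="http"):
    """Hypothetical helper: prepend a scheme the caller already knows.

    Assumption: any input without "://" is a bare host[:port]/path.
    """
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    return urlparse(url)


result = parse_with_default_scheme("python.org:80/")
# With an explicit scheme and "//", the host and port land in netloc.
print(result.scheme, result.netloc, result.port, result.path)
```

The decision about which scheme to assume stays in application code, where the context is known, rather than in the generic parser.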
The documentation reports this example:

>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html',
params='', query='', fragment='')

but when executing it returns:

$ ./python -V
Python 2.7.3+
$ ./python -c "from urlparse import urlparse ; print urlparse('www.cwi.nl:80/%7Eguido/Python.html')"
ParseResult(scheme='www.cwi.nl', netloc='', path='80/%7Eguido/Python.html', params='', query='', fragment='')

which doesn't match.
Hmm, you're right. The behavior has been like this at least since Python 2.5:

Python 2.5.4 (r254:67916, Dec 16 2012, 20:33:12)
[GCC 4.6.3] on linux3
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
('www.cwi.nl', '', '80/%7Eguido/Python.html', '', '', '')

The docs refer to RFC 1808. From a quick glance at the BNF in section 2.2, RFC 1808 allows dots in the scheme, but also allows ":" in the path. So there seems to be a parsing ambiguity, but see section 2.4.2:

If the parse string contains a colon ":" after the first character

That would indicate that the implementation is correct and the documentation should be fixed. Senthil?
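The scheme rule referred to above can be sketched roughly as follows. This is one reading of RFC 1808 section 2.4.2, not the standard library's implementation: `split_scheme` and `SCHEME_CHARS` are hypothetical names, and the character set (letters, digits, "+", "." and "-") is an assumption taken from that reading of the RFC.

```python
import string

# Assumed set of characters allowed in a scheme name.
SCHEME_CHARS = set(string.ascii_letters + string.digits + "+-.")


def split_scheme(url):
    """Return (scheme, rest) using a simplified RFC 1808-style rule.

    If a colon appears after the first character and everything before it
    is a valid scheme character, that prefix is taken as the scheme.
    """
    i = url.find(":")
    if i > 0 and all(c in SCHEME_CHARS for c in url[:i]):
        return url[:i].lower(), url[i + 1:]
    return "", url


# Dots are valid scheme characters, so the host name is taken as the scheme.
print(split_scheme("www.cwi.nl:80/%7Eguido/Python.html"))
# No colon at all: nothing is split off.
print(split_scheme("www.cwi.nl/%7Eguido/Python.html"))
```

Under this rule the string "www.cwi.nl" before ":80" qualifies as a scheme, which is why the observed result differs from the documented one.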
I am noticing this one late. Sorry for that. I would give the doc example as:

>>> urlparse('www.cwi.nl/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html', params='', query='', fragment='')

instead of the current one, which introduces a tricky ":80" parse and invokes the rule that Georg pointed out in his message.

If I recollect, the point of the example was to show that URLs (following RFC 1808) should start with // for their netloc to be identified; otherwise it is treated as path. A ":" before a PORT without a preceding "scheme:" is really tricky for any application, so it is the right thing for the parser to identify anything before ":" as the scheme, and the implementation here is correct.

So, instead of fixing the example to identify the scheme as "www.cwi.nl", which is quite meaningless, the better fix is to change the example to urlparse('www.cwi.nl/%7Eguido/Python.html'), and the result remains the same. I am going ahead with the fix. Thanks.
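For comparison, a short sketch of the three input forms discussed in this thread, written against the Python 3 `urllib.parse` module (an assumption; the thread itself uses the 2.7 `urlparse` module). The variable names are illustrative only.

```python
from urllib.parse import urlparse

# Leading "//": the host and port are recognised as the netloc.
with_slashes = urlparse("//www.cwi.nl:80/%7Eguido/Python.html")

# No "//" and no colon: the whole string is treated as a path.
no_slashes = urlparse("www.cwi.nl/%7Eguido/Python.html")

# No "//" but a colon: the text before the colon is taken as the scheme.
with_port = urlparse("www.cwi.nl:80/%7Eguido/Python.html")

for parsed in (with_slashes, no_slashes, with_port):
    print(repr(parsed.scheme), repr(parsed.netloc), repr(parsed.path))
```

Only the first form yields a non-empty netloc, which matches the explanation that the netloc must be introduced by "//".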
New changeset 33895c474b4d by Senthil Kumaran in branch '2.7'
New changeset 5442a77b925c by Senthil Kumaran in branch '3.2'
New changeset 8928205f57f6 by Senthil Kumaran in branch '3.3'
New changeset 9caad461936e by Senthil Kumaran in branch 'default'
I have fixed the docs issue. Thanks for the report and following up.