classification
Title: urlparse example is wrong
Type: behavior
Stage: resolved
Components: Documentation
Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: orsenthil
Nosy List: belopolsky, docs@python, eric.araujo, georg.brandl, orsenthil, r.david.murray
Priority: normal
Keywords:

Created on 2010-10-29 05:26 by belopolsky, last changed 2010-11-08 02:37 by r.david.murray. This issue is now closed.

Messages (10)
msg119855 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-10-29 05:26
The following example in Doc/library/urlparse.rst is wrong:

>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html',
           params='', query='', fragment='')

In the actual output, scheme='www.cwi.nl'.

In addition, the preceding text is confusing and probably not grammatical:

"""
Otherwise, it is not possible to distinguish between netloc and path components, and would the indistinguishable component would be classified as the path as in a relative URL.
"""

Discovered while working on issue 10225.
msg119857 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-10-29 05:51
Looks like I've been bitten again by make doctest picking up an older Python, but something is not right here:

In Python 2.6.5:


>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='www.cwi.nl', netloc='', path='80/%7Eguido/Python.html', params='', query='', fragment='')

but in 2.7:

>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html', params='', query='', fragment='')


and the text preceding the example in the doc does not really tell which is right.
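A quick way to see which result a given interpreter produces is to print the interpreter version next to the parse result. This is only a sketch: the import fallback is an assumption so that the same snippet runs on both Python 2 (module urlparse) and Python 3 (module urllib.parse), and the split between scheme and path is deliberately not asserted, since it differs between versions as shown above.

```python
import sys

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

url = 'www.cwi.nl:80/%7Eguido/Python.html'
result = urlparse(url)

# Show which interpreter actually ran, then what it produced.
print(sys.version_info[:3])
print(result)

# Without a leading '//' no netloc is recognized.
print(repr(result.netloc))
```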
msg119859 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-10-29 06:15
I think this is correct: it is the new behavior after the fix for #754016 was committed.
msg119867 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-10-29 07:05
On Fri, Oct 29, 2010 at 2:15 AM, Georg Brandl <report@bugs.python.org> wrote:
..
> I think this is correct: it is the new behavior after the fix for #754016 was committed.
>

I agree.  I kept the issue open because I cannot parse

"""
Otherwise, it is not possible to distinguish between netloc and path
components, and would the indistinguishable component would be
classified as the path as in a relative URL.
"""
msg119868 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-10-29 07:06
That's for Senthil to rephrase as intended :)
msg119873 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-10-29 08:56
-   Otherwise, it is not possible to distinguish between netloc and path
-   components, and would the indistinguishable component would be classified
-   as the path as in a relative URL.
+   If the netloc does not start with '//', the module cannot distinguish it
+   from path and it would classify it as path component in the relative url.

How does this sound?
msg119914 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-10-29 16:10
// is not part of the netloc in RFC terms, it’s a delimiter between components
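To illustrate the point, a minimal sketch (the import fallback is an assumption to cover both Python 2 and Python 3): when the input starts with '//', the parsed netloc holds only the host and port, and the delimiter itself does not appear in it.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

result = urlparse('//www.cwi.nl:80/%7Eguido/Python.html')

print(result.netloc)    # host and port only, no leading '//'
print(result.hostname)
print(result.port)
print(result.path)      # starts at the first '/' after the netloc
```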
msg119991 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-30 14:51
How about this:

-  If the scheme value is not specified, urlparse following the syntax
-  specifications from RFC 1808, expects the netloc value to start with '//',
-  Otherwise, it is not possible to distinguish between net_loc and path
-  component and would classify the indistinguishable component as path as in
-  a relative url.

+  Following the syntax specifications in RFC 1808, urlparse recognizes
+  a netloc only if it is properly introduced by '//'.  Otherwise the
+  input must be presumed to be a relative URL and thus to start with
+  a path component.


However, it seems to me there is a bug here:

>>> urlparse.urlparse('www.k.com:80/path')
ParseResult(scheme='', netloc='', path='www.k.com:80/path', params='',
query='', fragment='')
>>> urlparse.urlparse('www.k.com:path')
ParseResult(scheme='www.k.com', netloc='', path='path', params='',
query='', fragment='')

I think the second one is correct and that the first one should produce

ParseResult(scheme='www.k.com', netloc='', path='80/path', params='',
query='', fragment='')

I haven't read all the way through the RFC again, though.  But *one*
of the above is wrong.
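The proposed wording can be checked with inputs that contain no ':' at all, which sidesteps the disputed case. A sketch, assuming an import fallback so it runs on both Python 2 and Python 3; the host www.k.com is reused from the examples in this message:

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

# No '//' introducer: treated as a relative URL, everything is path.
relative = urlparse('www.k.com/path')
print(repr(relative.netloc), repr(relative.path))

# With the '//' introducer: the host is recognized as the netloc.
with_netloc = urlparse('//www.k.com/path')
print(repr(with_netloc.netloc), repr(with_netloc.path))
```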
msg120678 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-11-07 13:21
Fixed the wording in r86296 (py3k), r86297 (release31-maint) and r86298 (release27-maint).

David, for the examples you mentioned, the first one's parsing logic follows the explanation that is written. It is correct.

For the second example, the behavior arises because the port value is not a DIGIT. I am unable to recollect the reason for this behavior.
Either the URL is invalid (the PORT is not a DIGIT, and the parse module simply does not raise an error, which is okay given that the input is invalid), or the module needs to distinguish ':' as a port separator from ':' as a path separator for some valid URLs.

I think that if we find a better reason to change something for the second scenario, we can address it then.
msg120710 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-11-08 02:37
Senthil, no it isn't.  There is no way to know a priori that ':80' represents a port number rather than a path, absent the // introducer for the netloc.

This bug is fixed; I ought to open a new one for the path thing but perhaps I will wait for a user report instead :)
History
Date                 User            Action  Args
2010-11-08 02:37:27  r.david.murray  set     messages: + msg120710
2010-11-07 13:21:58  orsenthil       set     status: open -> closed; type: behavior; messages: + msg120678; resolution: fixed; stage: resolved
2010-10-30 14:51:17  r.david.murray  set     nosy: + r.david.murray; messages: + msg119991
2010-10-29 16:10:29  eric.araujo     set     nosy: + eric.araujo; messages: + msg119914
2010-10-29 08:56:04  orsenthil       set     messages: + msg119873
2010-10-29 07:06:12  georg.brandl    set     messages: + msg119868
2010-10-29 07:05:13  belopolsky      set     messages: + msg119867
2010-10-29 06:15:06  georg.brandl    set     nosy: + georg.brandl; messages: + msg119859
2010-10-29 05:51:24  belopolsky      set     messages: + msg119857
2010-10-29 05:32:19  georg.brandl    set     assignee: docs@python -> orsenthil; nosy: + orsenthil
2010-10-29 05:26:04  belopolsky      create