classification
Title: urlparse fails at parsing "www.python.org:80/"
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2, Python 3.3, Python 3.4, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: asvetlov, georg.brandl, miss-islington, orsenthil, python-dev, sandro.tosi
Priority: normal Keywords:

Created on 2013-01-11 11:29 by sandro.tosi, last changed 2019-10-18 15:23 by orsenthil. This issue is now closed.

Files
File name Uploaded Description
urlparse.diff sandro.tosi, 2013-01-11 11:29
Pull Requests
URL Status Linked
PR 661 merged Tim.Graham, 2017-03-13 17:39
PR 16837 merged miss-islington, 2019-10-18 13:07
PR 16839 merged orsenthil, 2019-10-18 13:51
Messages (11)
msg179670 - Author: Sandro Tosi (sandro.tosi) * (Python committer) Date: 2013-01-11 11:29
Hello,
As reported at http://mail.python.org/pipermail/docs/2013-January/012375.html, urlparse fails to parse URLs that have no scheme but do have a URL path, contrary to what is documented at http://docs.python.org/2/library/urlparse.html?highlight=urlparse#urlparse :

./python -c "from urlparse import urlparse ; print(urlparse('python.org:80/'))"
ParseResult(scheme='python.org', netloc='', path='80/', params='', query='', fragment='')

(That is for 2.7, but the same happens on all the active 3.x branches.)

I'm attaching a test that exposes this failure.
msg179672 - Author: Sandro Tosi (sandro.tosi) * (Python committer) Date: 2013-01-11 11:36
Adding Senthil, as per the experts list.
msg179702 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-01-11 16:41
This is not a bug: urlparse is there to parse URLs, and URLs start with a URL scheme such as "http:".

There is no way for a generic URL parser to know that "python.org:80/" is supposed to be "http://python.org:80/".
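
A caller who does know which scheme is intended can add it before parsing. The following is only a minimal sketch for Python 3 (`urllib.parse`); the helper name `parse_with_default_scheme` and its simple "://" check are illustrative assumptions, not part of the library:

```python
from urllib.parse import urlparse


def parse_with_default_scheme(url, default_scheme="http"):
    """Hypothetical helper: prepend a scheme when the input has no '://'."""
    if "://" not in url:
        # The caller, not the parser, decides what the missing scheme is.
        url = default_scheme + "://" + url
    return urlparse(url)


result = parse_with_default_scheme("python.org:80/")
print(result.scheme, result.netloc, result.path, result.port)
```

With the scheme supplied, the host and port end up in `netloc` rather than being mistaken for a scheme and a path.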
msg179709 - Author: Sandro Tosi (sandro.tosi) * (Python committer) Date: 2013-01-11 17:35
The documentation reports this example:

>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html',
           params='', query='', fragment='')

but when executing it returns:

$ ./python -V
Python 2.7.3+
$ ./python -c "from urlparse import urlparse ; print urlparse('www.cwi.nl:80/%7Eguido/Python.html')"
ParseResult(scheme='www.cwi.nl', netloc='', path='80/%7Eguido/Python.html', params='', query='', fragment='')

which doesn't match.
msg179712 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-01-11 17:54
Hmm, you're right.  The behavior has been like this at least since Python 2.5:

Python 2.5.4 (r254:67916, Dec 16 2012, 20:33:12) 
[GCC 4.6.3] on linux3
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
('www.cwi.nl', '', '80/%7Eguido/Python.html', '', '', '')

The docs refer to RFC 1808.  From a quick glance at the BNF in section 2.2, RFC 1808 allows dots in the scheme, but also allows ":" in the path.  So there seems to be a parsing ambiguity, but see section 2.4.2:

   If the parse string contains a colon ":" after the first character
   and before any characters not allowed as part of a scheme name (i.e.,
   any not an alphanumeric, plus "+", period ".", or hyphen "-"), the
   <scheme> of the URL is the substring of characters up to but not
   including the first colon.  These characters and the colon are then
   removed from the parse string before continuing.

That would indicate that the implementation is correct and the documentation should be fixed. Senthil?
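
The rule quoted from section 2.4.2 can be sketched in a few lines. This is an illustration of the quoted text only, not the actual urlparse implementation; the function name `split_scheme_rfc1808` and the `SCHEME_CHARS` constant are made up for the example:

```python
import string

# Characters the quoted rule allows in a scheme name.
SCHEME_CHARS = set(string.ascii_letters + string.digits + "+.-")


def split_scheme_rfc1808(parse_string):
    """Return (scheme, rest) following the RFC 1808 section 2.4.2 wording."""
    colon = parse_string.find(":")
    # The colon must come after the first character, and everything
    # before it must be a character allowed in a scheme name.
    if colon > 0 and all(ch in SCHEME_CHARS for ch in parse_string[:colon]):
        return parse_string[:colon], parse_string[colon + 1:]
    return "", parse_string


print(split_scheme_rfc1808("www.cwi.nl:80/%7Eguido/Python.html"))
print(split_scheme_rfc1808("//www.cwi.nl:80/%7Eguido/Python.html"))
```

Under this reading, "www.cwi.nl" is taken as the scheme in the first call, while the leading "//" in the second call prevents any scheme from being split off.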
msg183029 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2013-02-26 08:42
I am noticing this one late. Sorry for that.
I agree that this is a docs issue, and I would like to fix it this way.

Give the doc example as:

>>> urlparse('www.cwi.nl/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html', params='', query='', fragment='')

Instead of

>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')

which introduces the tricky ":80" parsing and invokes the rule that Georg pointed out in his message. If I recollect correctly, the point of the example was to show that URLs (following RFC 1808) must start with // for their netloc to be identified; otherwise the text is treated as a path.

A ":" before a port, with no preceding "scheme:", is really tricky for any application to handle, so it is the right thing for the parser to treat anything before the ":" as the scheme. The implementation here is correct.

So, instead of changing the documented result to show the scheme as "www.cwi.nl", which is quite meaningless, the better fix is to change the example to urlparse('www.cwi.nl/%7Eguido/Python.html'), so that the documented result still holds (empty scheme and netloc, with everything in the path).

I am going ahead with the fix. Thanks.
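
The point about the leading "//" can be checked directly. A small sketch using Python 3's `urllib.parse` (the thread itself uses the Python 2 `urlparse` module); the variable names are only illustrative:

```python
from urllib.parse import urlparse

# With a leading "//", the host and port are recognised as the netloc.
with_slashes = urlparse("//www.cwi.nl:80/%7Eguido/Python.html")

# Without it (and without a ":"), the whole string is treated as a path.
without_slashes = urlparse("www.cwi.nl/%7Eguido/Python.html")

print(with_slashes.netloc, with_slashes.path)
print(repr(without_slashes.netloc), without_slashes.path)
```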
msg183035 - Author: Roundup Robot (python-dev) (Python triager) Date: 2013-02-26 09:02
New changeset 33895c474b4d by Senthil Kumaran in branch '2.7':
Fix issue16932: Fix the urlparse example. Remote :port when scheme is not specified to demonstrate correct behavior
http://hg.python.org/cpython/rev/33895c474b4d

New changeset 5442a77b925c by Senthil Kumaran in branch '3.2':
Fix issue16932: Fix the urlparse example. Remote :port when scheme is not specified to demonstrate correct behavior
http://hg.python.org/cpython/rev/5442a77b925c

New changeset 8928205f57f6 by Senthil Kumaran in branch '3.3':
Fix issue16932: Fix the urlparse example. Remote :port when scheme is not specified to demonstrate correct behavior
http://hg.python.org/cpython/rev/8928205f57f6

New changeset 9caad461936e by Senthil Kumaran in branch 'default':
Fix issue16932: Fix the urlparse example. Remote :port when scheme is not specified to demonstrate correct behavior
http://hg.python.org/cpython/rev/9caad461936e
msg183036 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2013-02-26 09:03
I have fixed the docs issue. Thanks for the report and following up.
msg354891 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2019-10-18 13:07
New changeset 5a88d50ff013a64fbdb25b877c87644a9034c969 by Senthil Kumaran (Tim Graham) in branch 'master':
bpo-27657: Fix urlparse() with numeric paths (#661)
https://github.com/python/cpython/commit/5a88d50ff013a64fbdb25b877c87644a9034c969
msg354896 - Author: miss-islington (miss-islington) Date: 2019-10-18 13:24
New changeset 82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f by Miss Islington (bot) in branch '3.7':
bpo-27657: Fix urlparse() with numeric paths (GH-661)
https://github.com/python/cpython/commit/82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f
msg354905 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2019-10-18 15:23
New changeset 0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384 by Senthil Kumaran in branch '3.8':
[3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (#16839)
https://github.com/python/cpython/commit/0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384
History
Date                 User            Action  Args
2019-10-18 15:23:21  orsenthil       set     messages: + msg354905
2019-10-18 13:51:49  orsenthil       set     pull_requests: + pull_request16390
2019-10-18 13:24:32  miss-islington  set     nosy: + miss-islington; messages: + msg354896
2019-10-18 13:07:37  miss-islington  set     pull_requests: + pull_request16384
2019-10-18 13:07:36  orsenthil       set     messages: + msg354891
2017-03-13 17:39:32  Tim.Graham      set     pull_requests: + pull_request542
2013-02-26 09:03:53  orsenthil       set     status: open -> closed; assignee: orsenthil; keywords: - buildbot; messages: + msg183036; type: behavior; resolution: fixed; stage: needs patch -> resolved
2013-02-26 09:02:27  python-dev      set     nosy: + python-dev; messages: + msg183035
2013-02-26 08:42:18  orsenthil       set     resolution: not a bug -> (no value); messages: + msg183029
2013-01-14 17:34:40  asvetlov        set     nosy: + asvetlov
2013-01-11 17:54:19  georg.brandl    set     keywords: + buildbot, - patch; status: closed -> open; messages: + msg179712
2013-01-11 17:35:52  sandro.tosi     set     messages: + msg179709
2013-01-11 16:41:23  georg.brandl    set     status: open -> closed; nosy: + georg.brandl; messages: + msg179702; resolution: not a bug
2013-01-11 11:36:47  sandro.tosi     set     nosy: + orsenthil; messages: + msg179672
2013-01-11 11:29:29  sandro.tosi     create