urlparse fails at parsing "www.python.org:80/" #61136

sandrotosi · 2013-01-11T11:29:29Z

BPO	16932
Nosy	@birkenfeld, @orsenthil, @asvetlov, @sandrotosi, @miss-islington
PRs	bpo-27657: Fix urlparse() with numeric paths #661 [3.7] bpo-27657: Fix urlparse() with numeric paths (GH-661) #16837 [3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) #16839
Files	urlparse.diff

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/orsenthil'
closed_at = <Date 2013-02-26.09:03:53.168>
created_at = <Date 2013-01-11.11:29:28.972>
labels = ['type-bug', 'library']
title = 'urlparse fails at parsing "www.python.org:80/"'
updated_at = <Date 2019-10-18.15:23:21.811>
user = 'https://github.com/sandrotosi'

bugs.python.org fields:

activity = <Date 2019-10-18.15:23:21.811>
actor = 'orsenthil'
assignee = 'orsenthil'
closed = True
closed_date = <Date 2013-02-26.09:03:53.168>
closer = 'orsenthil'
components = ['Library (Lib)']
creation = <Date 2013-01-11.11:29:28.972>
creator = 'sandro.tosi'
dependencies = []
files = ['28692']
hgrepos = []
issue_num = 16932
keywords = []
message_count = 11.0
messages = ['179670', '179672', '179702', '179709', '179712', '183029', '183035', '183036', '354891', '354896', '354905']
nosy_count = 6.0
nosy_names = ['georg.brandl', 'orsenthil', 'asvetlov', 'sandro.tosi', 'python-dev', 'miss-islington']
pr_nums = ['661', '16837', '16839']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue16932'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']

sandrotosi · 2013-01-11T11:29:28Z

Hello,
as reported at http://mail.python.org/pipermail/docs/2013-January/012375.html, urlparse fails to parse URLs without a scheme and with a URL path, contrary to what's documented at http://docs.python.org/2/library/urlparse.html?highlight=urlparse#urlparse :

./python -c "from urlparse import urlparse ; print(urlparse('python.org:80/'))"
ParseResult(scheme='python.org', netloc='', path='80/', params='', query='', fragment='')

(that is for 2.7, but the same happens on all the 3.x active branches).

I'm attaching a test to expose this failure.

sandrotosi · 2013-01-11T11:36:47Z

Adding Senthil as per expert list

birkenfeld · 2013-01-11T16:41:24Z

This is not a bug: urlparse is there to parse URLs, and URLs start with a URL scheme such as "http:".

There is no way for a generic URL parser to know that "python.org:80/" is supposed to be "http://python.org:80/".
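
Since the parser cannot guess the scheme, a caller that knows its input is a bare host[:port]/path has to supply one itself. A minimal sketch, assuming Python 3's urllib.parse; the helper name parse_with_default_scheme and the "http" default are hypothetical, not part of the library, and the "//" check is only a rough heuristic:

```python
from urllib.parse import urlparse


def parse_with_default_scheme(url, default_scheme="http"):
    """Hypothetical helper: assume scheme-less input is host[:port]/path."""
    # Rough heuristic (an assumption): input without "//" carries no
    # authority marker, so prepend a default scheme before parsing.
    if "//" not in url:
        url = f"{default_scheme}://{url}"
    return urlparse(url)


result = parse_with_default_scheme("python.org:80/")
print(result.scheme, result.hostname, result.port, result.path)
```

This only makes sense when the application already knows what kind of input it receives; it would mangle URLs such as mailto: addresses, which have a scheme but no "//".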

sandrotosi · 2013-01-11T17:35:53Z

The documentation reports this example:

>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html',
           params='', query='', fragment='')

but when executing it returns:

$ ./python -V
Python 2.7.3+
$ ./python -c "from urlparse import urlparse ; print urlparse('www.cwi.nl:80/%7Eguido/Python.html')"
ParseResult(scheme='www.cwi.nl', netloc='', path='80/%7Eguido/Python.html', params='', query='', fragment='')

which doesn't match.

birkenfeld · 2013-01-11T17:54:18Z

Hmm, you're right. The behavior has been like this at least since Python 2.5:

Python 2.5.4 (r254:67916, Dec 16 2012, 20:33:12) 
[GCC 4.6.3] on linux3
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
('www.cwi.nl', '', '80/%7Eguido/Python.html', '', '', '')

The docs refer to RFC 1808. From a quick glance at the BNF in section 2.2, RFC 1808 allows dots in the scheme, but also allows ":" in the path. So there seems to be a parsing ambiguity, but see section 2.4.2:

If the parse string contains a colon ":" after the first character
and before any characters not allowed as part of a scheme name (i.e.,
any not an alphanumeric, plus "+", period ".", or hyphen "-"), the
<scheme> of the URL is the substring of characters up to but not
including the first colon. These characters and the colon are then
removed from the parse string before continuing.

That would indicate that the implementation is correct and the documentation should be fixed. Senthil?
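
The rule quoted from section 2.4.2 can be sketched in a few lines. This is an illustration of the quoted text only, not the actual urlparse implementation; the name split_scheme is hypothetical:

```python
import string

# Characters the quoted rule allows in a scheme name.
SCHEME_CHARS = set(string.ascii_letters + string.digits + "+.-")


def split_scheme(url):
    """Return (scheme, rest) following the RFC 1808 section 2.4.2 rule."""
    colon = url.find(":")
    # The colon must come after the first character, and every character
    # before it must be allowed in a scheme name.
    if colon > 0 and all(c in SCHEME_CHARS for c in url[:colon]):
        return url[:colon], url[colon + 1:]
    return "", url


print(split_scheme("www.cwi.nl:80/%7Eguido/Python.html"))
print(split_scheme("www.cwi.nl/%7Eguido/Python.html"))
```

Under this rule "www.cwi.nl" qualifies as a scheme in the first call because dots are allowed, while the second call has no colon and so yields an empty scheme.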

orsenthil · 2013-02-26T08:42:18Z

I am noticing this one late. Sorry for that.
I agree that this is a docs issue, and I would like to fix it this way.

Give the doc example as:

>>> urlparse('www.cwi.nl/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html', params='', query='', fragment='')

Instead of

>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')

The latter introduces the tricky ":80" and triggers the rule that Georg pointed out in his message. If I recall correctly, the point of the example was to show that URLs (following RFC 1808) must start with // for their netloc to be identified; otherwise the text is treated as a path.

A ":" before a port, with no "scheme:" in front of it, is really tricky for any application, so it is the right thing for the parser to identify anything before the ":" as the scheme, and the implementation here is correct.

So, instead of changing the example output to show the scheme as "www.cwi.nl", which is quite meaningless, the better fix is to change the example input to urlparse('www.cwi.nl/%7Eguido/Python.html'), so that the shape of the documented result (empty scheme and netloc, everything in path) stays the same.
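
To illustrate the point about the leading //, here is a short sketch using Python 3's urllib.parse (the thread itself uses the Python 2 urlparse module; the variable names are only for illustration):

```python
from urllib.parse import urlparse

# With a leading "//", the host and port are recognised as the netloc.
with_slashes = urlparse("//www.cwi.nl:80/%7Eguido/Python.html")
print(with_slashes.netloc, with_slashes.path)

# Without it (and without a colon), the whole string is treated as a path.
without_slashes = urlparse("www.cwi.nl/%7Eguido/Python.html")
print(repr(without_slashes.netloc), without_slashes.path)
```

In the first call the netloc is 'www.cwi.nl:80' and the path is '/%7Eguido/Python.html'; in the second the netloc is empty and the path holds the entire input.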

I am going ahead with the fix. Thanks.

python-dev · 2013-02-26T09:02:28Z

New changeset 33895c474b4d by Senthil Kumaran in branch '2.7':
Fix bpo-16932: Fix the urlparse example. Remote :port when scheme is not specified to demonstrate correct behavior
http://hg.python.org/cpython/rev/33895c474b4d

New changeset 5442a77b925c by Senthil Kumaran in branch '3.2':
Fix bpo-16932: Fix the urlparse example. Remote :port when scheme is not specified to demonstrate correct behavior
http://hg.python.org/cpython/rev/5442a77b925c

New changeset 8928205f57f6 by Senthil Kumaran in branch '3.3':
Fix bpo-16932: Fix the urlparse example. Remote :port when scheme is not specified to demonstrate correct behavior
http://hg.python.org/cpython/rev/8928205f57f6

New changeset 9caad461936e by Senthil Kumaran in branch 'default':
Fix bpo-16932: Fix the urlparse example. Remote :port when scheme is not specified to demonstrate correct behavior
http://hg.python.org/cpython/rev/9caad461936e

orsenthil · 2013-02-26T09:03:53Z

I have fixed the docs issue. Thanks for the report and following up.

orsenthil · 2019-10-18T13:07:37Z

New changeset 5a88d50 by Senthil Kumaran (Tim Graham) in branch 'master':
bpo-27657: Fix urlparse() with numeric paths (#661)
5a88d50

miss-islington · 2019-10-18T13:24:32Z

New changeset 82b5f6b by Miss Islington (bot) in branch '3.7':
bpo-27657: Fix urlparse() with numeric paths (GH-661)
82b5f6b

orsenthil · 2019-10-18T15:23:22Z

New changeset 0f3187c by Senthil Kumaran in branch '3.8':
[3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (GH-16839)
0f3187c

sandrotosi added the stdlib Python modules in the Lib dir label Jan 11, 2013

birkenfeld closed this as completed Jan 11, 2013

birkenfeld added the invalid label Jan 11, 2013

birkenfeld reopened this Jan 11, 2013

orsenthil removed the invalid label Feb 26, 2013

orsenthil closed this as completed Feb 26, 2013

orsenthil self-assigned this Feb 26, 2013

orsenthil added the type-bug An unexpected behavior, bug, or error label Feb 26, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022
