urlparse doesn't handle host?bla #36493

msdemlei · 2002-04-24T15:36:23Z

BPO	548176
Nosy	@pfmoore

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2005-01-09.15:33:01.000>
created_at = <Date 2002-04-24.15:36:23.000>
labels = ['library']
title = "urlparse doesn't handle host?bla"
updated_at = <Date 2005-01-09.15:33:01.000>
user = 'https://bugs.python.org/msdemlei'

bugs.python.org fields:

activity = <Date 2005-01-09.15:33:01.000>
actor = 'jlgijsbers'
assignee = 'none'
closed = True
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2002-04-24.15:36:23.000>
creator = 'msdemlei'
dependencies = []
files = []
hgrepos = []
issue_num = 548176
keywords = []
message_count = 8.0
messages = ['10499', '10500', '10501', '10502', '10503', '10504', '10505', '10506']
nosy_count = 6.0
nosy_names = ['jepler', 'paul.moore', 'jlgijsbers', 'msdemlei', 'staschuk', 'mrovner']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue548176'
versions = ['Python 2.4']

msdemlei · 2002-04-24T15:36:23Z

The urlparse module (at least in 2.2 and 2.1, Linux) doesn't handle URLs of the form
http://www.maerkischeallgemeine.de?loc_id=49 correctly: everything through the final 9,
query string included, ends up in the host. I didn't check the RFC, but in the real
world URLs like this do show up. urlparse works fine when there's a trailing slash
after the host name:
http://www.maerkischeallgemeine.de/?loc_id=49

Example:
<pre>
>>> import urlparse
>>> urlparse.urlparse("http://www.maerkischeallgemeine.de/?loc_id=49")
('http', 'www.maerkischeallgemeine.de', '/', '', 'loc_id=49', '')
>>> urlparse.urlparse("http://www.maerkischeallgemeine.de?loc_id=49")
('http', 'www.maerkischeallgemeine.de?loc_id=49', '', '', '', '')
</pre>

This has serious implications for urllib, since
urllib.urlopen will fail for URLs like the second one,
and with a pretty mysterious exception ("host not
found") at that.
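For readers on a current interpreter, the same comparison can be sketched as below. This assumes Python 3, where the old `urlparse` module lives on as `urllib.parse`; it is an illustration of the behaviour requested in this report, not the code from the 2.x module discussed here.

```python
# Python 3 sketch: urllib.parse is the successor of the Python 2 urlparse module.
from urllib.parse import urlparse

with_slash = urlparse("http://www.maerkischeallgemeine.de/?loc_id=49")
without_slash = urlparse("http://www.maerkischeallgemeine.de?loc_id=49")

# Both forms should yield the same host and query; only the path differs.
print(with_slash.netloc, repr(with_slash.path), with_slash.query)
print(without_slash.netloc, repr(without_slash.path), without_slash.query)
```

With the "?" treated as a delimiter, the host no longer swallows the query string, so a lookup of the host name alone can succeed.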

jepler · 2002-11-17T16:56:22Z

Logged In: YES
user_id=2772

This actually appears to be permitted by RFC2396
[http://www.ietf.org/rfc/rfc2396.txt]. See section 3.2:

3.2. Authority Component

Many URI schemes include a top hierarchical element for a
naming authority, such that the namespace defined by the
remainder of the URI is governed by that authority. This
authority component is typically defined by an
Internet-based server or a scheme-specific registry of
naming authorities.

      authority     = server | reg_name

The authority component is preceded by a double slash
"//" and is terminated by the next slash "/", question-mark
"?", or by the end of the URI. Within the authority
component, the characters ";", ":", "@", "?", and "/" are
reserved.
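The termination rule quoted above can be sketched in a few lines. The helper name `split_authority` is hypothetical and this is not the code from urlparse or from any patch; it only illustrates "terminated by the next slash, question-mark, or by the end of the URI".

```python
def split_authority(url):
    """Return (authority, rest) for a URL containing '//', else (None, url).

    Hypothetical helper illustrating RFC 2396 section 3.2 only.
    """
    marker = url.find("//")
    if marker == -1:
        return None, url
    start = marker + 2
    # The authority ends at the first "/" or "?", or at the end of the URI.
    end = len(url)
    for delimiter in "/?":
        pos = url.find(delimiter, start)
        if pos != -1:
            end = min(end, pos)
    return url[start:end], url[end:]


print(split_authority("http://www.maerkischeallgemeine.de?loc_id=49"))
print(split_authority("http://www.maerkischeallgemeine.de/?loc_id=49"))
```

Scanning for both delimiters and taking the earliest match is what keeps the "?" out of the authority when no "/" precedes it.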

staschuk · 2003-03-30T20:19:43Z

Logged In: YES
user_id=666873

For comparison, RFC 1738 section 3.3:
An HTTP URL takes the form:
http://<host>:<port>/<path>?<searchpart>
[...] If neither <path> nor <searchpart> is present,
the "/" may also be omitted.
... which does not outright say the '/' may *not* be omitted if
<path> is absent but <searchpart> is present (though imho
that's implied).

But even if the / may not be omitted in this case, ? is not
allowed in the authority component under either RFC 2396 or
RFC 1738, so urlparse should either treat it as a delimiter or
reject the URL as malformed. The principle of being lenient in
what you accept favours the former.

I've just submitted a patch (712317) for this.

mrovner · 2004-01-27T01:13:02Z

Logged In: YES
user_id=162094

According to RFC2396 (ftp://ftp.isi.edu/in-notes/rfc2396.txt)
absoluteURI (part 3 URI Syntactic Components) can be:
"""
<scheme>://<authority><path>?<query>
each of which, except <scheme>, may be absent from a
particular URI.
"""
Later on (3.2):
"""
The authority component is preceded by a double slash "//"
and is terminated by the next slash "/", question-mark "?",
or by the end of the URI.
"""
So URL "http://server?query" is perfectly legal and shall be
allowed and patch 712317 rejected.

jlgijsbers · 2004-10-23T07:03:03Z

Logged In: YES
user_id=469548

Somehow I think I'm missing something. Please check my line
of reasoning:

http://foo?bar=baz is a legal URL.
urlparse's 'Network location' should be the same as
<authority> from rfc2396.
Inside <authority> an unescaped '?' is not allowed.
Rather: <authority> is terminated by the '?'.
Currently the 'network location' for http://foo?bar=baz
would be 'foo?bar=baz'.
If 'network location' should be the same as <authority>,
it should also be terminated by the '?'.

So shouldn't urlparse.urlsplit('http://foo?bar=baz') return
('http', 'foo', '', '', 'bar=baz', ''), as patch 712317
implements?
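As a quick check of that expectation, here is a sketch assuming Python 3's `urllib.parse` (the successor of the `urlparse` module). Note that `urlsplit` returns five fields because it does not split out params, while `urlparse` returns the six-field tuple shown above.

```python
from urllib.parse import urlparse, urlsplit

url = "http://foo?bar=baz"

# urlsplit: (scheme, netloc, path, query, fragment)
print(tuple(urlsplit(url)))  # ('http', 'foo', '', 'bar=baz', '')

# urlparse: (scheme, netloc, path, params, query, fragment)
print(tuple(urlparse(url)))  # ('http', 'foo', '', '', 'bar=baz', '')
```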

mrovner · 2004-10-23T07:44:50Z

Logged In: YES
user_id=162094

I'm sorry, I misunderstood the patch. If it accepts such a URL
and splits it at '?', it's perfectly fine.
It shall not reject such a URL as malformed.

pfmoore · 2004-11-08T20:48:58Z

Logged In: YES
user_id=113328

This issue still exists in Python 2.3.4 and Python 2.4b2.

jlgijsbers · 2005-01-09T15:33:01Z

Logged In: YES
user_id=469548

Fixed by applying patch bpo-712317 on maint24 and HEAD.

msdemlei mannequin closed this as completed Apr 24, 2002

msdemlei mannequin added stdlib Python modules in the Lib dir labels Apr 24, 2002

ezio-melotti transferred this issue from another repository Apr 9, 2022
