urlparse doesn't handle host?bla #36493

Closed
msdemlei mannequin opened this issue Apr 24, 2002 · 8 comments
Labels
stdlib Python modules in the Lib dir

Comments

@msdemlei
Mannequin

msdemlei mannequin commented Apr 24, 2002

BPO 548176
Nosy @pfmoore

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2005-01-09.15:33:01.000>
created_at = <Date 2002-04-24.15:36:23.000>
labels = ['library']
title = "urlparse doesn't handle host?bla"
updated_at = <Date 2005-01-09.15:33:01.000>
user = 'https://bugs.python.org/msdemlei'

bugs.python.org fields:

activity = <Date 2005-01-09.15:33:01.000>
actor = 'jlgijsbers'
assignee = 'none'
closed = True
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2002-04-24.15:36:23.000>
creator = 'msdemlei'
dependencies = []
files = []
hgrepos = []
issue_num = 548176
keywords = []
message_count = 8.0
messages = ['10499', '10500', '10501', '10502', '10503', '10504', '10505', '10506']
nosy_count = 6.0
nosy_names = ['jepler', 'paul.moore', 'jlgijsbers', 'msdemlei', 'staschuk', 'mrovner']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue548176'
versions = ['Python 2.4']

@msdemlei
Mannequin Author

msdemlei mannequin commented Apr 24, 2002

The urlparse module (at least in 2.2 and 2.1, Linux) doesn't
handle URLs of the form
http://www.maerkischeallgemeine.de?loc_id=49 correctly
-- everything up to the 9 ends up in the host. I
didn't check the RFC, but in the real world URLs like
this do show up. urlparse works fine when there's a
trailing slash on the host name:
http://www.maerkischeallgemeine.de/?loc_id=49

Example:
<pre>
>>> import urlparse
>>> urlparse.urlparse("http://www.maerkischeallgemeine.de/?loc_id=49")
('http', 'www.maerkischeallgemeine.de', '/', '', 'loc_id=49', '')
>>> urlparse.urlparse("http://www.maerkischeallgemeine.de?loc_id=49")
('http', 'www.maerkischeallgemeine.de?loc_id=49', '', '', '', '')
</pre>

This has serious implications for urllib, since
urllib.urlopen will fail for URLs like the second one,
and with a pretty mysterious exception ("host not
found") at that.
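
For readers on a current interpreter, the report can be re-checked with the short sketch below. It assumes Python 3, where the `urlparse` module was renamed to `urllib.parse`; the variable names are only illustrative. On such an interpreter both spellings of the URL put the query in the query field rather than in the host.

```python
from urllib.parse import urlparse

# The two URLs from the report: with and without a slash after the host.
with_slash = urlparse("http://www.maerkischeallgemeine.de/?loc_id=49")
without_slash = urlparse("http://www.maerkischeallgemeine.de?loc_id=49")

# Fields are: scheme, netloc, path, params, query, fragment.
print(tuple(with_slash))
print(tuple(without_slash))

# The host no longer swallows the query string.
print(without_slash.netloc)  # www.maerkischeallgemeine.de
print(without_slash.query)   # loc_id=49
```

The only remaining difference between the two results is the path field, which is '/' in the first case and empty in the second.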

@msdemlei msdemlei mannequin closed this as completed Apr 24, 2002
@msdemlei msdemlei mannequin added stdlib Python modules in the Lib dir labels Apr 24, 2002
@jepler
Mannequin

jepler mannequin commented Nov 17, 2002

Logged In: YES
user_id=2772

This actually appears to be permitted by RFC2396
[http://www.ietf.org/rfc/rfc2396.txt]. See section 3.2:

3.2. Authority Component

Many URI schemes include a top hierarchical element for a
naming authority, such that the namespace defined by the
remainder of the URI is governed by that authority. This
authority component is typically defined by an
Internet-based server or a scheme-specific registry of
naming authorities.

      authority     = server | reg_name

The authority component is preceded by a double slash
"//" and is terminated by the next slash "/", question-mark
"?", or by the end of the URI. Within the authority
component, the characters ";", ":", "@", "?", and "/" are
reserved.
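
The termination rule quoted above can be sketched as a small helper. This is only an illustration: `split_authority` is a hypothetical name, not a function of the urlparse module, and it handles just the two delimiters named in the quoted RFC 2396 text.

```python
def split_authority(after_slashes):
    """Split the text that follows '//' into (authority, remainder).

    Hypothetical helper: the authority ends at the first '/' or '?',
    or at the end of the string if neither is present.
    """
    end = len(after_slashes)
    for delim in "/?":
        pos = after_slashes.find(delim)
        if pos != -1:
            end = min(end, pos)
    return after_slashes[:end], after_slashes[end:]


print(split_authority("www.maerkischeallgemeine.de?loc_id=49"))
print(split_authority("www.maerkischeallgemeine.de/?loc_id=49"))
print(split_authority("www.maerkischeallgemeine.de"))
```

Under this rule the authority is the bare host name in all three cases, which is what the original report expected.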

@staschuk
Mannequin

staschuk mannequin commented Mar 30, 2003

Logged In: YES
user_id=666873

For comparison, RFC 1738 section 3.3:

    An HTTP URL takes the form:
        http://<host>:<port>/<path>?<searchpart>
    [...] If neither <path> nor <searchpart> is present,
    the "/" may also be omitted.

... which does not outright say the '/' may *not* be omitted if
<path> is absent but <searchpart> is present (though imho
that's implied).

But even if the / may not be omitted in this case, ? is not
allowed in the authority component under either RFC 2396 or
RFC 1738, so urlparse should either treat it as a delimiter or
reject the URL as malformed. The principle of being lenient in
what you accept favours the former.
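
The two options named above, treating '?' as a delimiter or rejecting the URL, might look roughly like this. Both function names are hypothetical, and the sketch assumes Python 3's `urllib.parse.urlsplit` for the lenient variant; it is not the code of patch 712317.

```python
from urllib.parse import urlsplit


def netloc_strict(url):
    """Hypothetical 'reject' policy: a '?' before the first '/' is an error."""
    _scheme, _sep, rest = url.partition("://")
    authority = rest.split("/", 1)[0]
    if "?" in authority:
        raise ValueError("'?' is not allowed in the authority: %r" % authority)
    return authority


def netloc_lenient(url):
    """Hypothetical 'delimiter' policy: '?' simply ends the authority."""
    return urlsplit(url).netloc


print(netloc_lenient("http://foo?bar=baz"))
try:
    netloc_strict("http://foo?bar=baz")
except ValueError as exc:
    print("rejected:", exc)
```

The lenient variant accepts every URL the strict one accepts and additionally copes with the form that shows up in the real world.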

I've just submitted a patch (712317) for this.

@mrovner
Mannequin

mrovner mannequin commented Jan 27, 2004

Logged In: YES
user_id=162094

According to RFC2396 (ftp://ftp.isi.edu/in-notes/rfc2396.txt)
absoluteURI (part 3 URI Syntactic Components) can be:
"""
<scheme>://<authority><path>?<query>
each of which, except <scheme>, may be absent from a
particular URI.
"""
Later on (3.2):
"""
The authority component is preceded by a double slash "//"
and is terminated by the next slash "/", question-mark "?",
or by the end of the URI.
"""
So the URL "http://server?query" is perfectly legal and should be
allowed, and patch 712317 should be rejected.

@jlgijsbers
Mannequin

jlgijsbers mannequin commented Oct 23, 2004

Logged In: YES
user_id=469548

Somehow I think I'm missing something. Please check my line
of reasoning:

  1. http://foo?bar=baz is a legal URL.
  2. urlparse's 'Network location' should be the same as
    <authority> from rfc2396.
  3. Inside <authority> an unescaped '?' is not allowed.
    Rather: <authority> is terminated by the '?'.
  4. Currently the 'network location' for http://foo?bar=baz
    would be 'foo?bar=baz'.
  5. If 'network location' should be the same as <authority>,
    it should also be terminated by the '?'.

So shouldn't urlparse.urlsplit('http://foo?bar=baz') return
('http', 'foo', '', '', 'bar=baz', ''), as patch 712317
implements?
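
The expected result above can be compared with a current interpreter using the sketch below, which assumes Python 3 and `urllib.parse` as the successor of the `urlparse` module. One detail worth noting: the six-field tuple corresponds to `urlparse`, while `urlsplit` has no params field and returns five fields.

```python
from urllib.parse import urlparse, urlsplit

url = "http://foo?bar=baz"

# urlparse: scheme, netloc, path, params, query, fragment
six = tuple(urlparse(url))
# urlsplit: scheme, netloc, path, query, fragment (no params field)
five = tuple(urlsplit(url))

print(six)
print(five)
```

In both results the network location is 'foo' and the query is 'bar=baz', in line with steps 3 and 5 of the reasoning.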

@mrovner
Mannequin

mrovner mannequin commented Oct 23, 2004

Logged In: YES
user_id=162094

I'm sorry, I misunderstood the patch. If it accepts such a URL
and splits it at '?', it's perfectly fine.
It should not reject such a URL as malformed.

@pfmoore
Member

pfmoore commented Nov 8, 2004

Logged In: YES
user_id=113328

This issue still exists in Python 2.3.4 and Python 2.4b2.

@jlgijsbers
Mannequin

jlgijsbers mannequin commented Jan 9, 2005

Logged In: YES
user_id=469548

Fixed by applying patch bpo-712317 on maint24 and HEAD.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022