urlparse.urlparse misparses URLs with query but no path #45978

nagle · 2007-12-16T17:29:44Z

BPO	1637
Nosy	@gvanrossum

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2008-01-05.22:19:22.287>
created_at = <Date 2007-12-16.17:29:43.556>
labels = ['type-bug', 'library']
title = 'urlparse.urlparse misparses URLs with query but no path'
updated_at = <Date 2008-01-05.23:35:12.431>
user = 'https://bugs.python.org/nagle'

bugs.python.org fields:

activity = <Date 2008-01-05.23:35:12.431>
actor = 'gvanrossum'
assignee = 'none'
closed = True
closed_date = <Date 2008-01-05.22:19:22.287>
closer = 'gvanrossum'
components = ['Library (Lib)']
creation = <Date 2007-12-16.17:29:43.556>
creator = 'nagle'
dependencies = []
files = []
hgrepos = []
issue_num = 1637
keywords = []
message_count = 6.0
messages = ['58673', '58761', '58842', '58846', '59337', '59339']
nosy_count = 3.0
nosy_names = ['gvanrossum', 'nagle', 'vila']
pr_nums = []
priority = 'normal'
resolution = 'accepted'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue1637'
versions = ['Python 2.5', 'Python 2.4']

nagle · 2007-12-16T17:32:53Z

urlparse.urlparse will mis-parse URLs which have a "/" after a "?".

>
> sa1 = 'http://example.com?blahblah=/foo'
> sa2 = 'http://example.com?blahblah=foo'
> print urlparse.urlparse(sa1)
> ('http', 'example.com?blahblah=', '/foo', '', '', '') # WRONG
> print urlparse.urlparse(sa2)
> ('http', 'example.com', '', '', 'blahblah=foo', '') # RIGHT

That's wrong. RFC3896 ("Uniform Resource Identifier (URI): Generic
Syntax"), page 23 says

"The characters slash ("/") and question mark ("?") may represent data
within the query component.  Beware that some older, erroneous
implementations may not handle such data correctly when it is used as
the base URI for relative references (Section 5.1), apparently
because they fail to distinguish query data from path data when
looking for hierarchical separators."

So "urlparse" is an "older, erroneous implementation". Looking
at the code for "urlparse", it references RFC1808 (1995), which
was a long time ago, three revisions back.

>
> Here's the bad code:
>
> def _splitnetloc(url, start=0):
> for c in '/?#': # the order is important!
> delim = url.find(c, start)
> if delim >= 0:
> break
> else:
> delim = len(url)
> return url[start:delim], url[delim:]
>
> That's just wrong. The domain ends at the first appearance of
> any character in '/?#', but that code returns the text before the
> first '/' even if there's an earlier '?'. A URL/URI doesn't
> have to have a path, even when it has query parameters.

OK, here's a fix to "urlparse", replacing _splitnetloc. I didn't use
a regular expression because "urlparse" doesn't import "re", and I
didn't want to change that.

def _splitnetloc(url, start=0):
    delim = len(url)# position of end of domain part of url, default is end
    for c in '/?#':    # look for delimiters; the order is NOT important   
        wdelim = url.find(c, start)    # find first of this delim
        if wdelim >= 0:            # if found
            delim = min(delim, wdelim)# use earliest delim position
    return url[start:delim], url[delim:]    # return (domain, rest)

gvanrossum · 2007-12-18T20:03:48Z

Would you mind submitting a proper patch for Python 2.5 and/or 2.6
generated by "svn diff" relative to an (anonymous) checkout, and adding
a unit test? Then I'd be happy to accept this and if it makes it in
time for the 2.5.2 release we'll fix it there.

nagle · 2007-12-19T23:44:43Z

I tried downloading the latest rev of urlparse.py (59480) and it flunked
its own unit test, "urlparse.test()" Two test cases fail. So I don't
want to try to fix the module until the last people to change it fix
their unit test problems.
The fix I provided should fix the problem I reported, but I'm not sure
if there's anything else wrong, since it flunks its unit test.

gvanrossum · 2007-12-19T23:57:26Z

I tried downloading the latest rev of urlparse.py (59480) and it flunked
its own unit test, "urlparse.test()" Two test cases fail.

That's not the official test -- that code should probably be deleted.
The real test is in Lib/test/test_urlparse.py. Please ignore that test.

gvanrossum · 2008-01-05T22:19:22Z

I checked in your change and made up a test.

Committed revision 59758.

Thanks!

gvanrossum · 2008-01-05T23:35:12Z

Backport to 2.5.2:
Committed revision 59760.

nagle mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Dec 16, 2007

gvanrossum closed this as completed Jan 5, 2008

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

urlparse.urlparse misparses URLs with query but no path #45978

urlparse.urlparse misparses URLs with query but no path #45978

nagle mannequin commented Dec 16, 2007

nagle mannequin commented Dec 16, 2007

gvanrossum commented Dec 18, 2007

nagle mannequin commented Dec 19, 2007

gvanrossum commented Dec 19, 2007

gvanrossum commented Jan 5, 2008

gvanrossum commented Jan 5, 2008

urlparse.urlparse misparses URLs with query but no path #45978

urlparse.urlparse misparses URLs with query but no path #45978

Comments

nagle mannequin commented Dec 16, 2007

nagle mannequin commented Dec 16, 2007

gvanrossum commented Dec 18, 2007

nagle mannequin commented Dec 19, 2007

gvanrossum commented Dec 19, 2007

gvanrossum commented Jan 5, 2008

gvanrossum commented Jan 5, 2008