classification
Title: urlparse.urlparse misparses URLs with query but no path
Type: behavior Stage:
Components: Library (Lib) Versions: Python 2.4, Python 2.5
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: Nosy List: gvanrossum, nagle, vila
Priority: normal Keywords:

Created on 2007-12-16 17:29 by nagle, last changed 2008-01-05 23:35 by gvanrossum. This issue is now closed.

Messages (6)
msg58673 - (view) Author: John Nagle (nagle) Date: 2007-12-16 17:32
urlparse.urlparse will mis-parse URLs which have a "/" after a "?".
>>
>> sa1 = 'http://example.com?blahblah=/foo'
>> sa2 = 'http://example.com?blahblah=foo'
>> print urlparse.urlparse(sa1)
>> ('http', 'example.com?blahblah=', '/foo', '', '', '') # WRONG
>> print urlparse.urlparse(sa2)
>> ('http', 'example.com', '', '', 'blahblah=foo', '') # RIGHT

That's wrong. RFC3896 ("Uniform Resource Identifier (URI): Generic
Syntax"), page 23 says

    "The characters slash ("/") and question mark ("?") may represent data
    within the query component.  Beware that some older, erroneous
    implementations may not handle such data correctly when it is used as
    the base URI for relative references (Section 5.1), apparently
    because they fail to distinguish query data from path data when
    looking for hierarchical separators."

 So "urlparse" is an "older, erroneous implementation".  Looking
 at the code for "urlparse", it references RFC1808 (1995), which
 was a long time ago, three revisions back.
>>
>> Here's the bad code:
>>
>> def _splitnetloc(url, start=0):
>>     for c in '/?#': # the order is important!
>>         delim = url.find(c, start)
>>         if delim >= 0:
>>             break
>>     else:
>>         delim = len(url)
>>     return url[start:delim], url[delim:]
>>
>> That's just wrong.  The domain ends at the first appearance of
>> any character in '/?#', but that code returns the text before the
>> first '/' even if there's an earlier '?'.  A URL/URI doesn't
>> have to have a path, even when it has query parameters. 

OK, here's a fix to "urlparse", replacing _splitnetloc.  I didn't use
a regular expression because "urlparse" doesn't import "re", and I
didn't want to change that.

def _splitnetloc(url, start=0):
    delim = len(url)# position of end of domain part of url, default is end
    for c in '/?#':    # look for delimiters; the order is NOT important   
        wdelim = url.find(c, start)    # find first of this delim
        if wdelim >= 0:            # if found
            delim = min(delim, wdelim)# use earliest delim position
    return url[start:delim], url[delim:]    # return (domain, rest)
msg58761 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-18 20:03
Would you mind submitting a proper patch for Python 2.5 and/or 2.6
generated by "svn diff" relative to an (anonymous) checkout, and adding
a unit test?  Then I'd be happy to accept this and if it makes it in
time for the 2.5.2 release we'll fix it there.
msg58842 - (view) Author: John Nagle (nagle) Date: 2007-12-19 23:44
I tried downloading the latest rev of urlparse.py (59480) and it flunked
its own unit test, "urlparse.test()"  Two test cases fail. So I don't
want to try to fix the module until the last people to change it fix
their unit test problems. 
The fix I provided should fix the problem I reported, but I'm not sure
if there's anything else wrong, since it flunks its unit test.
msg58846 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-19 23:57
> I tried downloading the latest rev of urlparse.py (59480) and it flunked
> its own unit test, "urlparse.test()"  Two test cases fail.

That's not the official test -- that code should probably be deleted. 
The real test is in Lib/test/test_urlparse.py.  Please ignore that test.
msg59337 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-01-05 22:19
I checked in your change and made up a test.

Committed revision 59758.

Thanks!
msg59339 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-01-05 23:35
Backport to 2.5.2:
Committed revision 59760.
History
Date User Action Args
2008-01-05 23:35:12gvanrossumsetmessages: + msg59339
2008-01-05 22:19:22gvanrossumsetstatus: open -> closed
resolution: accepted
messages: + msg59337
2008-01-05 11:46:56vilasetnosy: + vila
2007-12-19 23:57:26gvanrossumsetmessages: + msg58846
2007-12-19 23:44:43naglesetmessages: + msg58842
2007-12-18 20:03:48gvanrossumsetpriority: normal
nosy: + gvanrossum
messages: + msg58761
2007-12-16 17:32:54naglesetmessages: + msg58673
2007-12-16 17:29:43naglecreate