Message 58673 - Python tracker

Author	nagle
Recipients	nagle
Date	2007-12-16.17:32:52
SpamBayes Score	0.002769292
Marked as misclassified	No
Message-id	<1197826374.62.0.356704727392.issue1637@psf.upfronthosting.co.za>
In-reply-to

Content

urlparse.urlparse will mis-parse URLs which have a "/" after a "?":

sa1 = 'http://example.com?blahblah=/foo'
sa2 = 'http://example.com?blahblah=foo'
print urlparse.urlparse(sa1)
('http', 'example.com?blahblah=', '/foo', '', '', '') # WRONG
print urlparse.urlparse(sa2)
('http', 'example.com', '', '', 'blahblah=foo', '') # RIGHT

The first result is wrong. RFC 3986 ("Uniform Resource Identifier (URI):
Generic Syntax"), page 23, says

    "The characters slash ("/") and question mark ("?") may represent data
    within the query component.  Beware that some older, erroneous
    implementations may not handle such data correctly when it is used as
    the base URI for relative references (Section 5.1), apparently
    because they fail to distinguish query data from path data when
    looking for hierarchical separators."

So "urlparse" is one of those "older, erroneous implementations".  Its
source references RFC 1808 (1995), which has since been superseded,
first by RFC 2396 and then by RFC 3986.
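
As a cross-check on what the RFC expects, the example URL can be run through the reference regular expression given in RFC 3986, Appendix B. This is only a sketch for comparison, not a proposed change to "urlparse"; the helper name rfc3986_split is made up for this illustration.

```python
import re

# Reference regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def rfc3986_split(uri):
    """Hypothetical helper: return (scheme, authority, path, query, fragment)."""
    m = URI_RE.match(uri)
    # Groups 2, 4, 5, 7 and 9 hold the five components; absent ones are None.
    return tuple(g or '' for g in m.group(2, 4, 5, 7, 9))

print(rfc3986_split('http://example.com?blahblah=/foo'))
# ('http', 'example.com', '', 'blahblah=/foo', '')
```

The authority stops at the "?", and the "/" ends up inside the query, which is the result urlparse should also produce.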

Here's the bad code:

def _splitnetloc(url, start=0):
    for c in '/?#': # the order is important!
        delim = url.find(c, start)
        if delim >= 0:
            break
    else:
        delim = len(url)
    return url[start:delim], url[delim:]

That's just wrong.  The domain ends at the first appearance of
any character in '/?#'.  But the loop tries '/' first and stops as
soon as any '/' is found, so it returns the text before the first '/'
even if there's an earlier '?'.  A URL/URI doesn't have to have a
path, even when it has query parameters.
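
To see the failure in isolation, the function can be copied out and run on its own. This is a sketch: the name old_splitnetloc and the start offset computed from len('http://') are assumptions made for the demonstration, not necessarily how urlparse calls the function internally.

```python
def old_splitnetloc(url, start=0):
    """Copy of the original logic, renamed for this demonstration."""
    for c in '/?#':  # '/' is tried first, whatever its position
        delim = url.find(c, start)
        if delim >= 0:
            break  # stops at the first delimiter *kind* found
    else:
        delim = len(url)
    return url[start:delim], url[delim:]

url = 'http://example.com?blahblah=/foo'
start = len('http://')  # assumed offset: skip past the scheme and "//"

print(old_splitnetloc(url, start))
# ('example.com?blahblah=', '/foo')
```

The "?" at an earlier position is never considered, because the search for "/" already succeeded.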

OK, here's a fix to "urlparse", replacing _splitnetloc.  I didn't use
a regular expression because "urlparse" doesn't import "re", and I
didn't want to change that.

def _splitnetloc(url, start=0):
    delim = len(url)  # end of the domain part of url; default is end of string
    for c in '/?#':  # look for delimiters; the order is NOT important
        wdelim = url.find(c, start)  # find first of this delim
        if wdelim >= 0:  # if found
            delim = min(delim, wdelim)  # use earliest delim position
    return url[start:delim], url[delim:]  # return (domain, rest)
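
A quick way to check the replacement is to run it over a few URLs that mix the three delimiters. The sketch below repeats the function under the made-up name fixed_splitnetloc so it runs standalone; the test URLs other than the first, and the start offset of len('http://'), are assumptions chosen for the demonstration.

```python
def fixed_splitnetloc(url, start=0):
    """Copy of the proposed fix, renamed so the example is self-contained."""
    delim = len(url)  # default: the domain runs to the end of the string
    for c in '/?#':
        wdelim = url.find(c, start)
        if wdelim >= 0:
            delim = min(delim, wdelim)  # keep the earliest delimiter
    return url[start:delim], url[delim:]

start = len('http://')  # assumed offset: skip past the scheme and "//"

# Each URL maps to the (domain, rest) pair the fix should return.
cases = {
    'http://example.com?blahblah=/foo': ('example.com', '?blahblah=/foo'),
    'http://example.com/path?q=1': ('example.com', '/path?q=1'),
    'http://example.com#frag/x': ('example.com', '#frag/x'),
    'http://example.com': ('example.com', ''),
}

for url, expected in cases.items():
    assert fixed_splitnetloc(url, start) == expected
```

Taking the minimum over all three delimiter positions is what makes the order of '/?#' irrelevant.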

History
Date	User	Action	Args
2007-12-16 17:32:54	nagle	set	spambayes_score: 0.00276929 -> 0.002769292 recipients: + nagle
2007-12-16 17:32:54	nagle	set	spambayes_score: 0.00276929 -> 0.00276929 messageid: <1197826374.62.0.356704727392.issue1637@psf.upfronthosting.co.za>
2007-12-16 17:32:54	nagle	link	issue1637 messages
2007-12-16 17:32:53	nagle	create