Message58673
urlparse.urlparse will mis-parse URLs which have a "/" after a "?".
>>
>> sa1 = 'http://example.com?blahblah=/foo'
>> sa2 = 'http://example.com?blahblah=foo'
>> print urlparse.urlparse(sa1)
>> ('http', 'example.com?blahblah=', '/foo', '', '', '') # WRONG
>> print urlparse.urlparse(sa2)
>> ('http', 'example.com', '', '', 'blahblah=foo', '') # RIGHT
That's wrong. RFC 3986 ("Uniform Resource Identifier (URI): Generic
Syntax"), page 23 says:
"The characters slash ("/") and question mark ("?") may represent data
within the query component. Beware that some older, erroneous
implementations may not handle such data correctly when it is used as
the base URI for relative references (Section 5.1), apparently
because they fail to distinguish query data from path data when
looking for hierarchical separators."
So "urlparse" is an "older, erroneous implementation". The code
for "urlparse" references RFC 1808 (1995), which is long out of
date, three revisions back.
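For comparison, RFC 3986 itself gives a regular expression (Appendix B) for breaking a URI into components. The sketch below applies that expression to the failing URL; the helper name rfc3986_split is made up for this illustration and is not part of "urlparse". It is only a cross-check of what the RFC expects, not a proposed patch.

```python
import re

# Regular expression taken from RFC 3986, Appendix B.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def rfc3986_split(uri):
    """Hypothetical helper: return (scheme, authority, path, query, fragment)."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    # Groups 2, 4, 5, 7 and 9 are the five components named in Appendix B.
    return tuple(m.group(i) or '' for i in (2, 4, 5, 7, 9))

print(rfc3986_split('http://example.com?blahblah=/foo'))
# ('http', 'example.com', '', 'blahblah=/foo', '')
```

The authority group is `[^/?#]*`, so it stops at whichever of "/", "?" or "#" comes first, and the "/" inside the query stays in the query.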
>>
>> Here's the bad code:
>>
>> def _splitnetloc(url, start=0):
>>     for c in '/?#': # the order is important!
>>         delim = url.find(c, start)
>>         if delim >= 0:
>>             break
>>     else:
>>         delim = len(url)
>>     return url[start:delim], url[delim:]
>>
>> That's just wrong. The domain ends at the first appearance of
>> any character in '/?#', but that code returns the text before the
>> first '/' even if there's an earlier '?'. A URL/URI doesn't
>> have to have a path, even when it has query parameters.
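To make the failure concrete, here is a small trace of the quoted loop. It assumes the "else" belongs to the "for" (the indentation was lost in the quote), and the function is renamed old_splitnetloc here only so it can be run on its own.

```python
def old_splitnetloc(url, start=0):
    # Quoted code with indentation restored; name changed for this illustration.
    for c in '/?#':
        delim = url.find(c, start)
        if delim >= 0:
            break              # stops at the first delimiter TYPE that occurs at all
    else:
        delim = len(url)       # none of '/', '?', '#' present
    return url[start:delim], url[delim:]

rest = 'example.com?blahblah=/foo'
print(rest.find('?'))          # 11: where the domain really ends
print(rest.find('/'))          # 21: where the old loop stops, because '/' is tried first
print(old_splitnetloc(rest))   # ('example.com?blahblah=', '/foo')
```

Because "/" is tried before "?", any "/" anywhere in the string wins, even when a "?" appears earlier.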
OK, here's a fix to "urlparse", replacing _splitnetloc. I didn't use
a regular expression because "urlparse" doesn't import "re", and I
didn't want to change that.
def _splitnetloc(url, start=0):
    delim = len(url)  # position of end of domain part of url, default is end
    for c in '/?#':  # look for delimiters; the order is NOT important
        wdelim = url.find(c, start)  # find first of this delim
        if wdelim >= 0:  # if found
            delim = min(delim, wdelim)  # use earliest delim position
    return url[start:delim], url[delim:]  # return (domain, rest)
Date | User | Action | Args
2007-12-16 17:32:54 | nagle | link | issue1637 messages |
2007-12-16 17:32:53 | nagle | create | |