Message75154
Hi,
The way urljoin works is a bit funky: equivalent paths are not cleaned
in a consistent way:
import urlparse
import posixpath
print urlparse.urljoin('http://www.example.com', '///')
print urlparse.urljoin('http://www.example.com/', '///')
print urlparse.urljoin('http://www.example.com///', '///')
print urlparse.urljoin('http://www.example.com///', '//')
print urlparse.urljoin('http://www.example.com///', '/')
print urlparse.urljoin('http://www.example.com///', '')
print
# the above should reduce down to:
print posixpath.normpath('///')
print
print urlparse.urljoin('http://www.example.com///', '.')
print urlparse.urljoin('http://www.example.com///', '/.')
print urlparse.urljoin('http://www.example.com///', './')
print urlparse.urljoin('http://www.example.com///', '/.')
print
print posixpath.normpath('/.')
print
print urlparse.urljoin('http://www.example.com///', '..')
print urlparse.urljoin('http://www.example.com', '/a/../a/')
print urlparse.urljoin('http://www.example.com', '../')
print urlparse.urljoin('http://www.example.com', 'a/../a/')
print urlparse.urljoin('http://www.example.com', 'a/../a/./')
print urlparse.urljoin('http://www.example.com/a/../a/', '../a/./../a/')
print urlparse.urljoin('http://www.example.com/a/../a/', '/../a/./../a/')
The results of the above code are:
http://www.example.com/
http://www.example.com/
http://www.example.com/
http://www.example.com///
http://www.example.com/
http://www.example.com///
/
http://www.example.com///
http://www.example.com/.
http://www.example.com///
http://www.example.com/.
/
http://www.example.com
http://www.example.com/.
http://www.example.com
http://www.example.com/.
http://www.example.com//
http://www.example.com/a/../a/
http://www.example.com/../
http://www.example.com/a/
http://www.example.com/a/
http://www.example.com/a/
http://www.example.com/../a/./../a/
Sometimes the path is cleaned, sometimes it is not. When it is cleaned,
the cleaning process is not perfect.
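For comparison, a consistent cleaner behaves like posixpath.normpath, which reduces every spelling of the same path to one canonical form (shown here in Python 3 syntax):

```python
import posixpath

# every spelling of the root path collapses to the single form '/'
for path in ('///', '/.', '/./', '/a/..', '/a/../.'):
    print(posixpath.normpath(path))  # prints '/' each time
```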
The bit of code that is causing problems is commented:
# XXX The stuff below is bogus in various ways...
If I may be so bold, I would like to see this URL-cleaning code stripped
out of urljoin.
A new function could be added that cleans a URL. It could have a
'mimic browser' option, because a browser *will* follow URLs like
http://example.com/../../../ (see this non-bug:
http://bugs.python.org/issue2583 ).
The URL cleaner could reuse some of the code from posixpath. Shorter
URLs would be preferred over longer ones (e.g. http://example.com
preferred to http://example.com/ ).
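To sketch what such a cleaner might look like, here is a minimal example in Python 3 syntax (urllib.parse is the Python 3 name for urlparse). The name clean_url is hypothetical, not an existing stdlib function; it simply normalizes the path component with posixpath.normpath:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit  # 'urlparse' in Python 2

def clean_url(url):
    """Hypothetical cleaner: collapse '.', '..' and repeated '/' in a path.

    On an absolute path, posixpath.normpath already drops leading '..'
    segments, which matches how a browser resolves
    http://example.com/../../../ (the issue2583 behaviour).
    """
    scheme, netloc, path, query, fragment = urlsplit(url)
    if path:
        norm = posixpath.normpath(path)
        # POSIX quirk: normpath preserves exactly two leading slashes
        # ('//a'), which mean nothing special in a URL path, so collapse them
        if norm.startswith('//'):
            norm = '/' + norm.lstrip('/')
        # normpath drops a trailing slash; restore it, because in a URL
        # 'a/' and 'a' can name different resources
        if path.endswith('/') and not norm.endswith('/'):
            norm += '/'
        path = norm
    return urlunsplit((scheme, netloc, path, query, fragment))

print(clean_url('http://www.example.com/../a/./../a/'))
# -> http://www.example.com/a/
```

With this, all the equivalent spellings in the demonstration above reduce to the same URL, e.g. clean_url('http://www.example.com///') gives 'http://www.example.com/'.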
Thanks,
monk.e.boy

Date | User | Action | Args
2008-10-24 07:50:34 | monk.e.boy | set | recipients: + monk.e.boy
2008-10-24 07:50:33 | monk.e.boy | set | messageid: <1224834633.73.0.450098014303.issue4191@psf.upfronthosting.co.za>
2008-10-24 07:50:32 | monk.e.boy | link | issue4191 messages
2008-10-24 07:50:31 | monk.e.boy | create |