Message 75154 - Python tracker


Author	monk.e.boy
Recipients	monk.e.boy
Date	2008-10-24.07:50:30
SpamBayes Score	0.00010032309
Marked as misclassified	No
Message-id	<1224834633.73.0.450098014303.issue4191@psf.upfronthosting.co.za>
In-reply-to

Content

Hi,

  The way urljoin works is a bit funky: equivalent paths do not get
cleaned in a consistent way. For example:


import urlparse
import posixpath

print urlparse.urljoin('http://www.example.com', '///')
print urlparse.urljoin('http://www.example.com/', '///')
print urlparse.urljoin('http://www.example.com///', '///')
print urlparse.urljoin('http://www.example.com///', '//')
print urlparse.urljoin('http://www.example.com///', '/')
print urlparse.urljoin('http://www.example.com///', '')
print
# the above should reduce down to:
print posixpath.normpath('///')
print
print urlparse.urljoin('http://www.example.com///', '.')
print urlparse.urljoin('http://www.example.com///', '/.')
print urlparse.urljoin('http://www.example.com///', './')
print urlparse.urljoin('http://www.example.com///', '/.')
print
print posixpath.normpath('/.')
print
print urlparse.urljoin('http://www.example.com///', '..')
print urlparse.urljoin('http://www.example.com', '/a/../a/')
print urlparse.urljoin('http://www.example.com', '../')
print urlparse.urljoin('http://www.example.com', 'a/../a/')
print urlparse.urljoin('http://www.example.com', 'a/../a/./')
print urlparse.urljoin('http://www.example.com/a/../a/', '../a/./../a/')
print urlparse.urljoin('http://www.example.com/a/../a/', '/../a/./../a/')

The results of the above code are:

http://www.example.com/
http://www.example.com/
http://www.example.com/
http://www.example.com///
http://www.example.com/
http://www.example.com///

/

http://www.example.com///
http://www.example.com/.
http://www.example.com///
http://www.example.com/.

/

http://www.example.com
http://www.example.com/.
http://www.example.com
http://www.example.com/.

http://www.example.com//
http://www.example.com/a/../a/
http://www.example.com/../
http://www.example.com/a/
http://www.example.com/a/
http://www.example.com/a/
http://www.example.com/../a/./../a/

Sometimes the path is cleaned, sometimes it is not. When it is cleaned,
the cleaning is incomplete: the last result above,
http://www.example.com/../a/./../a/, still contains '.' and '..' segments.
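
One way to make the inconsistency easy to see is to join the same reference against several bases and collect the distinct results. The sketch below is only an illustration: `EQUIVALENT_BASES` and `distinct_results` are hypothetical names, it uses the Python 3 module name `urllib.parse` rather than `urlparse`, and it assumes (as this report does) that the three bases should be treated as equivalent. The output depends on the Python version, so none is shown.

```python
from urllib.parse import urljoin

# Bases that this report treats as equivalent (an assumption of the report).
EQUIVALENT_BASES = [
    'http://www.example.com',
    'http://www.example.com/',
    'http://www.example.com///',
]


def distinct_results(reference, bases=EQUIVALENT_BASES):
    """Join one reference against every base and return the distinct results."""
    return sorted({urljoin(base, reference) for base in bases})


# More than one entry in a list means the bases were not handled alike.
for ref in ('///', '/', '.', '..', '../'):
    print(repr(ref), distinct_results(ref))
```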

The bit of code in urljoin that is causing problems is marked with this comment:

  # XXX The stuff below is bogus in various ways...

If I may be so bold, I would like to see this URL cleaning code stripped
from urljoin.

A new method/function could be added that cleans a URL. It could have a
'mimic browser' option, because a browser *will* follow URLs like:
http://example.com/../../../ (see this non-bug
http://bugs.python.org/issue2583 )

The URL cleaner could use some of the code from "posixpath". Shorter
URLs would be preferred over longer ones (e.g. http://example.com preferred
to http://example.com/).
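
A minimal sketch of what such a cleaner might look like, just to make the proposal concrete. `clean_url` is a hypothetical name and not an existing standard library function. The sketch uses the Python 3 module name `urllib.parse`, implements only the browser-like behaviour (excess '..' segments are dropped at the root), and leans on `posixpath.normpath`, which also collapses repeated slashes. Whether that is right for every URL is an open question.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def clean_url(url):
    """Hypothetical cleaner: normalize the path component of a URL."""
    parts = urlsplit(url)
    # Strip leading slashes so normpath cannot preserve a '//' prefix.
    relative = parts.path.lstrip('/')
    # normpath drops trailing slashes, so remember whether the path
    # pointed at a directory.
    is_dir = relative.endswith(('/', '/.', '/..')) or relative in ('.', '..')
    # Normalizing an absolute path discards '..' segments above the root,
    # which is how a browser treats http://example.com/../../../
    path = posixpath.normpath('/' + relative)
    if path == '/':
        path = ''  # prefer the shorter form, as suggested above
    elif is_dir:
        path += '/'
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(clean_url('http://example.com/../../../'))
print(clean_url('http://www.example.com/../a/./../a/'))
```

Keeping this separate from urljoin would let callers decide whether they want cleaning at all.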

Thanks,

monk.e.boy

History
Date	User	Action	Args
2008-10-24 07:50:34	monk.e.boy	set	recipients: + monk.e.boy
2008-10-24 07:50:33	monk.e.boy	set	messageid: <1224834633.73.0.450098014303.issue4191@psf.upfronthosting.co.za>
2008-10-24 07:50:32	monk.e.boy	link	issue4191 messages
2008-10-24 07:50:31	monk.e.boy	create