Issue 4191: urlparse normalize URL path

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48441

classification

Title:	urlparse normalize URL path
Type:	behavior	Stage:
Components:		Versions:	Python 2.6

process

Status:	closed	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	ajaksu2, jjlee, monk.e.boy, orsenthil
Priority:	low	Keywords:

Created on 2008-10-24 07:50 by monk.e.boy, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (5)
msg75154 - (view)	Author: monk.e.boy (monk.e.boy)	Date: 2008-10-24 07:50
Hi,

The way urljoin works is a bit funky, equivalent paths do not get cleaned in a consistent way:

    import urlparse
    import posixpath

    print urlparse.urljoin('http://www.example.com', '///')
    print urlparse.urljoin('http://www.example.com/', '///')
    print urlparse.urljoin('http://www.example.com///', '///')
    print urlparse.urljoin('http://www.example.com///', '//')
    print urlparse.urljoin('http://www.example.com///', '/')
    print urlparse.urljoin('http://www.example.com///', '')
    print
    # the above should reduce down to:
    print posixpath.normpath('///')
    print
    print urlparse.urljoin('http://www.example.com///', '.')
    print urlparse.urljoin('http://www.example.com///', '/.')
    print urlparse.urljoin('http://www.example.com///', './')
    print urlparse.urljoin('http://www.example.com///', '/.')
    print
    print posixpath.normpath('/.')
    print
    print urlparse.urljoin('http://www.example.com///', '..')
    print urlparse.urljoin('http://www.example.com', '/a/../a/')
    print urlparse.urljoin('http://www.example.com', '../')
    print urlparse.urljoin('http://www.example.com', 'a/../a/')
    print urlparse.urljoin('http://www.example.com', 'a/../a/./')
    print urlparse.urljoin('http://www.example.com/a/../a/', '../a/./../a/')
    print urlparse.urljoin('http://www.example.com/a/../a/', '/../a/./../a/')

The results of the above code are:

    http://www.example.com/
    http://www.example.com/
    http://www.example.com/
    http://www.example.com///
    http://www.example.com/
    http://www.example.com///
    /
    http://www.example.com///
    http://www.example.com/.
    http://www.example.com///
    http://www.example.com/.
    /
    http://www.example.com
    http://www.example.com/.
    http://www.example.com
    http://www.example.com/.
    http://www.example.com//
    http://www.example.com/a/../a/
    http://www.example.com/../
    http://www.example.com/a/
    http://www.example.com/a/
    http://www.example.com/a/
    http://www.example.com/../a/./../a/

Sometimes the path is cleaned, sometimes it is not. When it is cleaned, the cleaning process is not perfect.
The bit of code that is causing problems is commented:

    # XXX The stuff below is bogus in various ways...

If I may be so bold, I would like to see this URL cleaning code stripped from urljoin. A new method/function could be added that cleans a URL. It could have a 'mimic browser' option, because a browser will follow URLs like http://example.com/../../../ (see this non-bug: http://bugs.python.org/issue2583).

The URL cleaner could use some of the code from "posixpath". Shorter URLs would be preferred over longer ones (e.g. http://example.com preferred to http://example.com/).

Thanks,
monk.e.boy
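A minimal sketch of what such a standalone cleaner might look like, written for Python 3 (where the `urlparse` module became `urllib.parse`). The name `clean_url` is hypothetical and not part of the standard library. The sketch applies `posixpath.normpath` to the path component only, and it assumes that collapsing repeated slashes is acceptable, which does not hold for every URL scheme.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def clean_url(url):
    """Hypothetical helper: normalize only the path component of a URL."""
    parts = urlsplit(url)
    if not parts.path:
        # No path at all (e.g. 'http://www.example.com'): leave the URL alone.
        return url

    path = posixpath.normpath(parts.path)

    # normpath keeps exactly two leading slashes (a POSIX rule that has no
    # meaning for URLs), so collapse them to one here. This is an assumption
    # of the sketch, not something an RFC requires.
    if path.startswith("//"):
        path = "/" + path.lstrip("/")

    # normpath drops a trailing slash; restore it when the input had one.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"

    # Scheme, netloc, query and fragment are passed through untouched.
    return urlunsplit(parts._replace(path=path))


print(clean_url("http://www.example.com///"))          # -> http://www.example.com/
print(clean_url("http://www.example.com/a/../a/./"))   # -> http://www.example.com/a/
```

With this approach `svn+ssh://localhost/////` would be rewritten to `svn+ssh://localhost/`, so a helper like this changes the meaning of any URL whose scheme treats empty path segments as significant.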
msg75851 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2008-11-14 06:35
This report almost seems like a bug with urlparse, but it is not. We have to consider certain cases here.

1) First of all, we cannot equate urlparse, urlsplit, and urljoin with the path normalization provided by posixpath.normpath. The reason is that URL syntax is defined strictly by RFCs, which differ from an operating system's file and directory naming syntax. So the expectation that urlparse() should return the same result as posixpath.normpath() is wrong. The most we can ask is whether urlparse follows the guidelines in RFC 1808 to start with, and RFC 3986 (current).

2) Secondly, in a generic sense, it is better to follow the RFC-defined parsing rules for URLs than to implement browser behavior, because urlparse needs to parse URLs of other schemes as well. Take svn+ssh, where a valid URL is svn+ssh://localhost///// and in this case '////' is the name of my directory where I have the source code. Quite possible, right? So it should not be converted to '/', which would be wrong.

3) Coming down to the more specific issues with the examples presented in this report: urlsplit treats a leading '//' as the marker that introduces the netloc, so both a single '/' and '///' are parsed as path '/':

    >>> urlparse.urlsplit('//')
    SplitResult(scheme='', netloc='', path='', query='', fragment='')
    >>> urlparse.urlsplit('/')
    SplitResult(scheme='', netloc='', path='/', query='', fragment='')
    >>> urlparse.urlsplit('///')
    SplitResult(scheme='', netloc='', path='/', query='', fragment='')

Having this in mind, follow the examples you have provided:

    print urlparse.urljoin('http://www.example.com///', '//')
    print urlparse.urljoin('http://www.example.com///', '/')
    print urlparse.urljoin('http://www.example.com///', '')

You will find that they are in accordance with the parsing and joining rules defined in RFC 1808 (http://www.faqs.org/rfcs/rfc1808.html). The same is true of the other examples, monk.e.boy.
If you see that a urlparse method has a problem, then please point me to the section in RFC 1808/RFC 3986 where it is not conforming, and I shall work on a patch to fix it. This report is not a valid bug and can be closed.
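The splitting behavior described in point 3 can be checked on Python 3 as well, where the same function lives in `urllib.parse`. The helper name `describe` below is made up for this illustration; it only reports the netloc and path that `urlsplit` assigns to each reference.

```python
from urllib.parse import urlsplit


def describe(ref):
    """Return the (netloc, path) pair that urlsplit assigns to a reference."""
    parts = urlsplit(ref)
    return parts.netloc, parts.path


# A leading '//' starts the netloc; whatever follows the netloc is the path.
for ref in ("//", "/", "///", "svn+ssh://localhost/////"):
    netloc, path = describe(ref)
    print("%-28r netloc=%r path=%r" % (ref, netloc, path))
```

The last reference shows the svn+ssh case from point 2: the repeated slashes are kept in the path as they are and are not collapsed by parsing.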
msg81846 - (view)	Author: Daniel Diniz (ajaksu2) *	Date: 2009-02-13 01:45
Will close soon if nobody is against it.
msg82401 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2009-02-18 01:54
Please close this, Daniel.
msg82416 - (view)	Author: Daniel Diniz (ajaksu2) *	Date: 2009-02-18 14:38
Thanks Senthil!

History
Date	User	Action	Args
2022-04-11 14:56:40	admin	set	github: 48441
2009-02-18 14:38:01	ajaksu2	set	status: pending -> closed messages: + msg82416
2009-02-18 01:54:42	orsenthil	set	messages: + msg82401
2009-02-18 01:52:17	ajaksu2	set	status: open -> pending priority: low
2009-02-13 01:45:30	ajaksu2	set	nosy: + ajaksu2, jjlee messages: + msg81846
2008-11-14 06:35:40	orsenthil	set	type: behavior messages: + msg75851 nosy: + orsenthil
2008-10-24 07:50:32	monk.e.boy	create