classification
Title: Strange behavior of urlparse.urljoin
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.0, Python 2.6, Python 2.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: facundobatista, fantix, ijmorlan, orsenthil, shura_zam, tier, yan
Priority: normal Keywords: patch

Created on 2007-11-13 01:18 by yan, last changed 2008-08-14 19:50 by facundobatista. This issue is now closed.

Files
File name Uploaded Description
issue1432-py26.diff orsenthil, 2008-08-04 09:18
issue1432-py3k.diff orsenthil, 2008-08-04 09:19
Messages (15)
msg57434 - (view) Author: yan (yan) Date: 2007-11-13 01:18
With Python 2.4/2.5, I found some strange behavior:
urlparse.urljoin("http://www.python.org/issue?@template=item","?@template=none")
It returns "http://www.python.org/?@template=none", but I think it
should be "http://www.python.org/issue?@template=none", right? I also
tested it in Python 2.3, where the result is what I expected.
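
The report can be checked with a short script. This is a minimal sketch that assumes a modern Python 3 interpreter, where the function lives in urllib.parse rather than in the Python 2 urlparse module; the variable names are illustrative only.

```python
from urllib.parse import urljoin

base = "http://www.python.org/issue?@template=item"
relative = "?@template=none"

# With a query-only reference, the question in this report is whether
# the last path segment of the base ("issue") is kept or dropped.
result = urljoin(base, relative)
print(result)
```

On an interpreter that includes the fix discussed later in this thread, the printed URL should keep the "issue" segment.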
msg57485 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-11-14 09:26
Not really.
RFC 1808, on which the urlparse module is based, defines the following for
the PATH component when joining a relative URL to the base URL.


   Step 6: The last segment of the base URL's path (anything
           following the rightmost slash "/", or the entire path if no
           slash is present) is removed and the embedded URL's path is
           appended in its place. 

So what is happening is by design and per RFC1808. This bug
report can be closed as working as designed.
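
As a rough illustration of the quoted Step 6 only, the sketch below applies that single rule to two path strings. The helper name merge_paths_rfc1808_step6 is hypothetical and is not part of urlparse; the sketch ignores every other step of the RFC 1808 algorithm.

```python
def merge_paths_rfc1808_step6(base_path, relative_path):
    """Hypothetical helper: apply only Step 6 of RFC 1808 to two paths."""
    # Drop everything after the rightmost "/" in the base path
    # (or the whole path if it contains no "/").
    cut = base_path.rfind("/")
    directory = base_path[:cut + 1]
    # Append the relative (embedded) path in place of the removed segment.
    return directory + relative_path

print(merge_paths_rfc1808_step6("/b/c/d", "g"))
print(merge_paths_rfc1808_step6("/issue", ""))
```

The second call shows what this rule alone does when the relative path is empty: the "issue" segment is removed and nothing replaces it, which is consistent with the result in the original report.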
msg57488 - (view) Author: yan (yan) Date: 2007-11-14 13:18
Not really, that applies only to the PATH component. The QUERY and
PARAMETER components are handled differently.
Just check RFC1808:
5.1. Normal Examples
Base: <URL:http://a/b/c/d;p?q#f>
?y         = <URL:http://a/b/c/d;p?y>
;x         = <URL:http://a/b/c/d;x>
msg57583 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-11-16 12:38
Yes, you are right.
test_urlparse also does not cover the scenario where the relative URL
starts with a query like ?y.

This needs to be addressed. I shall code the patch to fix it.
msg57584 - (view) Author: yan (yan) Date: 2007-11-16 13:15
That sounds great, thanks a lot.
msg58261 - (view) Author: Fantix King (fantix) Date: 2007-12-07 03:38
This issue also causes similar behavior in libraries such as mechanize,
which depend on urljoin.
msg58952 - (view) Author: Isaac Morland (ijmorlan) Date: 2007-12-21 18:32
Issue 1637, Issue 1779700, and Issue 1462525 also relate to this problem.
msg58953 - (view) Author: Isaac Morland (ijmorlan) Date: 2007-12-21 18:34
RFC 1808 has been obsoleted by RFC 3986:

http://tools.ietf.org/html/rfc3986
msg59040 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2007-12-29 20:36
If we look carefully at urlparse.py and test_urlparse.py over
the releases from Python 2.3 to Python 2.6, the changes required to
support RFC2396 have been implemented. (RFC2396 replaced RFC1808 as the URL
specification.)
But the header of urlparse.py still says it follows RFC1808
only. *This needs to be changed.*
test_urlparse.py also contains test cases for RFC2396 compliance.

In this specific bug report, we have hit a case where the later
specification is not compatible with the older one.

As per RFC1808
Base: <URL:http://a/b/c/d;p?q#f>

Relative URL resolution:
?y         = <URL:http://a/b/c/d;p?y>
;x         = <URL:http://a/b/c/d;x>

As per RFC2396
Base: http://a/b/c/d;p?q

Relative URLs:
?y            =  http://a/b/c/?y
;x            =  http://a/b/c/;x

Do you see the difference?
urlparse.py has been made RFC2396 compliant, so the incompatible
test above has been removed as well.

Now, even RFC2396 is obsolete and has been superseded by RFC3986,
which gives these examples:

Base: http://a/b/c/d;p?q

Relative URL:

"?y"            =  "http://a/b/c/d;p?y"
";x"            =  "http://a/b/c/;x"

This is crazy: the first, ?y, matches the older RFC1808 result, and the
second, ;x, matches the later RFC2396.
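
To make the three-way comparison concrete, here is a small sketch that records the examples quoted in this message and reports which RFCs agree with whatever urljoin returns on the interpreter running it. It assumes Python 3 (urllib.parse); the EXPECTED table and the matching_rfcs helper are illustrative names, and the #f fragment on the RFC1808 base is left out because the quoted results carry no fragment.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# Expected results copied from the examples quoted in this message.
EXPECTED = {
    "RFC1808": {"?y": "http://a/b/c/d;p?y", ";x": "http://a/b/c/d;x"},
    "RFC2396": {"?y": "http://a/b/c/?y", ";x": "http://a/b/c/;x"},
    "RFC3986": {"?y": "http://a/b/c/d;p?y", ";x": "http://a/b/c/;x"},
}

def matching_rfcs(reference):
    """Hypothetical helper: list the RFCs whose example agrees with urljoin."""
    actual = urljoin(BASE, reference)
    return [rfc for rfc, cases in EXPECTED.items() if cases[reference] == actual]

for ref in ("?y", ";x"):
    print(ref, urljoin(BASE, ref), matching_rfcs(ref))
```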

For just this issue, my take would be:
1) Make the current urlparse.py compliant with RFC2396 and remove the claim
that it is compliant with RFC1808 only. That is a documentation fix (patch
attached).

Overall, the best solution will be RFC3986 compliance, which is a
separate effort.
msg69900 - (view) Author: Roman Petrichev (tier) Date: 2008-07-17 19:20
Senthil, please read the RFC3986 text, not only the examples.
[Page 31] contains the exact algorithm for handling this case.
--cut--
if (R.path == "") then
   T.path = Base.path;
   if defined(R.query) then
      T.query = R.query;
   else
      T.query = Base.query;
   endif;
--cut--

I.e. instead of:
>>> urljoin('http://www.ya.ru/index.php', '?o=30&a=l')
'http://www.ya.ru/?o=30&a=l'
Python SHOULD do:
>>> urljoin('http://www.ya.ru/index.php', '?o=30&a=l')
'http://www.ya.ru/index.php?o=30&a=l'

Look at how any browser handles this case.
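
The quoted branch of the RFC3986 algorithm can be sketched in Python roughly as follows. This is not the urlparse implementation: resolve_empty_path is a hypothetical helper that handles only references with an empty path and no scheme or authority, and it assumes Python 3's urllib.parse. One known simplification is that urlsplit cannot tell an empty query ("?") from an absent one, so "defined" is approximated by "non-empty".

```python
from urllib.parse import urlsplit, urlunsplit

def resolve_empty_path(base, reference):
    """Hypothetical helper: the R.path == "" branch of RFC 3986 resolution."""
    b = urlsplit(base)
    r = urlsplit(reference)
    if r.scheme or r.netloc or r.path:
        raise ValueError("only empty-path relative references are handled")
    # T.path = Base.path; the query comes from R if present, else from Base.
    query = r.query if r.query else b.query
    # The fragment always comes from the reference.
    return urlunsplit((b.scheme, b.netloc, b.path, query, r.fragment))

print(resolve_empty_path("http://www.ya.ru/index.php", "?o=30&a=l"))
```

With the base and reference from the example above, this keeps index.php and replaces only the query.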
msg70689 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-08-04 09:18
Yes, I agree with you, Roman. 

I have made changes to urlparse.urljoin so that it behaves in conformance
with RFC3986. Joining BASE ("http://a/b/c/d;p?q") with REL ("?y") now
results in "http://a/b/c/d;p?y", as expected.

I have added a set of test cases for conformance with RFC3986 as well.

Facundo: would you like to review this patch and commit it?

Thanks!
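
For readers without the attached diffs, a conformance check in this spirit might look like the sketch below. It is not the contents of the patch: the CASES list is a small selection of the normal examples from RFC 3986 section 5.4.1, failed_cases is a hypothetical helper, and the code assumes Python 3's urllib.parse.

```python
from urllib.parse import urljoin

RFC3986_BASE = "http://a/b/c/d;p?q"

# A few "normal examples" from RFC 3986, section 5.4.1.
CASES = [
    ("g", "http://a/b/c/g"),
    ("/g", "http://a/g"),
    ("?y", "http://a/b/c/d;p?y"),
    ("g?y", "http://a/b/c/g?y"),
    ("#s", "http://a/b/c/d;p?q#s"),
    ("../g", "http://a/b/g"),
]

def failed_cases(base=RFC3986_BASE, cases=CASES):
    """Hypothetical helper: return (reference, actual, expected) mismatches."""
    failures = []
    for reference, expected in cases:
        actual = urljoin(base, reference)
        if actual != expected:
            failures.append((reference, actual, expected))
    return failures

print(failed_cases())
```

An empty list means urljoin agrees with every listed example on the interpreter in use.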
msg70690 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-08-04 09:18
Patch for py3k
msg70784 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-08-06 13:55
Senthil: We should ask for advice on the web-sig list to see whether this is
enough of a "bug to be fixed" now that we're in beta for the releases.

Thanks!
msg71039 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-08-12 00:18
Hi Facundo,
I think we can go ahead and commit the changes. I got a response on
Web-SIG that the previously listed RFC2396 behavior is invalid (in a
practical sense) and that the current patch fixes it.
msg71147 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-08-14 19:50
Committed in revs 65679 and 65680.

Thank you all!!
History
Date User Action Args
2008-08-14 19:50:54 facundobatista set status: open -> closed
resolution: fixed
messages: + msg71147
2008-08-12 00:18:01 orsenthil set messages: + msg71039
2008-08-06 13:55:09 facundobatista set messages: + msg70784
2008-08-04 09:19:08 orsenthil set files: - urlparse.patch
2008-08-04 09:19:00 orsenthil set files: + issue1432-py3k.diff
messages: + msg70690
2008-08-04 09:18:18 orsenthil set files: + issue1432-py26.diff
nosy: + facundobatista
messages: + msg70689
keywords: + patch
versions: + Python 3.0
2008-07-18 05:19:11 shura_zam set nosy: + shura_zam
2008-07-17 19:20:17 tier set nosy: + tier
messages: + msg69900
2008-01-20 19:55:54 christian.heimes set priority: normal
versions: - Python 2.4
2007-12-29 20:36:40 orsenthil set files: + urlparse.patch
messages: + msg59040
2007-12-21 18:34:05 ijmorlan set messages: + msg58953
2007-12-21 18:32:25 ijmorlan set nosy: + ijmorlan
messages: + msg58952
2007-12-07 03:38:28 fantix set nosy: + fantix
messages: + msg58261
2007-11-16 13:15:57 yan set messages: + msg57584
2007-11-16 12:38:59 orsenthil set messages: + msg57583
2007-11-14 13:18:09 yan set messages: + msg57488
2007-11-14 09:26:56 orsenthil set nosy: + orsenthil
messages: + msg57485
2007-11-13 01:18:29 yan create