Title: urljoining an empty query string doesn't clear query string
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.11, Python 3.10, Python 3.9
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: asvetlov, iritkatriel, orsenthil, pfish
Priority: normal Keywords: patch

Created on 2018-02-06 04:48 by pfish, last changed 2021-05-28 14:36 by pfish.

Pull Requests
URL Status Linked Edit
PR 5645 open python-dev, 2018-02-12 22:01
Messages (7)
msg311704 - (view) Author: Paul Fisher (pfish) * Date: 2018-02-06 04:48
urljoining with '?' will not clear a query string:

>>> import urllib.parse
>>> urllib.parse.urljoin('http://a/b/c?d=e', '?')
'http://a/b/c?d=e'

Expected: 'http://a/b/c' (optionally, with a ? at the end)

WhatWG's URL standard expects a relative URL consisting of only a ? to replace a query string.

Seen in versions 3.6 and 2.7, but probably also affects later versions.
msg311937 - (view) Author: Paul Fisher (pfish) * Date: 2018-02-10 06:05
I'm working on a patch for this and can have one up in the next week or so, once I get the CLA signed and other boxes ticked. I'm new to the GitHub process, but hopefully it will be a good start for the discussion.
msg312201 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2018-02-15 11:04
Python follows the RFC, not WhatWG. The RFC is the proper definition of the URL joining algorithm.
msg312223 - (view) Author: Paul Fisher (pfish) * Date: 2018-02-15 20:28
In this case, the RFC is mismatched from the actual behaviour of browsers (as described and codified by WhatWG).  It was surprising to me that urljoin() didn't do what I perceived as "the right thing" (and I expect other users would be surprised too).

I would personally expect urljoin to do "the thing that everybody else does".  Is there a sensible way to reduce this mismatch?

For reference, Java's stdlib does what I would expect here:

    URI base = URI.create("");
    URI rel = base.resolve("?");
msg394648 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-05-28 09:52
The relevant part in the RFC pseudo code is 

               if defined(R.query) then
                  T.query = R.query;
               else
                  T.query = Base.query;

which is implemented in urljoin as:

        if not query:
            query = bquery

Is this correct? Should the code not say "if query is not None"?
(I can't see in the RFC a definition of defined()).
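
To make the difference between the two checks concrete, here is a minimal sketch. Both function names are hypothetical and are not part of urllib.parse, and using None to stand for an "undefined" query is an assumption of this sketch, not what urlparse returns. The two checks only disagree when the reference query is the empty string:

```python
def pick_query_truthy(ref_query, base_query):
    # Mirrors the `if not query` check quoted above:
    # an empty string falls back to the base query.
    return ref_query if ref_query else base_query


def pick_query_defined(ref_query, base_query):
    # Mirrors the RFC's defined(R.query) test, with None standing in
    # for "undefined" (an assumption made for this sketch).
    return ref_query if ref_query is not None else base_query


# Compare both checks for an undefined, an empty, and a non-empty query.
for ref_query in (None, '', 'x=1'):
    print(repr(ref_query),
          repr(pick_query_truthy(ref_query, 'd=e')),
          repr(pick_query_defined(ref_query, 'd=e')))
```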
msg394649 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-05-28 10:00
Sorry, urlparse returns '' rather than None when there is no query.
So we indeed need to check something like 
    if '?' not in url:
or what's in Paul's patch. 

However, my main point was to question whether fixing this is actually in contradiction with the RFC.
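
As an illustration of the parsing point above, the following sketch shows that the parsed query is '' both when the reference has no '?' and when it is a bare '?', so the distinction has to be recovered from the raw string. reference_query is a hypothetical helper written for this example. It is one possible way to do the check and is not necessarily what the patch does:

```python
from urllib.parse import urlsplit


def reference_query(ref):
    """Return the query of ref, or None if ref has no '?' before any '#'."""
    # A '?' that appears after '#' belongs to the fragment, so drop it first.
    before_fragment, _, _ = ref.partition('#')
    _, delimiter, query = before_fragment.partition('?')
    return query if delimiter else None


# Columns: reference, query as parsed by urlsplit, query from the raw string.
for ref in ('g', '?', '?x=1', '#a?b'):
    print(repr(ref), repr(urlsplit(ref).query), repr(reference_query(ref)))
```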
msg394664 - (view) Author: Paul Fisher (pfish) * Date: 2021-05-28 14:36
Reading more into this, from section 5.2.1:

> A component is undefined if its associated delimiter does not appear in the URI reference

So you could say that since there is a '?', the query component is *defined*, but *empty*. This would mean that assigning the target query to be '' has the desired effect as implemented by browsers and other languages' standard libraries.
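
A rough sketch of that reading, written as a wrapper around the existing function, could look like the following. urljoin_empty_query is a hypothetical name and is not part of urllib.parse; this is only one way to get the "defined but empty" behaviour and makes no claim about how the patch implements it:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def urljoin_empty_query(base, ref):
    """Join like urljoin(), but let a defined, empty query clear the base query."""
    joined = urljoin(base, ref)
    # Only a '?' before any '#' delimits a query; one inside the fragment does not.
    _, delimiter, query = ref.partition('#')[0].partition('?')
    if delimiter and not query:
        # The reference's query is defined but empty: drop any inherited query.
        joined = urlunsplit(urlsplit(joined)._replace(query=''))
    return joined


print(urljoin_empty_query('http://a/b/c?d=e', '?'))
print(urljoin_empty_query('http://a/b/c?d=e', '?x=1'))
print(urljoin_empty_query('http://a/b/c?d=e', '#a?b'))
```

References without a '?' (or with one only inside the fragment) go through urljoin() unchanged, so the base query is still inherited in those cases.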
Date                 User         Action  Args
2021-05-28 14:36:47  pfish        set     messages: + msg394664
2021-05-28 10:00:15  iritkatriel  set     messages: + msg394649
2021-05-28 09:52:36  iritkatriel  set     nosy: + iritkatriel
                                          messages: + msg394648
                                          versions: + Python 3.9, Python 3.10, Python 3.11, - Python 2.7, Python 3.6
2018-02-15 20:28:51  pfish        set     messages: + msg312223
2018-02-15 11:04:20  asvetlov     set     nosy: + asvetlov
                                          messages: + msg312201
2018-02-12 22:01:41  python-dev   set     keywords: + patch
                                          stage: patch review
                                          pull_requests: + pull_request5446
2018-02-10 06:05:14  pfish        set     messages: + msg311937
2018-02-10 03:38:04  terry.reedy  set     nosy: + orsenthil
2018-02-06 04:48:49  pfish        create