Issue 18828: urljoin behaves differently with custom and standard schemas

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/63028

classification

Title:	urljoin behaves differently with custom and standard schemas
Type:	behavior	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.4, Python 3.5, Python 2.7

process

Status:	open	Resolution:
Dependencies:	16134 23759 25895	Superseder:
Assigned To:		Nosy List:	berker.peksag, demian.brecht, madison.may, martin.panter, mher, orsenthil
Priority:	normal	Keywords:	patch

Created on 2013-08-25 12:52 by mher, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description	Edit
urljoin-scheme.patch	martin.panter, 2014-12-17 04:28	scheme not checked	review
urljoin-non-hier.patch	martin.panter, 2015-03-26 11:26	join if not in non_hierarchical	review

Messages (16)
msg196125 - (view)	Author: Mher Movsisyan (mher)	Date: 2013-08-25 12:51
>>> urljoin('redis://localhost:6379/0', '/1') '/1' >>> urljoin('http://localhost:6379/0', '/1') 'http://localhost:6379/1'
msg196230 - (view)	Author: Madison May (madison.may) *	Date: 2013-08-26 17:34
From urllib.parse: uses_relative = ['ftp', 'http', 'gopher', 'nntp', 'imap', 'wais', 'file', 'https', 'shttp', 'mms', 'prospero', 'rtsp', 'rtspu', '', 'sftp', 'svn', 'svn+ssh'] From urllib.parse.urljoin (scheme='redis' and url='/1' in your example): if scheme != bscheme or scheme not in uses_relative: return _coerce_result(url) Should the 'redis' scheme be added to uses_relative, perhaps?
msg196515 - (view)	Author: Martin Panter (martin.panter) *	Date: 2013-08-30 06:08
Similarly, I expected this to return "rtmp://host/app?auth=token": urljoin("rtmp://host/app", "?auth=token") I'm not sure adding everybody's custom scheme to a hard-coded whitelist is the best way to do solve this. Below I have identified some other schemes not in the "uses_relative" list. Is there any reason why one would use urljoin() with them, but want the base URL to be ignored (as is the current behaviour)? I looked at test_urlparse.py and there doesn't seem to be any test cases for these schemes. >>> all = set().union(uses_relative, uses_netloc, uses_params, non_hierarchical, uses_query, uses_fragment) >>> sorted(all.difference(uses_relative)) ['git', 'git+ssh', 'hdl', 'mailto', 'news', 'nfs', 'rsync', 'sip', 'sips', 'snews', 'tel', 'telnet'] Even if the behaviour can't be changed, could the documentation for urljoin() say something like this: Only the following [uses_relative] schemes are allowed in the base URL; any other schemes result in the relative URL being returned without being joined to the base.
msg196775 - (view)	Author: Berker Peksag (berker.peksag) *	Date: 2013-09-02 08:59
How about adding a codecs.register like public API for 3.4+? import urllib.parse urllib.parse.schemes.register('redis', 'rtmp') or: urllib.parse.urljoin('redis://localhost:6379/0', '/1', scheme='redis') or just: urllib.parse.schemes.extend(['redis', 'rtmp'])
msg196794 - (view)	Author: Madison May (madison.may) *	Date: 2013-09-02 17:39
If nothing else, we should document the work around for this issue. >>> import urllib.parse >>> urllib.parse.uses_relative.append('redis') >>> urllib.parse.uses_netloc.append('redis') >>> urllib.parse.urljoin('redis://localhost:6379/0', '/1') 'redis://localhost:6379/1'
msg196795 - (view)	Author: Madison May (madison.may) *	Date: 2013-09-02 17:49
>How about adding a codecs.register like public API for 3.4+? A codecs style register function seems like an excellent solution to me.
msg232799 - (view)	Author: Martin Panter (martin.panter) *	Date: 2014-12-17 04:28
I think a global registry seems like overkill. Here is a patch to make urljoin() treat schemes more equally and work with arbitrary schemes automatically. I haven’t heard any arguments against this option yet, and it didn’t break any tests. Another option, still simpler than a registry, would be an extra parameter, say urljoin(a, b, any_scheme=True).
msg238363 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-03-17 23:47
> I haven’t heard any arguments against this option yet, and it didn’t break any tests. Pre patch: >>> urljoin('mailto:foo@', 'bar.com') 'bar.com' Post patch: >>> urljoin('mailto:foo@', 'bar.com') 'mailto:bar.com/bar.com' I'm taking an educated guess here based on a marginal amount of research (there are just a few registered schemes at http://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml that should be understood), but it /seems/ like perhaps the current behaviour is intended to safeguard against joining non-hierarchical schemes in which case you'd get nonsensical values. It does seem a little odd to me, but I definitely prefer the former behaviour to the latter. I think that short term, Madison's suggestion about documenting uses_relative would be an easy win and can be applied to all branches. Long term though, I think it would be nice to have a generalized urljoin() method that accounts for most (if not all) classifications of url schemes. Thoughts?
msg238497 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-03-19 07:25
I opened Issue 23703 about the funny doubled bar.com result. After backing out revision 901e4e52b20a, but with my patch here applied: >>> urljoin('mailto:foo@', 'bar.com') 'mailto:bar.com' which seems fairly sensible to me. A more awkward question is if this behaviour of my patch is reasonable: >>> urljoin('mailto:person-foo/bar@example.net', 'bar.com') 'mailto:person-foo/bar.com' Yet another option, similar to my “any_scheme=True” flag, might be to change from the “uses_relative” white-list to a “not_relative” black-list of URL schemes, so that urljoin() works for arbitrary schemes except for ones like “mailto:” that are in the hard-coded list.
msg238534 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-03-19 15:36
>>>> urljoin('mailto:foo@', 'bar.com') > 'mailto:bar.com' > > which seems fairly sensible to me. This is where joining arbitrary protocols gets tricky. Does it make sense to merge non-hierarchical protocols such as mailto? My initial reaction is "no" and what should actually happen here is one of two things: 1. The result is a simple concatenation: "mailto:foo@bar.com". 2. An exception is raised indicating that urljoin cannot determine how to handle merging base and url. The above could happen in cases where either scheme is None for both base and url or the scheme to be used is any of urllib.parse.non_hierarchical. > A more awkward question is if this behaviour of my patch is reasonable: > >>>> urljoin('mailto:person-foo/bar@example.net', 'bar.com') > 'mailto:person-foo/bar.com' A couple thoughts on this: If urllib.parse.non_hierarchical is used to determine merge vs. simple concat (or exception), this specific case won't be an issue. Also, according to 6068, "mailto:person-foo/bar@example.net' is invalid (the "/" should be percent-encoded), but I don't think it should be the job of urljoin to understand the URI structures of each protocol, outside of logically join base and url. > Yet another option, similar to my “any_scheme=True” flag, might be to change from the “uses_relative” white-list to a “not_relative” black-list of URL schemes, so that urljoin() works for arbitrary schemes except for ones like “mailto:” that are in the hard-coded list. This list may already be present in urllib.parse.non_hierarchical. I also think it's worthwhile to do some further research against http://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml to ensure the list is up to date. If this path is chosen, I would suggest getting sign off from a couple core devs prior to investing time in this as all changes discussed so far are backwards incompatible.
msg238536 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-03-19 15:40
Also, I would suggest still including the doc changes proposed by Madison in all versions prior to 3.5.
msg239125 - (view)	Author: Berker Peksag (berker.peksag) *	Date: 2015-03-24 14:18
> Yet another option, similar to my “any_scheme=True” flag, might be to change from the “uses_relative” white-list to a “not_relative” black-list of URL schemes, [...] I think this looks like a good solution.
msg239324 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-03-26 11:26
The current behaviour when no scheme is present is fairly sensible to me and should not be changed to do string concatenation nor raise an exception: >>> urljoin("//netloc/old/path", "new/path") '//netloc/old/new/path' I am posting urljoin-non-hier.patch as an alternative to my first patch. This one changes urljoin() to work on any URL scheme not in the existing “non_hierarchical” blacklist. I removed the gopher, wais, and imap schemes from the list, and added tel, so that urljoin() continues to treat these special cases as before. Out of the schemes mentioned in the module but missing from uses_relative, I think non_hierarchical now has all those without directory components: hdl, mailto, news, sip, sips, snews, tel, telnet. However I am still not really convinced that my first urljoin-scheme.patch is a bad idea. Do people actually use urljoin() with these schemes like mailto in the first place?
msg239341 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-03-26 15:35
> The current behaviour when no scheme is present is fairly sensible to me and should not be changed to do string concatenation nor raise an exception Agreed. Defaulting to relative behaviour makes sense as I imagine that'll be the general use case. > I removed the gopher, wais, and imap schemes from the list I'd be concerned about removing items as non_hierarchical /is/ public facing and it's reasonable to assume that there are libraries out there that depend on these. Additionally, at a glance through their respective RFCs, it seems that these three protocols /do/ belong in the non_hierarchical list. While WAIS and IMAP do use / as a delimiter, they're not hierarchical and therefore relative joining doesn't make sense. For example, with the following definition in mind (RFC4156): wais://<host>:<port>/<database>/<wtype>/<wpath> The following will result in an incorrect URL: urljoin('wais://foo@bar.com/mydb/type/path', '/newpath') > However I am still not really convinced that my first urljoin-scheme.patch is a bad idea. Do people actually use urljoin() with these schemes like mailto in the first place? I'd be inclined to agree that it's far from common practice. That said, I did find one instance of a project that seems to depend on current behaviour (although it's only in tests and I haven't looked any deeper): https://github.com/chfoo/wpull/blob/32837d7c5614d7f90b8242e1fbb41f8da9bc7ce7/wpull/url_test.py#L637. I imagine that the current behaviour may possibly be useful for utilities such as web crawlers. In light of that and the fact that the urllib.parse docs currently has a list of protocols that are intended to be supported across the entire module's API, I think that it's important to not break backwards compatibility in cases where the relative URL would have been returned. Your second patch seems to have this behaviour which I think is preferable.
msg239363 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-03-26 23:09
If necessary, we can add a new non_relative list, rather than changing non_hierarchical. The repository history shows that “non_hierarchical” was updated with new schemes once or twice, but has never been used since it was added to Python as “urlparse.py”. IMAP, WAIS and Gopher URLs can have extra components added using slashes, which satisfies my idea of “hierarchical”. For IMAP, I think this is explicitly mentioned in the RFC: <https://tools.ietf.org/html/rfc5092#section-7>. For WAIS, the hierarchy is not arbitrary, but your resulting URL wais://foo@bar.com/newpath probably matches the wais://<host>:<port>/<database> URL form, and I am not proposing to change that behaviour.
msg276162 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-09-13 04:22
Recording bugs reports for specific schemes as dependencies of this: Issue 25895: ws(s) Issue 16134: rtmp(e/s/t) Issue 23759: coap(s)

History
Date	User	Action	Args
2022-04-11 14:57:50	admin	set	github: 63028
2016-09-13 04:22:49	martin.panter	set	dependencies: + Add support for RTMP schemes to urlparse, urllib.parse: make coap:// known, urllib.parse.urljoin does not handle WebSocket URLs messages: + msg276162
2015-03-26 23:09:49	martin.panter	set	messages: + msg239363
2015-03-26 15:35:09	demian.brecht	set	messages: + msg239341
2015-03-26 11:26:37	martin.panter	set	files: + urljoin-non-hier.patch messages: + msg239324
2015-03-24 14:18:39	berker.peksag	set	messages: + msg239125 versions: - Python 3.3
2015-03-19 15:40:13	demian.brecht	set	messages: + msg238536
2015-03-19 15:36:23	demian.brecht	set	messages: + msg238534
2015-03-19 07:25:17	martin.panter	set	messages: + msg238497
2015-03-17 23:49:30	demian.brecht	set	stage: patch review versions: + Python 3.5
2015-03-17 23:47:46	demian.brecht	set	nosy: + demian.brecht messages: + msg238363
2014-12-17 04:28:13	martin.panter	set	files: + urljoin-scheme.patch keywords: + patch messages: + msg232799
2013-09-02 17:49:32	madison.may	set	messages: + msg196795
2013-09-02 17:39:09	madison.may	set	messages: + msg196794
2013-09-02 08:59:39	berker.peksag	set	nosy: + berker.peksag messages: + msg196775
2013-08-30 06:08:19	martin.panter	set	nosy: + martin.panter messages: + msg196515
2013-08-30 04:28:35	madison.may	set	components: + Library (Lib)
2013-08-26 19:26:46	ned.deily	set	nosy: + orsenthil
2013-08-26 17:34:18	madison.may	set	nosy: + madison.may messages: + msg196230
2013-08-25 12:52:00	mher	create