Issue 22891: code removal from urllib.parse.urlsplit()

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67080

classification

Title:	code removal from urllib.parse.urlsplit()
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.5

process

Status:	closed	Resolution:	out of date
Dependencies:	27657	Superseder:
Assigned To:		Nosy List:	Alexander.Todorov, Tim.Graham, gregory.p.smith, martin.panter, miss-islington, orsenthil
Priority:	normal	Keywords:	patch

Created on 2014-11-17 10:42 by Alexander.Todorov, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
urlparse_refactor.patch	Alexander.Todorov, 2014-11-17 10:42	patch which removes unnecessary lines from urlparse()	review

Pull Requests
URL	Status	Linked	Edit
PR 661		Tim.Graham, 2017-03-14 13:55
PR 16837	merged	miss-islington, 2019-10-18 13:07
PR 16839	merged	orsenthil, 2019-10-18 13:51

Messages (10)
msg231278 - (view)	Author: Alexander Todorov (Alexander.Todorov) *	Date: 2014-11-17 10:42
In the urllib.parse (or urlparse on Python 2.X) module there is this function: 157 def urlsplit(url, scheme='', allow_fragments=True): 158 """Parse a URL into 5 components: 159 <scheme>://<netloc>/<path>?<query>#<fragment> 160 Return a 5-tuple: (scheme, netloc, path, query, fragment). 161 Note that we don't break the components up in smaller bits 162 (e.g. netloc is a single string) and we don't expand % escapes.""" 163 allow_fragments = bool(allow_fragments) 164 key = url, scheme, allow_fragments, type(url), type(scheme) 165 cached = _parse_cache.get(key, None) 166 if cached: 167 return cached 168 if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth 169 clear_cache() 170 netloc = query = fragment = '' 171 i = url.find(':') 172 if i > 0: 173 if url[:i] == 'http': # optimize the common case 174 scheme = url[:i].lower() 175 url = url[i+1:] 176 if url[:2] == '//': 177 netloc, url = _splitnetloc(url, 2) 178 if allow_fragments and '#' in url: 179 url, fragment = url.split('#', 1) 180 if '?' in url: 181 url, query = url.split('?', 1) 182 v = SplitResult(scheme, netloc, url, query, fragment) 183 _parse_cache[key] = v 184 return v 185 for c in url[:i]: 186 if c not in scheme_chars: 187 break 188 else: 189 scheme, url = url[:i].lower(), url[i+1:] 190 191 if url[:2] == '//': 192 netloc, url = _splitnetloc(url, 2) 193 if allow_fragments and '#' in url: 194 url, fragment = url.split('#', 1) 195 if '?' in url: 196 url, query = url.split('?', 1) 197 v = SplitResult(scheme, netloc, url, query, fragment) 198 _parse_cache[key] = v 199 return v There is an issue here (or a few of them) as follows: * if url[:1] is already lowercase (equals "http") (line 173) then .lower() on line 174 is reduntant: 174 scheme = url[:i].lower() # <--- no need for .lower() b/c value is "http" * OTOH line 173 could refactor the condition and match URLs where the scheme is uppercase. For example 173 if url[:i].lower() == 'http': # optimize the common case * The code as is returns the same results (as far as I've tested it) for both: urlsplit("http://github.com/atodorov/repo.git?param=value#myfragment") urlsplit("HTTP://github.com/atodorov/repo.git?param=value#myfragment") urlsplit("HTtP://github.com/atodorov/repo.git?param=value#myfragment") but the last 2 invocations also go through lines 185-199 * Lines 174-184 are essentially the same as lines 189-199. The only optimization I can see is avoiding the for loop around lines 185-187 which checks for valid characters in the URL scheme and executes only a few loops b/c scheme names are quite short usually. My personal vote goes for removal of lines 173-184. Version-Release number of selected component (if applicable): This is present in both Python 3 and Python 2 on all versions I have access to: python3-libs-3.4.1-16.fc21.x86_64.rpm python-libs-2.7.8-5.fc21.x86_64.rpm python-libs-2.7.5-16.el7.x86_64.rpm python-libs-2.6.6-52.el6.x86_64 Versions are from Fedora Rawhide and RHEL. Also the same code is present in the Mercurial repository. Bug first reported as https://bugzilla.redhat.com/show_bug.cgi?id=1160603 and now filing here for upstream consideration.
msg231283 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2014-11-17 14:23
Thanks for filing this. Does test cases pass after removal of those lines? (I will be surprised) and if the tests fail, then after analyzing it, this could only be considered a new change (thus change to made in latest python) in order not to break compatibility.
msg231309 - (view)	Author: Alexander Todorov (Alexander.Todorov) *	Date: 2014-11-18 08:53
> Does test cases pass after removal of those lines? (I will be surprised) Yes they do. Also see my detailed explanation above, there's no reason for test cases to fail.
msg238254 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-03-17 02:01
The patch seems sensible. The only behaviour change I can forsee would be the odd case of http:1234 no longer being parsed like this: >>> urlsplit("http:1234") SplitResult(scheme='http', netloc='', path='1234', query='', fragment='') Instead it would be parsed the same as HTTP:1234 (or tel:1234!): >>> urlsplit("HTTP:1234") SplitResult(scheme='', netloc='', path='HTTP:1234', query='', fragment='') If optimizing for “http:” really is important, it might still be done without the code duplication. Other options might be factoring a subroutine, using str.strip(), set.issubset(), or regular expressions.
msg271712 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-07-30 23:52
Issue 27657 has been opened about the quirk with numeric URLs
msg289586 - (view)	Author: Tim Graham (Tim.Graham) *	Date: 2017-03-14 13:55
I sent a pull request for this issue's dependency (issue 27657) and added a second commit there with this patch. A test added in the first commit demonstrates that removing this code no longer results in the behavior change described in msg238254.
msg354890 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2019-10-18 13:07
New changeset 5a88d50ff013a64fbdb25b877c87644a9034c969 by Senthil Kumaran (Tim Graham) in branch 'master': bpo-27657: Fix urlparse() with numeric paths (#661) https://github.com/python/cpython/commit/5a88d50ff013a64fbdb25b877c87644a9034c969
msg354895 - (view)	Author: miss-islington (miss-islington)	Date: 2019-10-18 13:24
New changeset 82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f by Miss Islington (bot) in branch '3.7': bpo-27657: Fix urlparse() with numeric paths (GH-661) https://github.com/python/cpython/commit/82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f
msg354904 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2019-10-18 15:23
New changeset 0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384 by Senthil Kumaran in branch '3.8': [3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (#16839) https://github.com/python/cpython/commit/0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384
msg393915 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2021-05-19 01:37
There are still a lot of cleanliness and even logical issues lurking within urllib.parse. But this bug is dated and other changes have happened since. I'm closing it as out of date. Open new issues for specific action items.

History
Date	User	Action	Args
2022-04-11 14:58:10	admin	set	github: 67080
2021-05-19 01:37:03	gregory.p.smith	set	status: open -> closed nosy: + gregory.p.smith messages: + msg393915 resolution: out of date stage: patch review -> resolved
2019-10-18 15:23:21	orsenthil	set	messages: + msg354904
2019-10-18 13:51:49	orsenthil	set	pull_requests: + pull_request16391
2019-10-18 13:24:32	miss-islington	set	nosy: + miss-islington messages: + msg354895
2019-10-18 13:07:37	miss-islington	set	pull_requests: + pull_request16385
2019-10-18 13:07:36	orsenthil	set	messages: + msg354890
2017-03-14 13:55:02	Tim.Graham	set	nosy: + Tim.Graham messages: + msg289586 pull_requests: + pull_request549
2016-07-30 23:52:19	martin.panter	set	dependencies: + urlparse fails if the path is numeric messages: + msg271712
2015-03-17 02:01:15	martin.panter	set	nosy: + martin.panter messages: + msg238254
2014-11-18 08:53:29	Alexander.Todorov	set	messages: + msg231309
2014-11-17 14:23:34	orsenthil	set	messages: + msg231283
2014-11-17 13:28:24	berker.peksag	set	nosy: + orsenthil title: [patch]: code removal from urllib.parse.urlsplit() -> code removal from urllib.parse.urlsplit() stage: patch review versions: - Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.6
2014-11-17 10:42:24	Alexander.Todorov	create