classification
Title: code removal from urllib.parse.urlsplit()
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.5
process
Status: closed Resolution: out of date
Dependencies: 27657 Superseder:
Assigned To: Nosy List: Alexander.Todorov, Tim.Graham, gregory.p.smith, martin.panter, miss-islington, orsenthil
Priority: normal Keywords: patch

Created on 2014-11-17 10:42 by Alexander.Todorov, last changed 2021-05-19 01:37 by gregory.p.smith. This issue is now closed.

Files
File name Uploaded Description Edit
urlparse_refactor.patch Alexander.Todorov, 2014-11-17 10:42 patch which removes unnecessary lines from urlparse() review
Pull Requests
URL Status Linked Edit
PR 661 Tim.Graham, 2017-03-14 13:55
PR 16837 merged miss-islington, 2019-10-18 13:07
PR 16839 merged orsenthil, 2019-10-18 13:51
Messages (10)
msg231278 - (view) Author: Alexander Todorov (Alexander.Todorov) * Date: 2014-11-17 10:42
In the urllib.parse (or urlparse on Python 2.X) module there is this function:

    157 def urlsplit(url, scheme='', allow_fragments=True):
    158     """Parse a URL into 5 components:
    159     <scheme>://<netloc>/<path>?<query>#<fragment>
    160     Return a 5-tuple: (scheme, netloc, path, query, fragment).
    161     Note that we don't break the components up in smaller bits
    162     (e.g. netloc is a single string) and we don't expand % escapes."""
    163     allow_fragments = bool(allow_fragments)
    164     key = url, scheme, allow_fragments, type(url), type(scheme)
    165     cached = _parse_cache.get(key, None)
    166     if cached:
    167         return cached
    168     if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
    169         clear_cache()
    170     netloc = query = fragment = ''
    171     i = url.find(':')
    172     if i > 0:
    173         if url[:i] == 'http': # optimize the common case
    174             scheme = url[:i].lower()
    175             url = url[i+1:]
    176             if url[:2] == '//':
    177                 netloc, url = _splitnetloc(url, 2)
    178             if allow_fragments and '#' in url:
    179                 url, fragment = url.split('#', 1)
    180             if '?' in url:
    181                 url, query = url.split('?', 1)
    182             v = SplitResult(scheme, netloc, url, query, fragment)
    183             _parse_cache[key] = v
    184             return v
    185         for c in url[:i]:
    186             if c not in scheme_chars:
    187                 break
    188         else:
    189             scheme, url = url[:i].lower(), url[i+1:]
    190 
    191     if url[:2] == '//':
    192         netloc, url = _splitnetloc(url, 2)
    193     if allow_fragments and '#' in url:
    194         url, fragment = url.split('#', 1)
    195     if '?' in url:
    196         url, query = url.split('?', 1)
    197     v = SplitResult(scheme, netloc, url, query, fragment)
    198     _parse_cache[key] = v
    199     return v



There is an issue here (or a few of them) as follows:

* if url[:1] is already lowercase (equals "http") (line 173) then .lower() on  line 174 is reduntant:
174    scheme = url[:i].lower() # <--- no need for .lower() b/c value is "http"


* OTOH line 173 could refactor the condition and match URLs where the scheme is uppercase. For example
    173         if url[:i].lower() == 'http': # optimize the common case

* The code as is returns the same results (as far as I've tested it) for both:
urlsplit("http://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTTP://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTtP://github.com/atodorov/repo.git?param=value#myfragment")

but the last 2 invocations also go through lines 185-199


* Lines 174-184 are essentially the same as lines 189-199. The only optimization I can see is avoiding the for loop around lines 185-187 which checks for valid characters in the URL scheme and executes only a few loops b/c scheme names are quite short usually.


My personal vote goes for removal of lines 173-184. 


Version-Release number of selected component (if applicable):

This is present in both Python 3 and Python 2 on all versions I have access to:

python3-libs-3.4.1-16.fc21.x86_64.rpm
python-libs-2.7.8-5.fc21.x86_64.rpm
python-libs-2.7.5-16.el7.x86_64.rpm
python-libs-2.6.6-52.el6.x86_64

Versions are from Fedora Rawhide and RHEL. Also the same code is present in the Mercurial repository.

Bug first reported as
https://bugzilla.redhat.com/show_bug.cgi?id=1160603

and now filing here for upstream consideration.
msg231283 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2014-11-17 14:23
Thanks for filing this. Does test cases pass after removal of those lines? (I will be surprised) and if the tests fail, then after analyzing it, this could only be considered a new change (thus change to made in latest python) in order not to break compatibility.
msg231309 - (view) Author: Alexander Todorov (Alexander.Todorov) * Date: 2014-11-18 08:53
> Does test cases pass after removal of those lines? (I will be surprised) 

Yes they do. Also see my detailed explanation above, there's no reason for test cases to fail.
msg238254 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-03-17 02:01
The patch seems sensible. The only behaviour change I can forsee would be the odd case of http:1234 no longer being parsed like this:

>>> urlsplit("http:1234")
SplitResult(scheme='http', netloc='', path='1234', query='', fragment='')

Instead it would be parsed the same as HTTP:1234 (or tel:1234!):

>>> urlsplit("HTTP:1234")
SplitResult(scheme='', netloc='', path='HTTP:1234', query='', fragment='')

If optimizing for “http:” really is important, it might still be done without the code duplication. Other options might be factoring a subroutine, using str.strip(), set.issubset(), or regular expressions.
msg271712 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-07-30 23:52
Issue 27657 has been opened about the quirk with numeric URLs
msg289586 - (view) Author: Tim Graham (Tim.Graham) * Date: 2017-03-14 13:55
I sent a pull request for this issue's dependency (issue 27657) and added a second commit there with this patch. A test added in the first commit demonstrates that removing this code no longer results in the behavior change described in msg238254.
msg354890 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2019-10-18 13:07
New changeset 5a88d50ff013a64fbdb25b877c87644a9034c969 by Senthil Kumaran (Tim Graham) in branch 'master':
bpo-27657: Fix urlparse() with numeric paths (#661)
https://github.com/python/cpython/commit/5a88d50ff013a64fbdb25b877c87644a9034c969
msg354895 - (view) Author: miss-islington (miss-islington) Date: 2019-10-18 13:24
New changeset 82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f by Miss Islington (bot) in branch '3.7':
bpo-27657: Fix urlparse() with numeric paths (GH-661)
https://github.com/python/cpython/commit/82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f
msg354904 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2019-10-18 15:23
New changeset 0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384 by Senthil Kumaran in branch '3.8':
[3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (#16839)
https://github.com/python/cpython/commit/0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384
msg393915 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2021-05-19 01:37
There are still a lot of cleanliness and even logical issues lurking within urllib.parse.  But this bug is dated and other changes have happened since.  I'm closing it as out of date.  Open new issues for specific action items.
History
Date User Action Args
2021-05-19 01:37:03gregory.p.smithsetstatus: open -> closed

nosy: + gregory.p.smith
messages: + msg393915

resolution: out of date
stage: patch review -> resolved
2019-10-18 15:23:21orsenthilsetmessages: + msg354904
2019-10-18 13:51:49orsenthilsetpull_requests: + pull_request16391
2019-10-18 13:24:32miss-islingtonsetnosy: + miss-islington
messages: + msg354895
2019-10-18 13:07:37miss-islingtonsetpull_requests: + pull_request16385
2019-10-18 13:07:36orsenthilsetmessages: + msg354890
2017-03-14 13:55:02Tim.Grahamsetnosy: + Tim.Graham

messages: + msg289586
pull_requests: + pull_request549
2016-07-30 23:52:19martin.pantersetdependencies: + urlparse fails if the path is numeric
messages: + msg271712
2015-03-17 02:01:15martin.pantersetnosy: + martin.panter
messages: + msg238254
2014-11-18 08:53:29Alexander.Todorovsetmessages: + msg231309
2014-11-17 14:23:34orsenthilsetmessages: + msg231283
2014-11-17 13:28:24berker.peksagsetnosy: + orsenthil
title: [patch]: code removal from urllib.parse.urlsplit() -> code removal from urllib.parse.urlsplit()

stage: patch review
versions: - Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.6
2014-11-17 10:42:24Alexander.Todorovcreate