classification
Title: code removal from urllib.parse.urlsplit()
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.5
process
Status: open Resolution:
Dependencies: 27657 Superseder:
Assigned To: Nosy List: Alexander.Todorov, Tim.Graham, martin.panter, orsenthil
Priority: normal Keywords: patch

Created on 2014-11-17 10:42 by Alexander.Todorov, last changed 2017-03-14 13:55 by Tim.Graham.

Files
File name Uploaded Description Edit
urlparse_refactor.patch Alexander.Todorov, 2014-11-17 10:42 patch which removes unnecessary lines from urlparse() review
Pull Requests
URL Status Linked Edit
PR 661 Tim.Graham, 2017-03-14 13:55
Messages (6)
msg231278 - (view) Author: Alexander Todorov (Alexander.Todorov) * Date: 2014-11-17 10:42
In the urllib.parse (or urlparse on Python 2.X) module there is this function:

    157 def urlsplit(url, scheme='', allow_fragments=True):
    158     """Parse a URL into 5 components:
    159     <scheme>://<netloc>/<path>?<query>#<fragment>
    160     Return a 5-tuple: (scheme, netloc, path, query, fragment).
    161     Note that we don't break the components up in smaller bits
    162     (e.g. netloc is a single string) and we don't expand % escapes."""
    163     allow_fragments = bool(allow_fragments)
    164     key = url, scheme, allow_fragments, type(url), type(scheme)
    165     cached = _parse_cache.get(key, None)
    166     if cached:
    167         return cached
    168     if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
    169         clear_cache()
    170     netloc = query = fragment = ''
    171     i = url.find(':')
    172     if i > 0:
    173         if url[:i] == 'http': # optimize the common case
    174             scheme = url[:i].lower()
    175             url = url[i+1:]
    176             if url[:2] == '//':
    177                 netloc, url = _splitnetloc(url, 2)
    178             if allow_fragments and '#' in url:
    179                 url, fragment = url.split('#', 1)
    180             if '?' in url:
    181                 url, query = url.split('?', 1)
    182             v = SplitResult(scheme, netloc, url, query, fragment)
    183             _parse_cache[key] = v
    184             return v
    185         for c in url[:i]:
    186             if c not in scheme_chars:
    187                 break
    188         else:
    189             scheme, url = url[:i].lower(), url[i+1:]
    190 
    191     if url[:2] == '//':
    192         netloc, url = _splitnetloc(url, 2)
    193     if allow_fragments and '#' in url:
    194         url, fragment = url.split('#', 1)
    195     if '?' in url:
    196         url, query = url.split('?', 1)
    197     v = SplitResult(scheme, netloc, url, query, fragment)
    198     _parse_cache[key] = v
    199     return v



There is an issue here (or a few of them) as follows:

* if url[:1] is already lowercase (equals "http") (line 173) then .lower() on  line 174 is reduntant:
174    scheme = url[:i].lower() # <--- no need for .lower() b/c value is "http"


* OTOH line 173 could refactor the condition and match URLs where the scheme is uppercase. For example
    173         if url[:i].lower() == 'http': # optimize the common case

* The code as is returns the same results (as far as I've tested it) for both:
urlsplit("http://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTTP://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTtP://github.com/atodorov/repo.git?param=value#myfragment")

but the last 2 invocations also go through lines 185-199


* Lines 174-184 are essentially the same as lines 189-199. The only optimization I can see is avoiding the for loop around lines 185-187 which checks for valid characters in the URL scheme and executes only a few loops b/c scheme names are quite short usually.


My personal vote goes for removal of lines 173-184. 


Version-Release number of selected component (if applicable):

This is present in both Python 3 and Python 2 on all versions I have access to:

python3-libs-3.4.1-16.fc21.x86_64.rpm
python-libs-2.7.8-5.fc21.x86_64.rpm
python-libs-2.7.5-16.el7.x86_64.rpm
python-libs-2.6.6-52.el6.x86_64

Versions are from Fedora Rawhide and RHEL. Also the same code is present in the Mercurial repository.

Bug first reported as
https://bugzilla.redhat.com/show_bug.cgi?id=1160603

and now filing here for upstream consideration.
msg231283 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2014-11-17 14:23
Thanks for filing this. Does test cases pass after removal of those lines? (I will be surprised) and if the tests fail, then after analyzing it, this could only be considered a new change (thus change to made in latest python) in order not to break compatibility.
msg231309 - (view) Author: Alexander Todorov (Alexander.Todorov) * Date: 2014-11-18 08:53
> Does test cases pass after removal of those lines? (I will be surprised) 

Yes they do. Also see my detailed explanation above, there's no reason for test cases to fail.
msg238254 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-03-17 02:01
The patch seems sensible. The only behaviour change I can forsee would be the odd case of http:1234 no longer being parsed like this:

>>> urlsplit("http:1234")
SplitResult(scheme='http', netloc='', path='1234', query='', fragment='')

Instead it would be parsed the same as HTTP:1234 (or tel:1234!):

>>> urlsplit("HTTP:1234")
SplitResult(scheme='', netloc='', path='HTTP:1234', query='', fragment='')

If optimizing for “http:” really is important, it might still be done without the code duplication. Other options might be factoring a subroutine, using str.strip(), set.issubset(), or regular expressions.
msg271712 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-07-30 23:52
Issue 27657 has been opened about the quirk with numeric URLs
msg289586 - (view) Author: Tim Graham (Tim.Graham) * Date: 2017-03-14 13:55
I sent a pull request for this issue's dependency (issue 27657) and added a second commit there with this patch. A test added in the first commit demonstrates that removing this code no longer results in the behavior change described in msg238254.
History
Date User Action Args
2017-03-14 13:55:02Tim.Grahamsetnosy: + Tim.Graham

messages: + msg289586
pull_requests: + pull_request549
2016-07-30 23:52:19martin.pantersetdependencies: + urlparse fails if the path is numeric
messages: + msg271712
2015-03-17 02:01:15martin.pantersetnosy: + martin.panter
messages: + msg238254
2014-11-18 08:53:29Alexander.Todorovsetmessages: + msg231309
2014-11-17 14:23:34orsenthilsetmessages: + msg231283
2014-11-17 13:28:24berker.peksagsetnosy: + orsenthil
title: [patch]: code removal from urllib.parse.urlsplit() -> code removal from urllib.parse.urlsplit()

stage: patch review
versions: - Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.6
2014-11-17 10:42:24Alexander.Todorovcreate