
Author Alexander.Todorov
Recipients Alexander.Todorov
Date 2014-11-17.10:42:23
Message-id <1416220944.83.0.361625539874.issue22891@psf.upfronthosting.co.za>
In-reply-to
Content
In the urllib.parse (or urlparse on Python 2.X) module there is this function:

    157 def urlsplit(url, scheme='', allow_fragments=True):
    158     """Parse a URL into 5 components:
    159     <scheme>://<netloc>/<path>?<query>#<fragment>
    160     Return a 5-tuple: (scheme, netloc, path, query, fragment).
    161     Note that we don't break the components up in smaller bits
    162     (e.g. netloc is a single string) and we don't expand % escapes."""
    163     allow_fragments = bool(allow_fragments)
    164     key = url, scheme, allow_fragments, type(url), type(scheme)
    165     cached = _parse_cache.get(key, None)
    166     if cached:
    167         return cached
    168     if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
    169         clear_cache()
    170     netloc = query = fragment = ''
    171     i = url.find(':')
    172     if i > 0:
    173         if url[:i] == 'http': # optimize the common case
    174             scheme = url[:i].lower()
    175             url = url[i+1:]
    176             if url[:2] == '//':
    177                 netloc, url = _splitnetloc(url, 2)
    178             if allow_fragments and '#' in url:
    179                 url, fragment = url.split('#', 1)
    180             if '?' in url:
    181                 url, query = url.split('?', 1)
    182             v = SplitResult(scheme, netloc, url, query, fragment)
    183             _parse_cache[key] = v
    184             return v
    185         for c in url[:i]:
    186             if c not in scheme_chars:
    187                 break
    188         else:
    189             scheme, url = url[:i].lower(), url[i+1:]
    190 
    191     if url[:2] == '//':
    192         netloc, url = _splitnetloc(url, 2)
    193     if allow_fragments and '#' in url:
    194         url, fragment = url.split('#', 1)
    195     if '?' in url:
    196         url, query = url.split('?', 1)
    197     v = SplitResult(scheme, netloc, url, query, fragment)
    198     _parse_cache[key] = v
    199     return v



There are a few issues here:

* If url[:i] equals "http" (line 173), it is already lowercase, so the .lower() call on line 174 is redundant:
174    scheme = url[:i].lower() # <--- no need for .lower() b/c value is "http"


* On the other hand, the condition on line 173 could be changed so that it also matches URLs whose scheme is uppercase or mixed case. For example:
    173         if url[:i].lower() == 'http': # optimize the common case

* As far as I've tested, the code as is returns the same result for all three of these calls:
urlsplit("http://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTTP://github.com/atodorov/repo.git?param=value#myfragment")
urlsplit("HTtP://github.com/atodorov/repo.git?param=value#myfragment")

but the last two invocations miss the fast path on lines 173-184 and go through lines 185-199 instead.
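The claim that all three calls agree can be checked with a short script. This is a minimal sketch using the Python 3 module name urllib.parse (on Python 2.X the import would be from urlparse instead); the variable names are only illustrative.

```python
from urllib.parse import urlsplit

# The three URLs from the report: they differ only in the case of the scheme.
urls = [
    "http://github.com/atodorov/repo.git?param=value#myfragment",
    "HTTP://github.com/atodorov/repo.git?param=value#myfragment",
    "HTtP://github.com/atodorov/repo.git?param=value#myfragment",
]

results = [urlsplit(u) for u in urls]

# SplitResult is a named tuple, so results compare field by field.
for url, result in zip(urls, results):
    print(url, "->", result)

print("all equal:", all(r == results[0] for r in results))
```

Only the first URL can take the "http" fast path; the other two reach the same result through the generic scheme check.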


* Lines 174-184 are essentially the same as lines 189-199. The only optimization I can see is that the fast path avoids the for loop on lines 185-187, which checks for valid characters in the URL scheme. That loop runs only a few iterations because scheme names are usually quite short.


My personal vote is to remove lines 173-184.
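For illustration, here is a rough sketch of what the function could look like with the fast path removed. It is not a patch against the standard library: the names urlsplit_no_fast_path, split_netloc and SCHEME_CHARS are hypothetical stand-ins for the module's own helpers, the cache handling is left out, and the SplitResult here is a plain named tuple.

```python
import string
from collections import namedtuple

# Stand-ins for the module's own definitions (assumptions for this sketch).
SplitResult = namedtuple("SplitResult", "scheme netloc path query fragment")
SCHEME_CHARS = string.ascii_letters + string.digits + "+-."


def split_netloc(url, start=0):
    """Split off the netloc, which ends at the first '/', '?' or '#'."""
    end = len(url)
    for c in "/?#":
        pos = url.find(c, start)
        if pos >= 0:
            end = min(end, pos)
    return url[start:end], url[end:]


def urlsplit_no_fast_path(url, scheme="", allow_fragments=True):
    """Same logic as lines 170-199 of the listing, minus lines 173-184 and the cache."""
    netloc = query = fragment = ""
    i = url.find(":")
    # One generic check replaces the special case for "http".
    if i > 0 and all(c in SCHEME_CHARS for c in url[:i]):
        scheme, url = url[:i].lower(), url[i + 1:]
    if url[:2] == "//":
        netloc, url = split_netloc(url, 2)
    if allow_fragments and "#" in url:
        url, fragment = url.split("#", 1)
    if "?" in url:
        url, query = url.split("?", 1)
    return SplitResult(scheme, netloc, url, query, fragment)


print(urlsplit_no_fast_path("HTtP://github.com/atodorov/repo.git?param=value#myfragment"))
```

With this shape, "http", "HTTP" and "HTtP" all take the same code path, at the cost of running the character check on the few characters of the scheme.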


Version-Release number of selected component (if applicable):

This is present in both Python 3 and Python 2 on all versions I have access to:

python3-libs-3.4.1-16.fc21.x86_64.rpm
python-libs-2.7.8-5.fc21.x86_64.rpm
python-libs-2.7.5-16.el7.x86_64.rpm
python-libs-2.6.6-52.el6.x86_64

Versions are from Fedora Rawhide and RHEL. Also the same code is present in the Mercurial repository.

Bug first reported as
https://bugzilla.redhat.com/show_bug.cgi?id=1160603

and now filing here for upstream consideration.