
Author ezio.melotti
Recipients barry, ezio.melotti, loewis, nadeem.vawda, orsenthil, r.david.murray, rosslagerwall
Date 2012-09-16.04:23:47
Message-id <1347769432.06.0.188098961985.issue11454@psf.upfronthosting.co.za>
Content
I tried removing a few unused regexes and inlining some of the others (the re module has its own cache anyway, and they don't seem to be documented), but it didn't get much faster (see attached patch).

I then put the second list of email imports from the previous message in a file and ran it with cProfile; these are the results:

=== Without patch ===

$ time ./python -m issue11454_imp2
[69308 refs]

real    0m0.337s
user    0m0.312s
sys     0m0.020s

$ ./python -m cProfile -s time issue11454_imp2.py
         15130 function calls (14543 primitive calls) in 0.191 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       26    0.029    0.001    0.029    0.001 {built-in method loads}
     1248    0.015    0.000    0.018    0.000 sre_parse.py:184(__next)
        3    0.010    0.003    0.015    0.005 sre_compile.py:301(_optimize_unicode)
    48/17    0.009    0.000    0.037    0.002 sre_parse.py:418(_parse)
     30/1    0.008    0.000    0.191    0.191 {built-in method exec}
       82    0.007    0.000    0.024    0.000 {built-in method __build_class__}
       25    0.006    0.000    0.024    0.001 sre_compile.py:207(_optimize_charset)
        8    0.005    0.001    0.005    0.001 {built-in method load_dynamic}
     1122    0.005    0.000    0.022    0.000 sre_parse.py:209(get)
      177    0.005    0.000    0.005    0.000 {built-in method stat}
      107    0.005    0.000    0.012    0.000 <frozen importlib._bootstrap>:1350(find_loader)
2944/2919    0.004    0.000    0.004    0.000 {built-in method len}
    69/15    0.003    0.000    0.028    0.002 sre_compile.py:32(_compile)
        9    0.003    0.000    0.003    0.000 sre_compile.py:258(_mk_bitmap)
       94    0.002    0.000    0.003    0.000 <frozen importlib._bootstrap>:74(_path_join)


=== With patch ===

$ time ./python -m issue11454_imp2
[69117 refs]

real    0m0.319s
user    0m0.304s
sys     0m0.012s

$ ./python -m cProfile -s time issue11454_imp2.py
         11281 function calls (10762 primitive calls) in 0.162 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       21    0.022    0.001    0.022    0.001 {built-in method loads}
        3    0.011    0.004    0.015    0.005 sre_compile.py:301(_optimize_unicode)
      708    0.008    0.000    0.010    0.000 sre_parse.py:184(__next)
     30/1    0.008    0.000    0.238    0.238 {built-in method exec}
       82    0.007    0.000    0.023    0.000 {built-in method __build_class__}
      187    0.005    0.000    0.005    0.000 {built-in method stat}
        8    0.005    0.001    0.005    0.001 {built-in method load_dynamic}
      107    0.005    0.000    0.012    0.000 <frozen importlib._bootstrap>:1350(find_loader)
     29/8    0.005    0.000    0.020    0.002 sre_parse.py:418(_parse)
       11    0.004    0.000    0.020    0.002 sre_compile.py:207(_optimize_charset)
      643    0.003    0.000    0.012    0.000 sre_parse.py:209(get)
        5    0.003    0.001    0.003    0.001 {built-in method dumps}
       94    0.002    0.000    0.003    0.000 <frozen importlib._bootstrap>:74(_path_join)
      257    0.002    0.000    0.002    0.000 quoprimime.py:56(<genexpr>)
       26    0.002    0.000    0.116    0.004 <frozen importlib._bootstrap>:938(get_code)
1689/1676    0.002    0.000    0.002    0.000 {built-in method len}
       31    0.002    0.000    0.003    0.000 <frozen importlib._bootstrap>:1034(get_data)
      256    0.002    0.000    0.002    0.000 {method 'setdefault' of 'dict' objects}
      119    0.002    0.000    0.003    0.000 <frozen importlib._bootstrap>:86(_path_split)
       35    0.002    0.000    0.019    0.001 <frozen importlib._bootstrap>:1468(_find_module)
       34    0.002    0.000    0.015    0.000 <frozen importlib._bootstrap>:1278(_get_loader)
     39/6    0.002    0.000    0.023    0.004 sre_compile.py:32(_compile)
     26/3    0.001    0.000    0.235    0.078 <frozen importlib._bootstrap>:853(_load_module)
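
For anyone who wants to repeat this kind of measurement without the command line, here is a minimal sketch that profiles an import in-process and sorts by internal time, like `-s time` above. The helper name profile_import is hypothetical, and email.utils is only a stand-in for the import list from the previous message. It should be run in a fresh interpreter, otherwise modules that are already cached will not show up.

```python
import cProfile
import importlib
import io
import pstats


def profile_import(module_name, limit=15):
    """Profile importing module_name and return the pstats report as text.

    Hypothetical helper: it mirrors `python -m cProfile -s time`, but only
    measures the import itself.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        importlib.import_module(module_name)
    finally:
        profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "tottime" is the internal-time sort key used in the tables above.
    stats.sort_stats("tottime").print_stats(limit)
    return buffer.getvalue()


if __name__ == "__main__":
    # email.utils is an assumed stand-in for the real list of imports.
    print(profile_import("email.utils"))
```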


The time spent in sre_compile.py:301(_optimize_unicode) most likely comes from email.utils._has_surrogates (there's a further speedup when it's commented out):
    _has_surrogates = re.compile('([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)').search

This is used in a number of places, so it can't be inlined.  I wanted to optimize it but I'm not sure what it's supposed to do.  It matches lone low surrogates, but not lone high ones, and matches some invalid sequences, but not others:
>>> _has_surrogates('\ud800')  # lone high
>>> _has_surrogates('\udc00')  # lone low
<_sre.SRE_Match object at 0x9ae00e8>
>>> _has_surrogates('\ud800\udc00')  # valid pair (high+low)
>>> _has_surrogates('\ud800\ud800\udc00')  # invalid sequence (lone high, valid high+low)
>>> _has_surrogates('\udc00\ud800\ud800\udc00')  # invalid sequence (lone low, lone high, valid high+low)
<_sre.SRE_Match object at 0x9ae0028>
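
If the intent is simply "does this string contain any surrogate code point" (an assumption, since the intended behaviour is exactly what is unclear here), two simpler checks are possible. The sketch below is not the current implementation and both function names are hypothetical. The first uses a single character class; the second avoids the regex entirely by relying on the fact that the UTF-8 codec refuses to encode surrogates. Note that both report True for every example above, including '\ud800' and '\ud800\udc00', so they are not drop-in equivalents of the existing pattern.

```python
import re

# Hypothetical alternative 1: one character class covering all surrogates
# (U+D800..U+DFFF), with no alternation or anchors.
_any_surrogate = re.compile('[\ud800-\udfff]').search


def has_any_surrogate_re(s):
    """Return True if s contains any surrogate code point (regex version)."""
    return _any_surrogate(s) is not None


def has_any_surrogate_encode(s):
    """Return True if s contains any surrogate code point (no regex).

    The strict UTF-8 encoder raises UnicodeEncodeError on surrogates, so a
    failed encode means at least one surrogate is present.
    """
    try:
        s.encode('utf-8')
    except UnicodeEncodeError:
        return True
    return False


if __name__ == "__main__":
    samples = ['abc', '\ud800', '\udc00', '\ud800\udc00']
    for sample in samples:
        print(ascii(sample),
              has_any_surrogate_re(sample),
              has_any_surrogate_encode(sample))
```

The second version does not compile anything at import time, which would remove the _optimize_unicode cost, but it does change what is matched.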

FWIW this was introduced in email.message in 1a041f364916 and then moved to email.utils in 9388c671d52d.