Author ezio.melotti
Recipients barry, ezio.melotti, loewis, nadeem.vawda, orsenthil, r.david.murray, rosslagerwall
Date 2012-09-16.05:28:05
Message-id <1347773287.07.0.254860780321.issue11454@psf.upfronthosting.co.za>
In-reply-to
Content
Given that high surrogates are U+D800..U+DBFF and low ones are U+DC00..U+DFFF, '([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)' means "a low surrogate, preceded by either a character that is not a high surrogate or the beginning of the string, and followed by either a character that is not a low surrogate or the end of the string".
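
To make the behaviour of that pattern concrete, here is a small sketch that runs it against a few inputs. The helper name has_lone_low_surrogate and the sample strings are made up for illustration; the pattern is the one quoted above, written as a raw string so that the re module interprets the escapes itself.

```python
import re

# Pattern quoted in the message, as a raw string: \A, \Z and the \uXXXX
# escapes are interpreted by the re module rather than by the str literal.
current_regex = re.compile(
    r'([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)')

def has_lone_low_surrogate(s):
    """Hypothetical helper: True if the pattern finds a match in s."""
    return current_regex.search(s) is not None

plain = 'abc'                # no surrogates at all
lone_low = 'a\udc80b'        # low surrogate between ordinary characters
paired = 'a\ud800\udc00b'    # low surrogate right after a high surrogate
only_low = '\udc80'          # low surrogate at both string start and end

# Print booleans only; printing the surrogates themselves may fail on
# terminals that cannot encode them.
for name, value in [('plain', plain), ('lone_low', lone_low),
                    ('paired', paired), ('only_low', only_low)]:
    print(name, has_lone_low_surrogate(value))
```

The paired case does not match because the low surrogate is preceded by a high one, which is exactly what the negated character class excludes.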

PEP 383 says "With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF".
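
As a quick illustration of that range, the sketch below decodes some bytes with the surrogateescape error handler and lists the resulting lone surrogates. The sample bytes and variable names are arbitrary and only serve as an example.

```python
# Arbitrary sample: three bytes >= 128 that ASCII cannot decode.
raw = b'caf\xe9 \x80\xff'

# With errors='surrogateescape', each undecodable byte 0xNN becomes U+DCNN.
text = raw.decode('ascii', errors='surrogateescape')

# Collect the code points that fall in the PEP 383 range U+DC80..U+DCFF.
escaped = [hex(ord(c)) for c in text if '\udc80' <= c <= '\udcff']
print(escaped)

# Encoding with the same handler restores the original bytes.
round_trip = text.encode('ascii', errors='surrogateescape')
print(round_trip == raw)
```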

If I change the regex to _has_surrogates = re.compile('[\udc80-\udcff]').search, the tests still pass, but the startup time does not improve (note: the previous regex also matched all the surrogates in this range; however, I'm not sure how well this is tested).

If I change the implementation to
_pep383_surrogates = set(map(chr, range(0xDC80, 0xDCFF+1)))
def _has_surrogates(s):
    return any(c in _pep383_surrogates for c in s)

the tests still pass and the startup is ~15ms faster here:

$ time ./python -m issue11454_imp2
[68837 refs]

real    0m0.305s
user    0m0.288s
sys     0m0.012s

However, using this function instead of the regex is ~10x slower at runtime.  The shorter regex is ~7x faster than the current one, but it does not improve the startup time.
Assuming the shorter regex is correct, it can still be called inside a function or used with functools.partial, so that it is not compiled at import time.  This results in an improved startup time and a ~2x improvement at runtime (so it's a win-win).
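
A minimal sketch of those two variants follows, assuming the shorter regex is the right one. The names SHORT_REGEX, has_surrogates_func and has_surrogates_partial are hypothetical and are not taken from the patch. In both cases nothing is compiled at import time: re.search compiles the pattern on the first call and then reuses the re module's internal cache.

```python
import functools
import re

# Shorter pattern covering only the PEP 383 range (assumed to be correct).
SHORT_REGEX = r'[\udc80-\udcff]'

def has_surrogates_func(s):
    """Variant 1: call re.search inside a function; returns a bool."""
    return re.search(SHORT_REGEX, s) is not None

# Variant 2: bind the pattern with functools.partial. Like the original
# compiled .search, this returns a match object or None.
has_surrogates_partial = functools.partial(re.search, SHORT_REGEX)

clean = 'plain ascii'
escaped = 'bad byte: \udcff'
other_low = '\udc00'   # a low surrogate outside U+DC80..U+DCFF

print(has_surrogates_func(clean), has_surrogates_func(escaped))
print(has_surrogates_partial(clean) is None,
      has_surrogates_partial(escaped) is not None)
```

Note that the shorter pattern deliberately ignores surrogates outside U+DC80..U+DCFF, so other_low is not reported, whereas the longer pattern would report it.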
See attached patch for benchmarks.

This is a sample result:
 17.01 usec/pass  <- re.compile(current_regex).search
  2.20 usec/pass  <- re.compile(short_regex).search
148.18 usec/pass  <- return any(c in surrogates for c in s)
106.35 usec/pass  <- for c in s: if c in surrogates: return True
  8.40 usec/pass  <- return re.search(short_regex, s)
  8.20 usec/pass  <- functools.partial(re.search, short_regex)
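
For readers without the attachment, a comparison along these lines could be sketched with timeit as below. This is not the attached patch: the sample input, the pass count and all function names are assumptions, and the absolute numbers will differ from the ones above.

```python
import functools
import re
import timeit

current_regex = r'([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)'
short_regex = r'[\udc80-\udcff]'
surrogates = set(map(chr, range(0xDC80, 0xDCFF + 1)))

def any_in_set(s):
    return any(c in surrogates for c in s)

def loop_in_set(s):
    for c in s:
        if c in surrogates:
            return True
    return False

def search_in_function(s):
    return re.search(short_regex, s)

# Labels mirror the sample result; the callables are illustrative.
candidates = {
    're.compile(current_regex).search': re.compile(current_regex).search,
    're.compile(short_regex).search': re.compile(short_regex).search,
    'return any(c in surrogates for c in s)': any_in_set,
    'for c in s: if c in surrogates: return True': loop_in_set,
    'return re.search(short_regex, s)': search_in_function,
    'functools.partial(re.search, short_regex)':
        functools.partial(re.search, short_regex),
}

# Hypothetical input: a surrogate-free string, so every candidate has to
# scan the whole thing.
sample = 'x' * 1000

def run(number=1000):
    """Return {label: microseconds per pass} for each candidate."""
    results = {}
    for label, func in candidates.items():
        total = timeit.timeit(lambda: func(sample), number=number)
        results[label] = total / number * 1e6
    return results

if __name__ == '__main__':
    for label, usec in run().items():
        print('{:7.2f} usec/pass  <- {}'.format(usec, label))
```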
History
Date User Action Args
2012-09-16 05:28:07 ezio.melotti set recipients: + ezio.melotti, loewis, barry, orsenthil, nadeem.vawda, r.david.murray, rosslagerwall
2012-09-16 05:28:07 ezio.melotti set messageid: <1347773287.07.0.254860780321.issue11454@psf.upfronthosting.co.za>
2012-09-16 05:28:06 ezio.melotti link issue11454 messages
2012-09-16 05:28:06 ezio.melotti create