Message170549
Given that high surrogates are U+D800..U+DBFF, and low ones are U+DC00..U+DFFF, '([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)' means "a low surrogate, preceded by either a character that is not a high surrogate or the beginning of the string, and followed by either a character that is not a low surrogate or the end of the string".
PEP 383 says "With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF".
If I change the regex to _has_surrogates = re.compile('[\udc80-\udcff]').search, the tests still pass but there's no improvement on startup time (note: the previous regex was matching all the surrogates in this range too, however I'm not sure how well this is tested).
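To make the difference between the two patterns concrete, here is a minimal sketch (not part of the attached patch) that decodes a non-decodable byte with the surrogateescape error handler and runs both regexes over a few strings. The names current_regex and short_regex are borrowed from the benchmark labels, and the sample strings are made up for illustration.

```python
import re

# Patterns quoted in the message. The A and Z anchors are written with a
# doubled backslash so the regex engine, not the string literal, sees them.
current_regex = re.compile(
    '([^\ud800-\udbff]|\\A)[\udc00-\udfff]([^\udc00-\udfff]|\\Z)')
short_regex = re.compile('[\udc80-\udcff]')

# A non-decodable byte >= 128 becomes a lone surrogate in U+DC80..U+DCFF.
escaped = b'caf\xe9'.decode('ascii', 'surrogateescape')

# Hypothetical inputs, chosen to show where the two patterns agree or differ.
samples = {
    'plain ASCII': 'cafe',
    'surrogateescape result': escaped,
    'low surrogate below U+DC80': 'x\udc00y',
    'high + low as two code points': '\ud800\udc80',
}

def compare(samples):
    """Return {label: (current_matches, short_matches)} for each sample."""
    return {
        label: (current_regex.search(s) is not None,
                short_regex.search(s) is not None)
        for label, s in samples.items()
    }

results = compare(samples)
for label, (cur, short) in results.items():
    print(f'{label}: current={cur} short={short}')
```

Both patterns agree on the plain string and on the surrogateescape result. They disagree on the last two samples, which fall outside what surrogateescape decoding produces; whether such strings matter for the callers of _has_surrogates is not something this sketch establishes.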
If I change the implementation with
_pep383_surrogates = set(map(chr, range(0xDC80, 0xDCFF+1)))
def _has_surrogates(s):
    return any(c in _pep383_surrogates for c in s)
the tests still pass and the startup is ~15ms faster here:
$ time ./python -m issue11454_imp2
[68837 refs]
real 0m0.305s
user 0m0.288s
sys 0m0.012s
However, using this function instead of the current regex is ~10x slower at runtime. Using the shorter regex is ~7x faster than the current one, but there is no improvement in startup time.
Assuming the shorter regex is correct, it can still be called inside a function or used with functools.partial, so that it is not compiled at import time. This will result in an improved startup time and a ~2x improvement at runtime over the current regex (so it's a win-win).
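The two deferred variants could look roughly like the sketch below. This is an illustration under the assumption that the shorter regex is the right one; the names SHORT_REGEX, has_surrogates_func and has_surrogates_partial are made up here and are not taken from the patch.

```python
import functools
import re

# Assumed pattern: the shorter regex discussed in the message.
SHORT_REGEX = '[\udc80-\udcff]'

def has_surrogates_func(s):
    # re.search() compiles the pattern on first use and keeps it in the
    # re module's internal cache, so nothing is compiled at import time.
    return re.search(SHORT_REGEX, s) is not None

# Same idea without a def: the pattern is bound now, compiled on first call.
# Like re.search, this returns a match object or None.
has_surrogates_partial = functools.partial(re.search, SHORT_REGEX)

if __name__ == '__main__':
    text = b'abc\xff'.decode('ascii', 'surrogateescape')
    print(has_surrogates_func(text))
    print(has_surrogates_partial('abc') is None)
```

The price of deferring compilation is a cache lookup on every call, which is consistent with these variants being slower than a precompiled search but still faster than the current regex in the sample results.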
See attached patch for benchmarks.
This is a sample result:
17.01 usec/pass <- re.compile(current_regex).search
2.20 usec/pass <- re.compile(short_regex).search
148.18 usec/pass <- return any(c in surrogates for c in s)
106.35 usec/pass <- for c in s: if c in surrogates: return True
8.40 usec/pass <- return re.search(short_regex, s)
8.20 usec/pass <- functools.partial(re.search, short_regex)
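For readers without the attached patch, a comparison along these lines can be approximated with timeit. This is a rough sketch, not the benchmark from the patch: the input string, the repeat count and the bench helper are assumptions, and the numbers it prints will differ from the sample result above.

```python
import functools
import re
import timeit

CURRENT = '([^\ud800-\udbff]|\\A)[\udc00-\udfff]([^\udc00-\udfff]|\\Z)'
SHORT = '[\udc80-\udcff]'
surrogates = set(map(chr, range(0xDC80, 0xDCFF + 1)))

def any_in_set(s):
    return any(c in surrogates for c in s)

def loop_in_set(s):
    for c in s:
        if c in surrogates:
            return True
    return False

def search_in_func(s):
    return re.search(SHORT, s)

# Labels mirror the ones in the sample result.
candidates = {
    're.compile(current_regex).search': re.compile(CURRENT).search,
    're.compile(short_regex).search': re.compile(SHORT).search,
    'return any(c in surrogates for c in s)': any_in_set,
    'for c in s: if c in surrogates: return True': loop_in_set,
    'return re.search(short_regex, s)': search_in_func,
    'functools.partial(re.search, short_regex)':
        functools.partial(re.search, SHORT),
}

def bench(sample, number=200):
    """Return {label: microseconds per call} for every candidate."""
    results = {}
    for label, func in candidates.items():
        total = timeit.timeit(lambda: func(sample), number=number)
        results[label] = total / number * 1e6
    return results

if __name__ == '__main__':
    # Assumed input: a string with no surrogates, so each candidate has to
    # scan all of it. The input used in the original benchmark is not shown.
    for label, usec in bench('x' * 1000).items():
        print(f'{usec:8.2f} usec/pass <- {label}')
```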
Date | User | Action | Args
2012-09-16 05:28:07 | ezio.melotti | set | recipients: + ezio.melotti, loewis, barry, orsenthil, nadeem.vawda, r.david.murray, rosslagerwall
2012-09-16 05:28:07 | ezio.melotti | set | messageid: <1347773287.07.0.254860780321.issue11454@psf.upfronthosting.co.za>
2012-09-16 05:28:06 | ezio.melotti | link | issue11454 messages
2012-09-16 05:28:06 | ezio.melotti | create |