Message320454
The benchmark may not be triggering that much work. NFKC normalization only matters for characters outside the ASCII range (code points 0-127); pure-ASCII identifiers are already in normal form.
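As a quick illustration of what NFKC does here (a minimal sketch using the stdlib `unicodedata` module): compatibility characters such as the 'ﬁ' ligature or fullwidth letters fold down to their ASCII equivalents, while a plain ASCII identifier passes through unchanged.

```python
import unicodedata

# Compatibility characters are folded by NFKC:
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"   # 'ﬁle' -> 'file'
assert unicodedata.normalize("NFKC", "\uff41bc") == "abc"    # fullwidth 'a'

# A pure-ASCII identifier is its own NFKC form:
assert unicodedata.normalize("NFKC", "plain_name") == "plain_name"
```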
I ran the benchmarks below and saw a huge difference. Granted, it's a very degenerate case with collections this big, but the cost appears to grow linearly with len(NAMES), suggesting that the normalization itself is the expensive part.
>>> import random, timeit, unicodedata
>>> CHRS = [c for c in (chr(i) for i in range(65535)) if c.isidentifier()]
>>> def makename():
...     return ''.join(random.choice(CHRS) for _ in range(10))
...
>>> NAMES = [makename() for _ in range(10000)]
>>> timeit.timeit('len(set(NAMES))', globals=globals(), number=100000)
38.04007526000004
>>> timeit.timeit('len(set(unicodedata.normalize("NFKC", n) for n in NAMES))', globals=globals(), number=100000)
820.2586788580002
I wonder if it would be better to catch the SyntaxError and do the check there? That way there's essentially no performance impact, since the normalization only happens in exceptional cases anyway.
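The deferred check suggested above might be sketched like this (a hypothetical illustration, not the actual compiler change; `lookup_with_deferred_check` and the dict-based namespace are illustrative stand-ins for the real lookup machinery):

```python
import unicodedata

def lookup_with_deferred_check(namespace, name):
    """Try the fast path first; pay for NFKC only on the error path."""
    try:
        return namespace[name]          # fast path: no normalization
    except KeyError:
        # Only now normalize, in case the name was spelled with
        # compatibility characters that NFKC would fold together.
        folded = unicodedata.normalize("NFKC", name)
        if folded != name and folded in namespace:
            return namespace[folded]
        raise

ns = {"file": 1}
lookup_with_deferred_check(ns, "file")       # fast path, no normalize call
lookup_with_deferred_check(ns, "\ufb01le")   # slow path: 'ﬁle' folds to 'file'
```

The point of the design is that the common case (the name exists as written) never touches `unicodedata` at all.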
Date | User | Action | Args
2018-06-25 23:21:33 | steve.dower | set | recipients: + steve.dower, eric.smith, valer
2018-06-25 23:21:33 | steve.dower | set | messageid: <1529968893.68.0.56676864532.issue33881@psf.upfronthosting.co.za>
2018-06-25 23:21:33 | steve.dower | link | issue33881 messages
2018-06-25 23:21:33 | steve.dower | create |