Message320454
The benchmark may not be triggering that much work. NFKC normalization only matters for characters outside the ASCII range (code points 0-127); pure-ASCII identifiers are already in normal form.
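As a quick illustration of what NFKC does here (a minimal sketch using the stdlib `unicodedata` module): compatibility characters such as the 'ﬁ' ligature or fullwidth letters fold down to their ASCII equivalents, while a plain ASCII identifier passes through unchanged.

```python
import unicodedata

# Compatibility characters are folded by NFKC:
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"   # 'ﬁle' -> 'file'
assert unicodedata.normalize("NFKC", "\uff41bc") == "abc"    # fullwidth 'a'

# A pure-ASCII identifier is its own NFKC form:
assert unicodedata.normalize("NFKC", "plain_name") == "plain_name"
```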
I ran the benchmarks below and saw a huge difference. Granted, it's a very degenerate case with collections this big, but the cost appears to grow linearly with len(NAMES), suggesting that the normalization itself is the expensive part.
>>> import random, timeit, unicodedata
>>> CHRS = [c for c in (chr(i) for i in range(65535)) if c.isidentifier()]
>>> def makename():
...     return ''.join(random.choice(CHRS) for _ in range(10))
...
>>> NAMES = [makename() for _ in range(10000)]
>>> timeit.timeit('len(set(NAMES))', globals=globals(), number=100000)
38.04007526000004
>>> timeit.timeit('len(set(unicodedata.normalize("NFKC", n) for n in NAMES))', globals=globals(), number=100000)
820.2586788580002
I wonder if it would be better to catch the SyntaxError and do the check there? That way there's essentially no performance impact, since the normalization only happens in exceptional cases anyway.
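The deferred check suggested above might be sketched like this (a hypothetical illustration, not the actual compiler change; `lookup_with_deferred_check` and the dict-based namespace are illustrative stand-ins for the real lookup machinery):

```python
import unicodedata

def lookup_with_deferred_check(namespace, name):
    """Try the fast path first; pay for NFKC only on the error path."""
    try:
        return namespace[name]          # fast path: no normalization
    except KeyError:
        # Only now normalize, in case the name was spelled with
        # compatibility characters that NFKC would fold together.
        folded = unicodedata.normalize("NFKC", name)
        if folded != name and folded in namespace:
            return namespace[folded]
        raise

ns = {"file": 1}
lookup_with_deferred_check(ns, "file")       # fast path, no normalize call
lookup_with_deferred_check(ns, "\ufb01le")   # slow path: 'ﬁle' folds to 'file'
```

The point of the design is that the common case (the name exists as written) never touches `unicodedata` at all.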
Date | User | Action | Args
2018-06-25 23:21:33 | steve.dower | set | recipients: + steve.dower, eric.smith, valer
2018-06-25 23:21:33 | steve.dower | set | messageid: <1529968893.68.0.56676864532.issue33881@psf.upfronthosting.co.za>
2018-06-25 23:21:33 | steve.dower | link | issue33881 messages
2018-06-25 23:21:33 | steve.dower | create |