
Author steve.dower
Recipients eric.smith, steve.dower, valer
Date 2018-06-25.23:21:33
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1529968893.68.0.56676864532.issue33881@psf.upfronthosting.co.za>
In-reply-to
Content
The benchmark may not be triggering that much work: NFKC normalization only applies to characters outside the basic Latin range (0-255).
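
As a quick illustration (not part of the benchmark below), NFKC leaves a pure-ASCII name untouched but folds compatibility characters such as the "fi" ligature:

>>> import unicodedata
>>> unicodedata.normalize("NFKC", "spam") == "spam"   # pure-ASCII identifier: unchanged
True
>>> unicodedata.normalize("NFKC", "ﬁle")              # U+FB01 LATIN SMALL LIGATURE FI folds to "fi"
'file'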

I ran the benchmarks below and saw a huge difference. Granted, it's a very degenerate case with a collection this big, but the cost appears to be linear in len(NAMES), suggesting that the normalization is the expensive part (a rough scaling check is sketched after the timings).

>>> import random, timeit, unicodedata
>>> CHRS = [c for c in (chr(i) for i in range(65535)) if c.isidentifier()]
>>> def makename():
...  return ''.join(random.choice(CHRS) for _ in range(10))
...
>>> NAMES = [makename() for _ in range(10000)]
>>> timeit.timeit('len(set(NAMES))', globals=globals(), number=100000)
38.04007526000004
>>> timeit.timeit('len(set(unicodedata.normalize("NFKC", n) for n in NAMES))', globals=globals(), number=100000)
820.2586788580002
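
A rough way to sanity-check the linearity claim (a sketch only, reusing NAMES from the session above; I'm not reporting timings for it here) is to repeat the normalization pass over smaller slices and see whether the time drops roughly in proportion:

# Sketch: if normalization dominates, the reported times should shrink
# roughly in proportion to the slice size.
for size in (1000, 5000, 10000):
    subset = NAMES[:size]
    t = timeit.timeit(
        'len(set(unicodedata.normalize("NFKC", n) for n in subset))',
        globals=globals(),
        number=1000,
    )
    print(size, t)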

I wonder if it's better to catch the SyntaxError and do the check there? That way we don't really have a performance impact, since it's only going to show up in exceptional cases anyway.
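
Roughly the shape of that idea, as a pure-Python sketch rather than the actual compiler change (the helper name and the dict-based namespace are just placeholders): try the plain lookup first, and only pay for NFKC normalization when it fails.

import unicodedata

def lookup(namespace, name):
    """Resolve `name` in `namespace`, normalizing only on the failure path."""
    try:
        # Fast path: most names are already in normalized (usually ASCII) form.
        return namespace[name]
    except KeyError:
        # Slow path: only reached for the exceptional, unnormalized spellings.
        normalized = unicodedata.normalize("NFKC", name)
        if normalized != name:
            return namespace[normalized]
        raise

With a namespace like {"file": 1}, looking up the ligature spelling "ﬁle" still resolves, but only names that miss on the first try pay the normalization cost.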
History
Date                 User         Action  Args
2018-06-25 23:21:33  steve.dower  set     recipients: + steve.dower, eric.smith, valer
2018-06-25 23:21:33  steve.dower  set     messageid: <1529968893.68.0.56676864532.issue33881@psf.upfronthosting.co.za>
2018-06-25 23:21:33  steve.dower  link    issue33881 messages
2018-06-25 23:21:33  steve.dower  create