Message 300257 - Python tracker

Author	mrabarnett
Recipients	David MacIver, mrabarnett, tomviner
Date	2017-08-14.17:57:36
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1502733456.99.0.422932492964.issue31193@psf.upfronthosting.co.za>
In-reply-to

Content
The re module works with codepoints; it doesn't understand canonical equivalence.

For example, it doesn't recognise that "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}" is equivalent to "\N{LATIN CAPITAL LETTER E WITH ACUTE}".
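
As a minimal sketch of that point (the variable names are illustrative, and normalising to NFC via unicodedata is one possible workaround, not something re does for you):

```python
import re
import unicodedata

decomposed = "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"
precomposed = "\N{LATIN CAPITAL LETTER E WITH ACUTE}"

# The two strings render alike but are different codepoint sequences.
same_codepoints = decomposed == precomposed

# re compares codepoints, so the precomposed pattern fails on decomposed text.
raw_match = re.fullmatch(precomposed, decomposed)

# Workaround (an assumption of this sketch): normalise both sides to NFC first.
def nfc(text):
    return unicodedata.normalize("NFC", text)

normalised_match = re.fullmatch(nfc(precomposed), nfc(decomposed))
```

Here raw_match is None, while normalised_match is a match object, because both sides collapse to the single precomposed codepoint.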

Python in general also compares strings by codepoint; the exception is identifiers, which are normalised (to NFKC) when the source is parsed:

>>> "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"
'É'
>>> É = 0
>>> "\N{LATIN CAPITAL LETTER E WITH ACUTE}"
'É'
>>> É
0
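
The same thing can be checked outside the interactive prompt. This is only a sketch: it uses exec and a scratch dict (both choices of this example, not part of the session above) so the stored name can be inspected:

```python
decomposed_name = "E\N{COMBINING ACUTE ACCENT}"          # two codepoints
precomposed_name = "\N{LATIN CAPITAL LETTER E WITH ACUTE}"  # one codepoint

# Compile an assignment whose target is spelled with the decomposed form.
namespace = {}
exec(decomposed_name + " = 0", namespace)

# The parser normalises the identifier, so the name is stored precomposed.
stored_precomposed = precomposed_name in namespace
stored_decomposed = decomposed_name in namespace

# Looking the name up with the precomposed spelling finds the value.
value = eval(precomposed_name, namespace)
```

The string literals keep their original codepoints; only the identifier in the compiled source is normalised.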

This also means that, say, '.' will match only one _codepoint_, not a base character together with its combining marks.
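
A short sketch of the '.' behaviour, using the same two spellings of the letter (the variable names are illustrative):

```python
import re

decomposed = "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"
precomposed = "\N{LATIN CAPITAL LETTER E WITH ACUTE}"

# '.' consumes exactly one codepoint, so it cannot cover the decomposed pair.
dot_on_decomposed = re.fullmatch(".", decomposed)
dot_on_precomposed = re.fullmatch(".", precomposed)

# Unanchored at the end, '.' takes only the base letter and leaves the accent.
first_codepoint = re.match(".", decomposed).group()

# Two dots are needed to consume the decomposed form.
two_dots = re.fullmatch("..", decomposed)
```

Here dot_on_decomposed is None, dot_on_precomposed matches, and first_codepoint is the bare letter "E" without its accent.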

History
Date	User	Action	Args
2017-08-14 17:57:37	mrabarnett	set	recipients: + mrabarnett, David MacIver, tomviner
2017-08-14 17:57:36	mrabarnett	set	messageid: <1502733456.99.0.422932492964.issue31193@psf.upfronthosting.co.za>
2017-08-14 17:57:36	mrabarnett	link	issue31193 messages
2017-08-14 17:57:36	mrabarnett	create