This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE
Type: Stage:
Components: Library (Lib) Versions: Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: David MacIver, mrabarnett, tomviner
Priority: normal Keywords:

Created on 2017-08-13 13:43 by David MacIver, last changed 2022-04-11 14:58 by admin.

Files
File name Uploaded Description Edit
casing.py David MacIver, 2017-08-13 13:43
Messages (3)
msg300219 - (view) Author: David MacIver (David MacIver) * Date: 2017-08-13 13:43
chr(304).lower() is a two character string - a lower case i followed by a combining chr(775) ('COMBINING DOT ABOVE').

The re module seems not to understand the combining character, and a regex compiled with IGNORECASE will erroneously match a single lower case i without the required combining character. The attached file demonstrates this. I've tested this on Python 3.6.1 with my locale as ('en_GB', 'UTF-8'). I don't know whether the locale matters for reproducing this, but I know it can affect how lower/upper work, so I'm including it for the sake of completeness.

The problem does not reproduce on Python 2.7.13 because in that case unichr(304).lower() is u'i' without the combining character, so it fails earlier.

This is presumably related to #12728, but as that is closed as fixed while this still reproduces I don't believe it's a duplicate.
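
For readers without the attachment, the report above can be checked with a minimal sketch along these lines. This is not the attached casing.py; the variable names are illustrative assumptions, and the two regex results are printed rather than asserted because they may depend on the Python version in use.

```python
import re
import unicodedata

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
capital_i_dot = chr(304)

# In Python 3, str.lower() applies the full case mapping, which yields
# two code points: 'i' followed by U+0307 COMBINING DOT ABOVE.
lowered = capital_i_dot.lower()
print(len(lowered))
print([unicodedata.name(c) for c in lowered])

# Compile the single capital letter as a case-insensitive pattern.
pattern = re.compile(capital_i_dot, re.IGNORECASE)

# The report says this matches a bare 'i' (no combining character).
matches_plain_i = pattern.fullmatch("i") is not None

# For comparison: does it match the actual two-code-point lower case?
matches_full_lower = pattern.fullmatch(lowered) is not None

print("matches plain 'i':", matches_plain_i)
print("matches chr(304).lower():", matches_full_lower)
```

If the first result is True and the second is False, the regex is treating 'i' as the lower case of chr(304) while str.lower() does not, which is the inconsistency this issue describes.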
msg300257 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2017-08-14 17:57
The re module works with codepoints; it doesn't understand canonical equivalence.

For example, it doesn't recognise that "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}" is equivalent to "\N{LATIN CAPITAL LETTER E WITH ACUTE}".

This is true for Python in general, except for identifiers, which are normalised:

>>> "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"
'É'
>>> É = 0
>>> "\N{LATIN CAPITAL LETTER E WITH ACUTE}"
'É'
>>> É
0

This also means that, say, '.' will match only 1 _codepoint_.
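
The codepoint-level behaviour described in this message can be illustrated with a short sketch. It uses unicodedata.normalize to show that the two spellings of É only compare and match as equal after explicit normalisation; the variable names are illustrative assumptions.

```python
import re
import unicodedata

decomposed = "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"
precomposed = "\N{LATIN CAPITAL LETTER E WITH ACUTE}"

# The two strings render the same but are different codepoint sequences.
print(len(decomposed), len(precomposed))   # 2 1
print(decomposed == precomposed)           # False

# re compares codepoints, so the precomposed pattern does not match
# the decomposed text.
print(re.fullmatch(precomposed, decomposed))   # None

# '.' consumes exactly one codepoint, so it cannot cover the
# two-codepoint decomposed form on its own.
print(re.fullmatch(".", decomposed))                   # None
print(re.fullmatch(".", precomposed) is not None)      # True

# Normalising both sides to NFC first makes them agree.
normalised = unicodedata.normalize("NFC", decomposed)
print(normalised == precomposed)                             # True
print(re.fullmatch(precomposed, normalised) is not None)     # True
```

Normalising the pattern and the subject to the same form before matching is one way to work around the lack of canonical equivalence, though it does not by itself address the case-mapping question raised in this issue.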
msg300258 - (view) Author: David MacIver (David MacIver) * Date: 2017-08-14 18:03
Sure, but 'i' is a single code point. The bug is that the regex matches 'i', not that it doesn't match the actual two codepoint lower case of the string.
History
Date User Action Args
2022-04-11 14:58:50 admin set github: 75376
2017-08-14 18:03:49 David MacIver set messages: + msg300258
2017-08-14 17:57:36 mrabarnett set nosy: + mrabarnett; messages: + msg300257
2017-08-13 23:05:56 tomviner set nosy: + tomviner
2017-08-13 13:43:47 David MacIver create