
classification
Title: re.LOCALE doesn't reflect locale.setlocale(...)
Type: behavior
Stage: resolved
Components: Regular Expressions, Unicode
Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, mrabarnett, r.david.murray, vbr
Priority: normal
Keywords:

Created on 2011-04-03 01:51 by vbr, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (4)
msg132826 - Author: Vlastimil Brom (vbr) Date: 2011-04-03 01:51
Hi,
I just noticed a behaviour of the re.LOCALE flag that I can't understand. I first reported it against the new regex implementation, which, however, only mimics the standard library re in this case:
http://code.google.com/p/mrab-regex-hg/issues/detail?id=6
I also couldn't find anything relevant in the tracker other than some older, already fixed issues; I'm sorry if I missed something.
I thought the search pattern (?L)\w would match any of the respective string.letters for the current locale (and possibly [0-9_] in addition).

However, the locale doesn't seem to be reflected in the expected way.

>>> unicode_BMP = " " + "".join(unichr(i) for i in range(1, 0x10000))
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> import re
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzŠŒŽšœžŸ£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzƒ¢²³µ¸¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ
>>> 

>>> import string
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> print unicode(string.letters, "windows-1250")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzŠŚŤŽŹšśťžźŁĄŞŻłµąşĽľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print unicode(string.letters, "windows-1253")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒΆµΈΉΊΌΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
>>> 

It seems that the letter set nearest to the result of the re/regex LOCALE flag might be ASCII or the US locale:

>>> locale.setlocale(locale.LC_ALL, "US")
'English_United States.1252'
>>> print unicode(string.letters, "windows-1252")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
>>> 

However, there are some differences too, namely between zƒ and À:
re (?L)\w : 
Czech
zŠŒŽšœžŸ£¥ª¯³µ¹º¼¾¿À
Greek
zƒ¢²³µ¸¹º¼¾¿À
string.letters -- US locale
zƒŠŒŽšœžŸªµºÀ
(as displayed in tkinter Idle shell)
(In either case, there are some items one wouldn't consider usual word characters, cf. ¿.)

I am not sure whether other issues are involved (such as encoding or display peculiarities in Tkinter), but re matching with the LOCALE flag doesn't reflect locale.setlocale(...) in a transparent way.

Is it supposed to work this way, and is there another way to get the locale-aware matching one might expect according to:
http://docs.python.org/library/re.html#re.LOCALE
"""
Make \w, \W, \b, \B, \s and \S dependent on the current locale.
"""


Using Python 2.7.1, 32-bit; Windows 7 Home Premium 64-bit, Czech.

In Python 3.1.3 as well as 3.2, the result is the same (with the appropriately modified code): ...
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> import re
>>> print("".join(re.findall(r"(?L)\w", unicode_BMP)))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzŠŒŽšœžŸ£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> 

However, in Python 3, there is no comparison with string.letters available anymore.
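For a rough comparison in Python 3, one option is to decode every byte of a single-byte code page and keep the characters that Unicode classifies as alphabetic. This is only a sketch: the helper name letters_in_codepage is hypothetical, and it relies on str.isalpha() rather than on the C locale, so it need not agree with what string.letters returned in Python 2.

```python
def letters_in_codepage(encoding):
    """Return the alphabetic characters of a single-byte encoding.

    Assumption: "alphabetic" means str.isalpha(), i.e. the Unicode
    letter categories, not whatever the C locale reports.
    """
    chars = []
    for i in range(256):
        try:
            ch = bytes([i]).decode(encoding)
        except UnicodeDecodeError:
            # Some byte values are undefined in a given code page.
            continue
        if ch.isalpha():
            chars.append(ch)
    return "".join(chars)


czech = letters_in_codepage("cp1250")
greek = letters_in_codepage("cp1253")
western = letters_in_codepage("cp1252")
print(czech)
print(greek)
print(western)
```

With this approach the Czech and Greek code pages give clearly different letter sets, and characters such as ¿ are left out.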

Regards,
    Vlastimil Brom
msg132844 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-03 15:27
I don't know what re is doing with respect to locale, but I do know that the implementation of string.letters is at least somewhat broken in 2.x.  It has no useful meaning in unicode, which is why it doesn't exist in 3.x.

A standard that talks about regex and locale is here:

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

I don't know enough about locale or regex to comment further, but from the perspective of what I know about current developer resources and focus I would say that if anything is going to be changed, it would be by mrabarnett in the new engine.  Unless mrab (or you?) does it, the old engine is unlikely to be touched at this point.
msg132850 - Author: Vlastimil Brom (vbr) Date: 2011-04-03 16:08
Thanks for the comment on string.letters and the further reference.
Given that Mr. Barnett mentioned in his regex tracker ( http://code.google.com/p/mrab-regex-hg/issues/detail?id=6 ) that he only supports the LOCALE flag for compatibility with re, and given my zero knowledge of C, I suppose we will live with the status quo.
I guess that if there were a well-defined source of "letters" for the given locales, the implementation wouldn't necessarily have to be that complex (in the context of the regex code), but as there is probably no agreement in this respect (if string.letters is questionable), it becomes pointless.
After all, one can define the needed regex pattern manually, and mrab's regex library makes that much easier thanks to its support for Unicode properties and other features.
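A hand-written character class for one language might look like the following minimal sketch. The CZECH_LETTERS constant and the sample sentence are assumptions made for illustration, not something provided by re or regex; only the standard library re module is used.

```python
import re

# Assumption: an illustrative list of Czech letters with diacritics;
# adjust it to whatever "letters" should mean for the task at hand.
CZECH_LETTERS = "áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"

# A manually defined "word" pattern: ASCII letters plus the list above.
czech_word = re.compile("[A-Za-z" + CZECH_LETTERS + "]+")

text = "Příliš žluťoučký kůň_42 ¿ αβγ"
words = czech_word.findall(text)
print(words)  # ['Příliš', 'žluťoučký', 'kůň']
```

The digits, the underscore, ¿ and the Greek letters are skipped because they are not in the explicit class, regardless of the current locale.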

vbr
msg132891 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-04-03 21:53
Yeah, as far as I could tell from a brief scan of Google hits, locale support in regex in general is a legacy thing, and the "correct" thing to do is to use Unicode properties.  So I'll close this as won't fix.  If someone comes along with the motivation to fix it, it can always be reopened.
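As a sketch of what filtering by Unicode properties can look like with only the standard library, the snippet below keeps the characters whose general category starts with "L" (letter). The helper names is_letter and letters_only are hypothetical; the sample string is the Greek-locale excerpt quoted in the first message.

```python
import unicodedata


def is_letter(ch):
    """True if ch belongs to a Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def letters_only(text):
    """Keep only the characters that Unicode classifies as letters."""
    return "".join(ch for ch in text if is_letter(ch))


sample = "zƒ¢²³µ¸¹º¼¾¿À"
filtered = letters_only(sample)
print(filtered)  # zƒµºÀ
```

Currency signs, superscript digits, fractions and ¿ are dropped, since their categories are not letter categories, independently of any locale setting.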
History
Date                 User            Action  Args
2022-04-11 14:57:15  admin           set     github: 55953
2011-04-03 21:53:59  r.david.murray  set     status: open -> closed; resolution: wont fix; messages: + msg132891; stage: resolved
2011-04-03 16:08:16  vbr             set     messages: + msg132850
2011-04-03 15:27:30  r.david.murray  set     nosy: + r.david.murray; messages: + msg132844
2011-04-03 01:56:38  ezio.melotti    set     nosy: + ezio.melotti, mrabarnett
2011-04-03 01:51:11  vbr             create