Issue20049
Created on 2013-12-21 21:38 by Alexander.Pyhalov, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Messages (11) | |||
---|---|---|---|
msg206786 - (view) | Author: Alexander Pyhalov (Alexander.Pyhalov) | Date: 2013-12-21 21:38 | |
When Python 2.6 (or 2.7) is compiled with _XOPEN_SOURCE=600 on illumos, string.lowercase and string.uppercase contain garbage when a UTF-8 locale is used. (OpenIndiana bug report - https://www.illumos.org/issues/4411 ). The reason is that with a UTF-8 locale, islower()/isupper() and similar functions are not expected to work with non-ASCII symbols. So, code like: n = 0; for (c = 0; c < 256; c++) { if (islower(c)) buf[n++] = c; } is expected to fail, because it calls islower on illegal UTF-8 symbols (with codes 128-255). It should be converted to something like: n = 0; for (c = 0; c < 256; c++) { if (isascii(c) && islower(c)) buf[n++] = c; } or to: n = 0; for (c = 0; c < 128; c++) { if (islower(c)) buf[n++] = c; } Before doing this you should check whether the locale is UTF-8. However, almost all non-C locales on illumos are UTF-8. Example of incorrect behavior: Python 2.6.9 (unknown, Nov 12 2013, 13:54:48) [GCC 4.7.3] on sunos5 Type "help", "copyright", "credits" or "license" for more information. >>> import string >>> string.lowercase 'abcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff' >>> string.uppercase 'ABCDEFGHIJKLMNOPQRSTUVWXYZ\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde' >>> |
|||
msg206794 - (view) | Author: R. David Murray (r.david.murray) * | Date: 2013-12-21 23:21 | |
In python2, string.lowercase and string.uppercase are locale dependent. This isn't really all that useful in practice, which is why it was dropped in Python3. The proposed fix might be correct, *if* utf-8 is checked for (see, eg, Issue 6525), but...do you have any idea why this is a problem on illumos with _XOPEN_SOURCE=600 but not on any other platform (as far as we know)? It seems like it would be a bug in the platform's islower and isupper functions, which are supposed to operate on integers that fit in an unsigned char, and be locale aware, according to the standards. |
|||
msg206795 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2013-12-21 23:25 | |
> The reason is that with UTF-8 locale islower()/isupper() and similar > functions are not expected to work with non-ascii symbols. Can you explain why? |
|||
msg206808 - (view) | Author: Alexander Pyhalov (Alexander.Pyhalov) | Date: 2013-12-22 07:00 | |
Honestly, I don't understand locale-related matters well enough. But I received this explanation when discussing a similar issue on the illumos developers mailing list: http://comments.gmane.org/gmane.os.illumos.devel/14193 |
|||
msg206809 - (view) | Author: Alexander Pyhalov (Alexander.Pyhalov) | Date: 2013-12-22 07:31 | |
I've discussed this once more. From the islower man page: RETURN VALUES If the argument to any of the character handling macros is not in the domain of the function, the result is undefined. And (char)128-255 are not legal UTF-8 (at least from what I see on Wikipedia: http://en.wikipedia.org/wiki/UTF-8 ). |
|||
msg206814 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2013-12-22 14:37 | |
> I've discussed this once more. > > From islower man page: > > RETURN VALUES > If the argument to any of the character handling macros is > not in the domain of the function, the result is undefined. This is not the wording of the POSIX spec: http://pubs.opengroup.org/onlinepubs/9699919799/functions/islower.html """The c argument is an int, the value of which the application shall ensure is a character representable as an unsigned char or equal to the value of the macro EOF.""" This means that any value between 0 and 255 ("representable as an unsigned char") is a valid input for islower(). This would mean IllumOS deviates from the POSIX spec here. I would suggest either fixing your libc's ctype.h implementation, and/or patching your version of Python to work around this issue. Note the ISO C99 standard has the same wording as POSIX: """The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF.""" (Note also that under Linux and most likely other Unices, string.lowercase and string.uppercase work fine under a UTF-8 locale) |
|||
msg206816 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2013-12-22 14:48 | |
To elaborate yet a bit, I agree with the following statement in the aforementioned [illumos-devel] discussion thread: """In further explanation, the isalpha() and friends *should* probably return false for the value 196, or any other byte with high order bit set, in UTF-8 locales.""" http://thread.gmane.org/gmane.os.illumos.devel/14193/focus=14206 I'll also point out that the code examples in the POSIX spec use islower() exactly like Python does (on arbitrary integers) between 0 and 255: http://pubs.opengroup.org/onlinepubs/9699919799/functions/islower.html c = (unsigned char) (rand() % 256); ... if (islower(c)) keystr[len++] = c; } ... |
|||
msg206817 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2013-12-22 15:03 | |
As to whether we will add a workaround for this in Python: - Python follows POSIX correctly here, and no issue was reported in mainstream OSes such as Linux, OS X or the *BSDs - this only exists in 2.7, which is in extended maintenance mode (it's the last of the 2.x series, and will probably stop being maintained in a few years); Python 3.x doesn't have this issue - IllumOS is a rather niche OS that none of us is using, so adding a system-specific workaround doesn't sound very compelling Thanks for reporting, though. It's good to be reminded that locales and ctype.h are a rather lousy design :-) |
|||
msg206818 - (view) | Author: Stefan Krah (skrah) * | Date: 2013-12-22 15:34 | |
Alexander, the "domain of the function" probably refers to the range [-1, 255]. C99: ==== The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined. 2 The behavior of these functions is affected by the current locale. Those functions that have locale-specific aspects only when not in the "C" locale are noted below. 3 The term printing character refers to a member of a locale-specific set of characters, each of which occupies one printing position on a display device; the term control character refers to a member of a locale-specific set of characters that are not printing characters. All letters and digits are printing characters. Forward references: EOF (7.19.1), localization (7.11). 7.4.1 Character classification functions 1 The functions in this subclause return nonzero (true) if and only if the value of the argument c conforms to that in the description of the function. I think this agrees with what Antoine has said. |
|||
msg206819 - (view) | Author: Stefan Krah (skrah) * | Date: 2013-12-22 15:35 | |
IOW, I also support closing this issue. :) |
|||
msg206820 - (view) | Author: R. David Murray (r.david.murray) * | Date: 2013-12-22 16:03 | |
Yes, I definitely think this falls into the category of platform bugs, and we only maintain workarounds for those for "mainstream" OSes. Others need to maintain their own local patches, just as for any other changes that are required to get Python working on those platforms. (A platform's status can change over time, of course, but this is the category illumos currently falls into.) |
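
For readers who maintain a local patch, the second variant proposed in msg206786 (looping only over byte values 0..127) can be sketched roughly as follows. This is an illustrative sketch, not CPython's code: the names `collect_ascii_lowercase` and `check_collect_ascii_lowercase` are hypothetical, and the bounds check on the buffer is an addition that the snippets in the thread do not have. The sketch never calls setlocale(), so it runs in the "C" locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from CPython): store the lowercase
 * letters found among byte values 0..127 in buf, NUL-terminate the
 * result, and return how many letters were stored.
 *
 * Because the loop stops at 127, islower() is never asked about the
 * bytes 128..255 that the thread is arguing about.
 */
static size_t collect_ascii_lowercase(char *buf, size_t bufsize)
{
    size_t n = 0;
    int c;

    for (c = 0; c < 128; c++) {
        /* Keep one byte free for the terminating NUL. */
        if (islower(c) && n + 1 < bufsize)
            buf[n++] = (char)c;
    }
    if (bufsize > 0)
        buf[n] = '\0';
    return n;
}

/* Hypothetical self-check: in the "C" locale only a-z are lowercase. */
static void check_collect_ascii_lowercase(void)
{
    char buf[64];
    size_t n = collect_ascii_lowercase(buf, sizeof buf);

    assert(n == 26);
    assert(strcmp(buf, "abcdefghijklmnopqrstuvwxyz") == 0);
}
```

A patch along these lines would presumably apply the same restriction to the uppercase table, and only when the active locale is UTF-8, as msg206786 notes.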
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:55 | admin | set | github: 64248 |
2013-12-22 16:03:05 | r.david.murray | set | status: open -> closed resolution: rejected messages: + msg206820 stage: resolved |
2013-12-22 15:35:17 | skrah | set | messages: + msg206819 |
2013-12-22 15:34:02 | skrah | set | nosy: + skrah messages: + msg206818 |
2013-12-22 15:03:51 | pitrou | set | messages: + msg206817 |
2013-12-22 14:48:53 | pitrou | set | messages: + msg206816 |
2013-12-22 14:37:02 | pitrou | set | messages: + msg206814 |
2013-12-22 07:31:05 | Alexander.Pyhalov | set | messages: + msg206809 |
2013-12-22 07:00:53 | Alexander.Pyhalov | set | messages: + msg206808 |
2013-12-21 23:25:13 | pitrou | set | nosy: + pitrou messages: + msg206795 |
2013-12-21 23:21:36 | r.david.murray | set | nosy: + r.david.murray messages: + msg206794 |
2013-12-21 21:38:37 | Alexander.Pyhalov | create |