This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: string.lowercase and string.uppercase can contain garbage
Type: behavior
Stage: resolved
Components: Unicode
Versions: Python 2.7

process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: Alexander.Pyhalov, ezio.melotti, pitrou, r.david.murray, skrah, vstinner
Priority: normal
Keywords:

Created on 2013-12-21 21:38 by Alexander.Pyhalov, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (11)
msg206786 - (view) Author: Alexander Pyhalov (Alexander.Pyhalov) Date: 2013-12-21 21:38
When Python 2.6 (or 2.7) is compiled with _XOPEN_SOURCE=600 on illumos, string.lowercase and string.uppercase contain garbage when a UTF-8 locale is used
(OpenIndiana bug report: https://www.illumos.org/issues/4411 ).
The reason is that with UTF-8 locale islower()/isupper() and similar functions are not expected to work with non-ascii symbols. 
So, code like 

    n = 0;
    for (c = 0; c < 256; c++) {
        if (islower(c))
            buf[n++] = c;
    }

is expected to fail, because it calls islower on illegal UTF-8 symbols (with codes 128-255). It should be converted to something like

    n = 0;
    for (c = 0; c < 256; c++) {
        if (isascii(c) && islower(c))
            buf[n++] = c;
    }

or to 

    n = 0;
    for (c = 0; c < 128; c++) {
        if (islower(c))
            buf[n++] = c;
    }

Before doing this you should check whether the locale is UTF-8. However, almost all non-C locales on illumos are UTF-8.
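
The proposal above (probe only 0..127 when the locale is UTF-8) can be modelled in a few lines of Python. This is only a sketch of the idea, not the CPython code: `is_utf8_codeset`, `collect_bytes` and `fake_islower` are hypothetical names, and `fake_islower` is an assumed stand-in for a libc islower() that also accepts some high bytes.

```python
import codecs


def is_utf8_codeset(codeset):
    """Return True if the codeset name refers to UTF-8 (e.g. "UTF-8", "utf8")."""
    try:
        return codecs.lookup(codeset).name == "utf-8"
    except LookupError:
        # Unknown codeset names are treated as "not UTF-8".
        return False


def collect_bytes(predicate, codeset):
    """Mimic the C loop: gather byte values accepted by a ctype-style predicate.

    In a UTF-8 locale only 0..127 are probed, as the report suggests.
    """
    limit = 128 if is_utf8_codeset(codeset) else 256
    return bytes(c for c in range(limit) if predicate(c))


def fake_islower(c):
    # Hypothetical stand-in for a libc islower() that also accepts 0xE0..0xFF.
    return ord("a") <= c <= ord("z") or c >= 0xE0


# UTF-8 locale: only the ASCII letters survive.
print(collect_bytes(fake_islower, "UTF-8"))
# Single-byte locale: the high bytes accepted by the predicate are kept too.
print(len(collect_bytes(fake_islower, "ISO-8859-1")))
```

In real code the codeset name would have to come from the active locale; how to obtain it portably is left out of this sketch.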


Example of incorrect behavior: 

Python 2.6.9 (unknown, Nov 12 2013, 13:54:48) 
[GCC 4.7.3] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import string
>>> string.lowercase
'abcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> string.uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde'
>>>
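
To see what the extra bytes in string.lowercase above are, they can be inspected from Python 3. The sketch below assumes (the report does not say this) that it is meaningful to read each byte value as an ISO-8859-1 character; it only shows the character names under that assumption, and that the same bytes do not form valid UTF-8 text.

```python
import unicodedata

# The non-ASCII tail of string.lowercase from the report:
# 0xAA, 0xB5, 0xBA, 0xDF, then 0xE0..0xF6 and 0xF8..0xFF (0xF7 is absent).
tail = bytes([0xAA, 0xB5, 0xBA, 0xDF, *range(0xE0, 0xF7), *range(0xF8, 0x100)])

# Assumption: interpret each byte as an ISO-8859-1 (Latin-1) character.
names = [unicodedata.name(ch) for ch in tail.decode("latin-1")]
for byte_value, name in zip(tail, names):
    print(hex(byte_value), name)

# The same byte string is not valid UTF-8, which is why it looks like
# garbage in a UTF-8 locale.
try:
    tail.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("valid UTF-8:", valid_utf8)
```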
msg206794 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-12-21 23:21
In Python 2, string.lowercase and string.uppercase are locale dependent.  This isn't really all that useful in practice, which is why it was dropped in Python 3.  The proposed fix might be correct, *if* UTF-8 is checked for (see, e.g., Issue 6525), but... do you have any idea why this is a problem on illumos with _XOPEN_SOURCE=600 but not on any other platform (as far as we know)?  It seems like it would be a bug in the platform's islower and isupper functions, which are supposed to operate on integers that fit in an unsigned char, and be locale aware, according to the standards.
msg206795 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-12-21 23:25
> The reason is that with UTF-8 locale islower()/isupper() and similar
> functions are not expected to work with non-ascii symbols. 

Can you explain why?
msg206808 - (view) Author: Alexander Pyhalov (Alexander.Pyhalov) Date: 2013-12-22 07:00
Honestly, I don't understand locale-related things well enough. But I
received this explanation when discussing a similar issue on the illumos
developers mailing list.
http://comments.gmane.org/gmane.os.illumos.devel/14193

2013/12/22 Antoine Pitrou <report@bugs.python.org>

>
> Antoine Pitrou added the comment:
>
> > The reason is that with UTF-8 locale islower()/isupper() and similar
> > functions are not expected to work with non-ascii symbols.
>
> Can you explain why?
>
> ----------
> nosy: +pitrou
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue20049>
> _______________________________________
>
msg206809 - (view) Author: Alexander Pyhalov (Alexander.Pyhalov) Date: 2013-12-22 07:31
I've discussed this once more. 

From islower man page:

RETURN VALUES
     If the argument to any of the character handling  macros  is
     not  in the domain of the function, the result is undefined.

And (char)128-255 are not legal UTF-8 on their own (at least from what I see on Wikipedia: http://en.wikipedia.org/wiki/UTF-8 ).
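
The claim about bytes 128-255 can be checked with Python 3's UTF-8 decoder. This is a minimal sketch; `decodes_alone` is a hypothetical helper name. It tests each byte value as a one-byte string, so it says nothing about whether islower() may be called on such values, only that none of them is a complete UTF-8 sequence by itself.

```python
def decodes_alone(byte_value):
    """Return True if this single byte is a complete, valid UTF-8 sequence."""
    try:
        bytes([byte_value]).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


standalone_ok = [c for c in range(256) if decodes_alone(c)]
standalone_bad = [c for c in range(256) if not decodes_alone(c)]

# Only the ASCII range decodes on its own; every byte in 128..255 fails.
print(standalone_ok == list(range(128)))
print(standalone_bad == list(range(128, 256)))

# Such bytes are still legal as parts of a multi-byte sequence:
print(b"\xc3\xa4".decode("utf-8") == "\u00e4")
```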
msg206814 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-12-22 14:37
> I've discussed this once more. 
> 
> From islower man page:
> 
> RETURN VALUES
>      If the argument to any of the character handling  macros  is
>      not  in the domain of the function, the result is undefined.

This is not the wording of the POSIX spec:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/islower.html

"""The c argument is an int, the value of which the application shall
ensure is a character representable as an unsigned char or equal to the
value of the macro EOF."""

This means that any value between 0 and 255 ("representable as an
unsigned char") is a valid input for islower().

This would mean IllumOS deviates from the POSIX spec here. I would
suggest either fixing your libc's ctype.h implementation, and/or
patching your version of Python to work around this issue.

Note the ISO C99 standard has the same wording as POSIX:

"""The header <ctype.h> declares several functions useful for
classifying and mapping characters. In all cases the argument is an int,
the value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF."""

(Note also that under Linux and most likely other Unices,
string.lowercase and string.uppercase work fine under a UTF-8 locale)
msg206816 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-12-22 14:48
To elaborate a bit further, I agree with the following statement in the aforementioned [illumos-devel] discussion thread:

"""In further explanation, the isalpha() and friends *should* probably return false for the value 196, or any other byte with high order bit set, in UTF-8 locales."""
http://thread.gmane.org/gmane.os.illumos.devel/14193/focus=14206

I'll also point out that the code examples in the POSIX spec use islower() exactly like Python does (on arbitrary integers between 0 and 255):

http://pubs.opengroup.org/onlinepubs/9699919799/functions/islower.html

    c = (unsigned char) (rand() % 256);
...
    if (islower(c))
        keystr[len++] = c;
    }
...
msg206817 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-12-22 15:03
As to whether we will add a workaround for this in Python:

- Python follows POSIX correctly here, and no issue was reported in mainstream OSes such as Linux, OS X or the *BSDs

- this only exists in 2.7, which is in extended maintenance mode (it's the last of the 2.x series, and will probably stop being maintained in a few years); Python 3.x doesn't have this issue

- IllumOS is a rather niche OS that none of us is using, so adding a system-specific workaround doesn't sound very compelling

Thanks for reporting, though. It's good to be reminded that locales and ctype.h are a rather lousy design :-)
msg206818 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2013-12-22 15:34
Alexander, the "domain of the function" probably refers to
the range [-1, 255]: EOF (typically -1) plus every unsigned char value.

C99:
====

1 The header <ctype.h> declares several functions useful for classifying and mapping
characters. In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

2 The behavior of these functions is affected by the current locale. Those functions that
have locale-specific aspects only when not in the "C" locale are noted below.

3 The term printing character refers to a member of a locale-specific set of characters, each
of which occupies one printing position on a display device; the term control character
refers to a member of a locale-specific set of characters that are not printing
characters. All letters and digits are printing characters.

Forward references: EOF (7.19.1), localization (7.11).

7.4.1 Character classification functions

1 The functions in this subclause return nonzero (true) if and only if the value of the
argument c conforms to that in the description of the function.


I think this agrees with what Antoine has said.
msg206819 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2013-12-22 15:35
IOW, I also support closing this issue. :)
msg206820 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-12-22 16:03
Yes, I definitely think this falls into the category of platform bugs, and we only maintain workarounds for those for "mainstream" OSes.  Others need to maintain their own local patches, just as for any other changes that are required to get Python working on those platforms.  (A platform's status can change over time, of course, but this is the category illumos currently falls into.)
History
Date                 User               Action  Args
2022-04-11 14:57:55  admin              set     github: 64248
2013-12-22 16:03:05  r.david.murray     set     status: open -> closed; resolution: rejected; messages: + msg206820; stage: resolved
2013-12-22 15:35:17  skrah              set     messages: + msg206819
2013-12-22 15:34:02  skrah              set     nosy: + skrah; messages: + msg206818
2013-12-22 15:03:51  pitrou             set     messages: + msg206817
2013-12-22 14:48:53  pitrou             set     messages: + msg206816
2013-12-22 14:37:02  pitrou             set     messages: + msg206814
2013-12-22 07:31:05  Alexander.Pyhalov  set     messages: + msg206809
2013-12-22 07:00:53  Alexander.Pyhalov  set     messages: + msg206808
2013-12-21 23:25:13  pitrou             set     nosy: + pitrou; messages: + msg206795
2013-12-21 23:21:36  r.david.murray     set     nosy: + r.david.murray; messages: + msg206794
2013-12-21 21:38:37  Alexander.Pyhalov  create