classification
Title: Rationalize isdigit / isalpha / tolower / ... uses throughout Python source
Type: enhancement Stage: needs patch
Components: Interpreter Core Versions: Python 3.1, Python 2.7
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: eric.smith Nosy List: eric.smith, mark.dickinson
Priority: normal Keywords: easy

Created on 2009-04-19 12:57 by mark.dickinson, last changed 2009-04-27 21:13 by eric.smith. This issue is now closed.

Messages (5)
msg86170 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2009-04-19 12:57
Problem: the standard C character handling functions from ctype.h 
(isalpha, isdigit, isxdigit, isspace, toupper, tolower, etc.) are locale 
aware, but for almost all uses CPython needs locale-unaware versions of 
these.

There are various solutions in the current source:

- there's a file Include/bytes_methods.h which provides suitable 
ISDIGIT/ISALPHA/... macros, but also undefines the standard functions.
As it is, it can't be included in Python.h since that would break
3rd party code that includes Python.h and also uses isdigit.

- some files have their own solution:  Python/pystrtod.c defines
its own (probably inefficient) ISDIGIT and ISSPACE macros.

- in some places the standard C functions are just used directly (and 
possibly incorrectly).  A gotcha here is that one has to remember to wrap 
the argument in Py_CHARMASK; otherwise, on platforms where char is signed, 
a negative value can be passed to the function, which is an error.  (See 
issue 3633 for an example.)
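
The Py_CHARMASK gotcha in the last item can be sketched as follows. This is only an illustration, not code from the CPython source: CHARMASK_SKETCH and safe_isdigit are hypothetical names, and the mask definition is an assumption modelled on the usual "reduce to 0..255" idiom.

```c
#include <ctype.h>

/* Hypothetical stand-in for a Py_CHARMASK-style macro: force the value
   into the range 0..255 before handing it to a <ctype.h> function. */
#define CHARMASK_SKETCH(c) ((unsigned char)((c) & 0xff))

/* Passing a plain char straight to isdigit() is undefined behaviour when
   char is signed and the value is negative (bytes >= 0x80).  Masking
   first keeps the argument representable as an unsigned char. */
static int safe_isdigit(char c)
{
    return isdigit(CHARMASK_SKETCH(c)) != 0;
}
```

Note that the call is still locale-aware; the mask only fixes the argument range.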

It would be nice to clean all this up, and have one central, efficient, 
easy-to-use set of Py_ISDIGIT/Py_ISALPHA ... locale-independent macros (or 
functions) that could be used safely throughout the Python source.
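
A minimal sketch of what such a set of macros might look like, assuming an ASCII-compatible character set. The LI_ prefix and all macro names are hypothetical, chosen so they are not mistaken for the names eventually used. Each macro masks its argument itself, so callers cannot forget to do it, and none of them consults the locale.

```c
/* Hypothetical, locale-independent character macros (ASCII assumed).
   Caveat: the argument is evaluated more than once, so it must not
   have side effects. */
#define LI_MASK(c)     ((unsigned char)((c) & 0xff))
#define LI_ISDIGIT(c)  (LI_MASK(c) >= '0' && LI_MASK(c) <= '9')
#define LI_ISLOWER(c)  (LI_MASK(c) >= 'a' && LI_MASK(c) <= 'z')
#define LI_ISUPPER(c)  (LI_MASK(c) >= 'A' && LI_MASK(c) <= 'Z')
#define LI_ISALPHA(c)  (LI_ISLOWER(c) || LI_ISUPPER(c))
/* Space plus the five control characters '\t' .. '\r' (9 .. 13). */
#define LI_ISSPACE(c)  (LI_MASK(c) == ' ' || \
                        (LI_MASK(c) >= '\t' && LI_MASK(c) <= '\r'))
#define LI_TOLOWER(c)  (LI_ISUPPER(c) ? LI_MASK(c) + ('a' - 'A') : LI_MASK(c))
#define LI_TOUPPER(c)  (LI_ISLOWER(c) ? LI_MASK(c) - ('a' - 'A') : LI_MASK(c))
```

Range comparisons keep the sketch short; a lookup table indexed by the masked value would be the more efficient choice.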
msg86173 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2009-04-19 15:26
I concur. I've also been bitten by forgetting Py_CHARMASK, so a single
version that took this into account (and was locale-unaware) would be
welcome.

In private mail I'd mentioned that if these are functions, they should
take int. But I now think that's incorrect, and they should take char or
unsigned char. I think the standard C functions take int because they
also allow EOF. I think the Py_ versions should allow only characters
and not allow EOF. Py_CHARMASK already rules EOF out anyway: masking it
just yields some ordinary character value, so the result for EOF is
likely meaningless.
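
To illustrate the EOF point: a minimal sketch, assuming a mask of the form "(c) & 0xff" cast to unsigned char and an EOF value of -1 (the common case, though the C standard only requires EOF to be negative). MASK_SKETCH and masked_value are hypothetical names.

```c
#include <stdio.h>   /* EOF */

/* Hypothetical stand-in for a Py_CHARMASK-style macro. */
#define MASK_SKETCH(c) ((unsigned char)((c) & 0xff))

/* Returns the value that a masking classification macro would actually
   look at.  With EOF == -1 this is 255, i.e. EOF becomes
   indistinguishable from the byte 0xFF. */
static int masked_value(int c)
{
    return MASK_SKETCH(c);
}
```

So a macro that masks its argument cannot report "this was EOF"; it can only classify character values.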
msg86293 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2009-04-22 12:33
Also, see _toupper/_tolower in Objects/stringlib/stringdef.h and
Objects/stringobject.c. Those should be rationalized as well.
msg86668 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2009-04-27 14:50
I'll implement this by adding a pyctype.h and pyctype.c, mimicking
<ctype.h>. I'll essentially copy and rename the methods in
bytes_methods.[ch], then change bytes_methods.h to refer to the new
versions, for backward compatibility.
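
The layering described here could look roughly like the sketch below. It is not the checked-in code: the TBL_ names, the flag values, and the run-time table initialisation are all assumptions made for illustration (a real implementation would more likely ship a precomputed static table). ASCII is assumed.

```c
/* --- sketch of the new, prefixed layer (hypothetical names) --- */
enum { TBL_CTF_LOWER = 0x01, TBL_CTF_UPPER = 0x02, TBL_CTF_DIGIT = 0x04 };

static unsigned char tbl_ctype_table[256];

/* Fill the flag table once; every entry not set here stays 0. */
static void tbl_ctype_init(void)
{
    int c;
    for (c = 0; c < 256; c++) tbl_ctype_table[c] = 0;
    for (c = 'a'; c <= 'z'; c++) tbl_ctype_table[c] |= TBL_CTF_LOWER;
    for (c = 'A'; c <= 'Z'; c++) tbl_ctype_table[c] |= TBL_CTF_UPPER;
    for (c = '0'; c <= '9'; c++) tbl_ctype_table[c] |= TBL_CTF_DIGIT;
}

#define TBL_CHARMASK(c) ((unsigned char)((c) & 0xff))
#define TBL_ISDIGIT(c)  (tbl_ctype_table[TBL_CHARMASK(c)] & TBL_CTF_DIGIT)
#define TBL_ISALPHA(c)  (tbl_ctype_table[TBL_CHARMASK(c)] & \
                         (TBL_CTF_LOWER | TBL_CTF_UPPER))

/* --- sketch of the older header: old names forward to the new ones --- */
#define ISDIGIT(c) TBL_ISDIGIT(c)
#define ISALPHA(c) TBL_ISALPHA(c)
```

Existing callers of ISDIGIT/ISALPHA keep compiling unchanged, while new code can use the prefixed names directly.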
msg86698 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2009-04-27 21:13
Checked in to trunk (r72040) and py3k (r72044).

Windows buildbots look okay, closing.
History
Date                 User            Action  Args
2009-04-27 21:13:25  eric.smith      set     status: open -> closed; resolution: accepted; messages: + msg86698
2009-04-27 14:50:02  eric.smith      set     assignee: eric.smith; messages: + msg86668
2009-04-22 12:33:38  eric.smith      set     messages: + msg86293
2009-04-19 15:26:38  eric.smith      set     messages: + msg86173
2009-04-19 12:57:45  mark.dickinson  create