This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Simplify Py_CHARMASK
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: eric.smith, lemburg, loewis, pitrou, skrah
Priority: normal Keywords: needs review, patch

Created on 2010-06-20 13:12 by skrah, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
py_charmask.patch skrah, 2010-06-20 13:12
py_charmask2.patch skrah, 2010-07-17 17:40
unicodeobject_py_charmask.patch skrah, 2010-07-19 16:53
Messages (12)
msg108234 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-06-20 13:12
This feature request is extracted from issue 9020, where Py_CHARMASK(c)
was used on EOF, producing different results depending on whether __CHAR_UNSIGNED__ is defined or not.

The preferred fix for issue 9020 is to check for EOF before using
Py_CHARMASK, so this is only loosely related.


I've looked through the source tree to determine how Py_CHARMASK
is meant to be used. It seems that it is only used in situations
where one would actually want to cast a char to an unsigned char,
like isspace((unsigned char)c).


Simplifications:

1) Python.h already enforces that an unsigned char is 8 bit wide. Thus,
   ((unsigned char)((c) & 0xff)) and ((unsigned char)(c)) should produce
   the same results.

2) There is no reason not to do the cast when __CHAR_UNSIGNED__ is
   defined (it will be a no-op).
msg108235 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-20 13:27
> Thus,
> ((unsigned char)((c) & 0xff)) and ((unsigned char)(c)) should produce
> the same results.

If it's the case, it's probably optimized away by the compiler anyway.

> There is no reason not to do the cast when __CHAR_UNSIGNED__ is
> defined (it will be a no-op).

Agreed. It also reduces the opportunity for bugs :)
msg108513 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-06-24 14:47
Antoine Pitrou <report@bugs.python.org> wrote:
> > Thus,
> > ((unsigned char)((c) & 0xff)) and ((unsigned char)(c)) should produce
> > the same results.
> 
> If it's the case, it's probably optimized away by the compiler anyway.

Yes, the asm output is the same. I was more concerned about readability.

Historically, Py_CHARMASK started as ((c) & 0xff), hence the name. In r34974
UCHAR_MAX = 2**CHAR_BIT-1 = 255 was enforced. In r61936 the cast was added.

> > There is no reason not to do the cast when __CHAR_UNSIGNED__ is
> > defined (it will be a no-op).
> 
> Agreed. It also reduces the opportunity for bugs :)

Ok, let's say we use ((unsigned char)((c) & 0xff)) also for __CHAR_UNSIGNED__.

What should the comment say about the intended argument? I've examined all
occurrences in the tree and almost always Py_CHARMASK is used with either
char arguments or int arguments that are directly derived from a char, so
no EOF issues arise.

Exceptions:
===========

Index: Objects/unicodeobject.c
===================================================================
--- Objects/unicodeobject.c     (revision 82192)
+++ Objects/unicodeobject.c     (working copy)
@@ -8417,6 +8417,7 @@
                 else if (c >= '0' && c <= '9') {
                     prec = c - '0';
                     while (--fmtcnt >= 0) {
+                        /* XXX: c and *fmt are Py_UNICODE */
                         c = Py_CHARMASK(*fmt++);
                         if (c < '0' || c > '9')
                             break;

Index: Modules/_json.c
===================================================================
--- Modules/_json.c     (revision 82192)
+++ Modules/_json.c     (working copy)
@@ -603,6 +603,7 @@
             }
         }
         else {
+            /* XXX: c is Py_UNICODE */
             char c_char = Py_CHARMASK(c);
             chunk = PyString_FromStringAndSize(&c_char, 1);
             if (chunk == NULL) {
msg108514 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-06-24 14:53
> In r61936 the cast was added.

Actually r61987
msg108521 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-24 15:25
Stefan Krah wrote:
> 
> Stefan Krah <stefan-usenet@bytereef.org> added the comment:
> 
> Antoine Pitrou <report@bugs.python.org> wrote:
>>> Thus,
>>> ((unsigned char)((c) & 0xff)) and ((unsigned char)(c)) should produce
>>> the same results.
>>
>> If it's the case, it's probably optimized away by the compiler anyway.
> 
> Yes, the asm output is the same. I was more concerned about readability.
> 
> Historically, Py_CHARMASK started as ((c) & 0xff), hence the name. In r34974
> UCHAR_MAX = 2**CHAR_BIT-1 = 255 was enforced. In r61936 the cast was added.
> 
>>> There is no reason not to do the cast when __CHAR_UNSIGNED__ is
>>> defined (it will be a no-op).
>>
>> Agreed. It also reduces the opportunity for bugs :)
> 
> Ok, let's say we use ((unsigned char)((c) & 0xff)) also for __CHAR_UNSIGNED__.
> 
> What should the comment say about the intended argument? I've examined all
> occurrences in the tree and almost always Py_CHARMASK is used with either
> char arguments or int arguments that are directly derived from a char, so
> no EOF issues arise.

Why not just make the casts in those cases explicit instead of
using Py_CHARMASK ?

> Exceptions:
> ===========
> 
> Index: Objects/unicodeobject.c
> ===================================================================
> --- Objects/unicodeobject.c     (revision 82192)
> +++ Objects/unicodeobject.c     (working copy)
> @@ -8417,6 +8417,7 @@
>                  else if (c >= '0' && c <= '9') {
>                      prec = c - '0';
>                      while (--fmtcnt >= 0) {
> +                        /* XXX: c and *fmt are Py_UNICODE */
>                          c = Py_CHARMASK(*fmt++);

If both are Py_UNICODE, why do you need the cast at all ?

Note that the above use also appears to be wrong, since if *fmt
is Py_UNICODE, the following if() will also match if the higher
order bytes of the Py_UNICODE value are non-0.

>                          if (c < '0' || c > '9')
>                              break;
> 
> Index: Modules/_json.c
> ===================================================================
> --- Modules/_json.c     (revision 82192)
> +++ Modules/_json.c     (working copy)
> @@ -603,6 +603,7 @@
>              }
>          }
>          else {
> +            /* XXX: c is Py_UNICODE */
>              char c_char = Py_CHARMASK(c);

In this case you should get a compiler warning due to the
different signedness of the arguments.

Same comment as above: the high order bytes of c are lost
in this conversion.

Why can't _json.c use one of the existing Unicode conversion
APIs ?

>              chunk = PyString_FromStringAndSize(&c_char, 1);
>              if (chunk == NULL) {
> 
> ----------
> 
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue9036>
> _______________________________________
> _______________________________________________
> Python-bugs-list mailing list
> Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/mal%40egenix.com
msg108528 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-24 16:27
> Ok, let's say we use ((unsigned char)((c) & 0xff)) also for
> __CHAR_UNSIGNED__.
> 
> What should the comment say about the intended argument?

That it's either in [-128; 127] or in [0; 255] ?

> Index: Objects/unicodeobject.c
> ===================================================================
> --- Objects/unicodeobject.c     (revision 82192)
> +++ Objects/unicodeobject.c     (working copy)
> @@ -8417,6 +8417,7 @@
>                  else if (c >= '0' && c <= '9') {
>                      prec = c - '0';
>                      while (--fmtcnt >= 0) {
> +                        /* XXX: c and *fmt are Py_UNICODE */
>                          c = Py_CHARMASK(*fmt++);

This is a funny bug:

>>> u"%.1\u1032f" % (1./3)
u'0.333333333333'

> Index: Modules/_json.c
> ===================================================================
> --- Modules/_json.c     (revision 82192)
> +++ Modules/_json.c     (working copy)
> @@ -603,6 +603,7 @@
>              }
>          }
>          else {
> +            /* XXX: c is Py_UNICODE */
>              char c_char = Py_CHARMASK(c);

This block can only be entered if c <= 0x7f (`has_unicode` is false), so it's not a problem.
msg110585 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-07-17 17:40
[Marc-Andre]
> Why not just make the casts in those cases explicit instead of
> using Py_CHARMASK ?

I agree that this would be the cleanest solution. It's harder to get
someone to review a larger patch though.


Antoine, I understood that you would prefer to leave the mask. Could I
apply the second version of the patch?

It will probably have an immediate benefit. On the ARM buildbot char
is unsigned and gcc emits a whole page of spurious 'array subscript has
type 'char' warnings.
msg110728 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-07-19 10:37
> Antoine, I understood that you would prefer to leave the mask. Could I
> apply the second version of the patch?

It looks ok to me.
msg110780 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-07-19 16:53
Antoine, thanks! Committed in r82966, r82968, r82969 and r82970.


Could we fix the unicodeobject.c bug on the fly? I think the patch should
do, unless non-ascii digits are supported. But in that case several other
changes would be needed.
msg110783 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-07-19 17:11
Eric, is the unicode patch ok for you?
msg110785 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2010-07-19 17:20
Yes, the unicode patch looks okay to me.
msg110891 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-07-20 12:13
Antoine, Eric, thanks for looking at the patch. unicodeobject.c patch
committed in r82978, r82979, r82980 and r82984.

As Antoine pointed out, the _json.c issue is not a problem. Also, it is
not present in py3k. The reliable buildbots look ok, so I think we can
close this.
History
Date User Action Args
2022-04-11 14:57:02adminsetgithub: 53282
2010-07-20 12:13:36skrahsetstatus: open -> closed
resolution: fixed
messages: + msg110891

stage: patch review -> resolved
2010-07-19 17:20:54eric.smithsetmessages: + msg110785
2010-07-19 17:11:15pitrousetmessages: + msg110783
2010-07-19 16:53:52skrahsetfiles: + unicodeobject_py_charmask.patch

messages: + msg110780
2010-07-19 10:37:22pitrousetmessages: + msg110728
2010-07-17 17:40:27skrahsetfiles: + py_charmask2.patch

messages: + msg110585
2010-06-24 16:27:59pitrousetmessages: + msg108528
2010-06-24 15:25:04lemburgsetnosy: + lemburg
messages: + msg108521
2010-06-24 14:53:22skrahsetmessages: + msg108514
2010-06-24 14:47:29skrahsetmessages: + msg108513
2010-06-20 13:27:02pitrousetmessages: + msg108235
2010-06-20 13:12:06skrahcreate