Issue9036
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2010-06-20 13:12 by skrah, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
py_charmask.patch | skrah, 2010-06-20 13:12 | |||
py_charmask2.patch | skrah, 2010-07-17 17:40 | |||
unicodeobject_py_charmask.patch | skrah, 2010-07-19 16:53 |
Messages (12) | |||
---|---|---|---|
msg108234 - (view) | Author: Stefan Krah (skrah) * ![]() |
Date: 2010-06-20 13:12 | |
This feature request is extracted from issue 9020, where Py_CHARMASK(c) was used on EOF, producing different results depending on whether __CHAR_UNSIGNED__ is defined or not. The preferred fix for issue 9020 is to check for EOF before using Py_CHARMASK, so this is only loosely related. I've looked through the source tree to determine how Py_CHARMASK is meant to be used. It seems that it is only used in situations where one would actually want to cast a char to an unsigned char, like isspace((unsigned char)c). Simplifications: 1) Python.h already enforces that an unsigned char is 8 bit wide. Thus, ((unsigned char)((c) & 0xff)) and ((unsigned char)(c)) should produce the same results. 2) There is no reason not to do the cast when __CHAR_UNSIGNED__ is defined (it will be a no-op). |
|||
msg108235 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2010-06-20 13:27 | |
> Thus, > ((unsigned char)((c) & 0xff)) and ((unsigned char)(c)) should produce > the same results. If it's the case, it's probably optimized away by the compiler anyway. > There is no reason not to do the cast when __CHAR_UNSIGNED__ is > defined (it will be a no-op). Agreed. It also reduces the opportunity for bugs :) |
|||
msg108513 - (view) | Author: Stefan Krah (skrah) * ![]() |
Date: 2010-06-24 14:47 | |
Antoine Pitrou <report@bugs.python.org> wrote: > > Thus, > > ((unsigned char)((c) & 0xff)) and ((unsigned char)(c)) should produce > > the same results. > > If it's the case, it's probably optimized away by the compiler anyway. Yes, the asm output is the same. I was more concerned about readability. Historically, Py_CHARMASK started as ((c) & 0xff), hence the name. In r34974 UCHAR_MAX = 2**CHAR_BIT-1 = 255 was enforced. In r61936 the cast was added. > > There is no reason not to do the cast when __CHAR_UNSIGNED__ is > > defined (it will be a no-op). > > Agreed. It also reduces the opportunity for bugs :) Ok, let's say we use ((unsigned char)((c) & 0xff)) also for __CHAR_UNSIGNED__. What should the comment say about the intended argument? I've examined all occurrences in the tree and almost always Py_CHARMASK is used with either char arguments or int arguments that are directly derived from a char, so no EOF issues arise. Exceptions: =========== Index: Objects/unicodeobject.c =================================================================== --- Objects/unicodeobject.c (revision 82192) +++ Objects/unicodeobject.c (working copy) @@ -8417,6 +8417,7 @@ else if (c >= '0' && c <= '9') { prec = c - '0'; while (--fmtcnt >= 0) { + /* XXX: c and *fmt are Py_UNICODE */ c = Py_CHARMASK(*fmt++); if (c < '0' || c > '9') break; Index: Modules/_json.c =================================================================== --- Modules/_json.c (revision 82192) +++ Modules/_json.c (working copy) @@ -603,6 +603,7 @@ } } else { + /* XXX: c is Py_UNICODE */ char c_char = Py_CHARMASK(c); chunk = PyString_FromStringAndSize(&c_char, 1); if (chunk == NULL) { |
|||
msg108514 - (view) | Author: Stefan Krah (skrah) * ![]() |
Date: 2010-06-24 14:53 | |
> In r61936 the cast was added. Actually r61987 |
|||
msg108521 - (view) | Author: Marc-Andre Lemburg (lemburg) * ![]() |
Date: 2010-06-24 15:25 | |
Stefan Krah wrote: > > Stefan Krah <stefan-usenet@bytereef.org> added the comment: > > Antoine Pitrou <report@bugs.python.org> wrote: >>> Thus, >>> ((unsigned char)((c) & 0xff)) and ((unsigned char)(c)) should produce >>> the same results. >> >> If it's the case, it's probably optimized away by the compiler anyway. > > Yes, the asm output is the same. I was more concerned about readability. > > Historically, Py_CHARMASK started as ((c) & 0xff), hence the name. In r34974 > UCHAR_MAX = 2**CHAR_BIT-1 = 255 was enforced. In r61936 the cast was added. > >>> There is no reason not to do the cast when __CHAR_UNSIGNED__ is >>> defined (it will be a no-op). >> >> Agreed. It also reduces the opportunity for bugs :) > > Ok, let's say we use ((unsigned char)((c) & 0xff)) also for __CHAR_UNSIGNED__. > > What should the comment say about the intended argument? I've examined all > occurrences in the tree and almost always Py_CHARMASK is used with either > char arguments or int arguments that are directly derived from a char, so > no EOF issues arise. Why not just make the casts in those cases explicit instead of using Py_CHARMASK ? > Exceptions: > =========== > > Index: Objects/unicodeobject.c > =================================================================== > --- Objects/unicodeobject.c (revision 82192) > +++ Objects/unicodeobject.c (working copy) > @@ -8417,6 +8417,7 @@ > else if (c >= '0' && c <= '9') { > prec = c - '0'; > while (--fmtcnt >= 0) { > + /* XXX: c and *fmt are Py_UNICODE */ > c = Py_CHARMASK(*fmt++); If both are Py_UNICODE, why do you need the cast at all ? Note that the above use also appears to be wrong, since if *fmt is Py_UNICODE, the following if() will also match if the higher order bytes of the Py_UNICODE value are non-0. > if (c < '0' || c > '9') > break; > > Index: Modules/_json.c > =================================================================== > --- Modules/_json.c (revision 82192) > +++ Modules/_json.c (working copy) > @@ -603,6 +603,7 @@ > } > } > else { > + /* XXX: c is Py_UNICODE */ > char c_char = Py_CHARMASK(c); In this case you should get a compiler warning due to the different signedness of the arguments. Same comment as above: the high order bytes of c are lost in this conversion. Why can't _json.c use one of the existing Unicode conversion APIs ? > chunk = PyString_FromStringAndSize(&c_char, 1); > if (chunk == NULL) { > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue9036> > _______________________________________ > _______________________________________________ > Python-bugs-list mailing list > Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/mal%40egenix.com |
|||
msg108528 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2010-06-24 16:27 | |
> Ok, let's say we use ((unsigned char)((c) & 0xff)) also for > __CHAR_UNSIGNED__. > > What should the comment say about the intended argument? That it's either in [-128; 127] or in [0; 255] ? > Index: Objects/unicodeobject.c > =================================================================== > --- Objects/unicodeobject.c (revision 82192) > +++ Objects/unicodeobject.c (working copy) > @@ -8417,6 +8417,7 @@ > else if (c >= '0' && c <= '9') { > prec = c - '0'; > while (--fmtcnt >= 0) { > + /* XXX: c and *fmt are Py_UNICODE */ > c = Py_CHARMASK(*fmt++); This is a funny bug: >>> u"%.1\u1032f" % (1./3) u'0.333333333333' > Index: Modules/_json.c > =================================================================== > --- Modules/_json.c (revision 82192) > +++ Modules/_json.c (working copy) > @@ -603,6 +603,7 @@ > } > } > else { > + /* XXX: c is Py_UNICODE */ > char c_char = Py_CHARMASK(c); This block can only be entered if c <= 0x7f (`has_unicode` is false), so it's not a problem. |
|||
msg110585 - (view) | Author: Stefan Krah (skrah) * ![]() |
Date: 2010-07-17 17:40 | |
[Marc-Andre] > Why not just make the casts in those cases explicit instead of > using Py_CHARMASK ? I agree that this would be the cleanest solution. It's harder to get someone to review a larger patch though. Antoine, I understood that you would prefer to leave the mask. Could I apply the second version of the patch? It will probably have an immediate benefit. On the ARM buildbot char is unsigned and gcc emits a whole page of spurious 'array subscript has type 'char' warnings. |
|||
msg110728 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2010-07-19 10:37 | |
> Antoine, I understood that you would prefer to leave the mask. Could I > apply the second version of the patch? It looks ok to me. |
|||
msg110780 - (view) | Author: Stefan Krah (skrah) * ![]() |
Date: 2010-07-19 16:53 | |
Antoine, thanks! Committed in r82966, r82968, r82969 and r82970. Could we fix the unicodeobject.c bug on the fly? I think the patch should do, unless non-ascii digits are supported. But in that case several other changes would be needed. |
|||
msg110783 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2010-07-19 17:11 | |
Eric, is the unicode patch ok for you? |
|||
msg110785 - (view) | Author: Eric V. Smith (eric.smith) * ![]() |
Date: 2010-07-19 17:20 | |
Yes, the unicode patch looks okay to me. |
|||
msg110891 - (view) | Author: Stefan Krah (skrah) * ![]() |
Date: 2010-07-20 12:13 | |
Antoine, Eric, thanks for looking at the patch. unicodeobject.c patch committed in r82978, r82979, r82980 and r82984. As Antoine pointed out, the _json.c issue is not a problem. Also, it is not present in py3k. The reliable buildbots look ok, so I think we can close this. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:02 | admin | set | github: 53282 |
2010-07-20 12:13:36 | skrah | set | status: open -> closed resolution: fixed messages: + msg110891 stage: patch review -> resolved |
2010-07-19 17:20:54 | eric.smith | set | messages: + msg110785 |
2010-07-19 17:11:15 | pitrou | set | messages: + msg110783 |
2010-07-19 16:53:52 | skrah | set | files:
+ unicodeobject_py_charmask.patch messages: + msg110780 |
2010-07-19 10:37:22 | pitrou | set | messages: + msg110728 |
2010-07-17 17:40:27 | skrah | set | files:
+ py_charmask2.patch messages: + msg110585 |
2010-06-24 16:27:59 | pitrou | set | messages: + msg108528 |
2010-06-24 15:25:04 | lemburg | set | nosy:
+ lemburg messages: + msg108521 |
2010-06-24 14:53:22 | skrah | set | messages: + msg108514 |
2010-06-24 14:47:29 | skrah | set | messages: + msg108513 |
2010-06-20 13:27:02 | pitrou | set | messages: + msg108235 |
2010-06-20 13:12:06 | skrah | create |