msg97334 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-01-07 01:07 |
On Py2.x u'%c' % char returns the wrong result if char is in range '\x80'-'\xFF':
>>> u'%c' % '\x7f'
u'\x7f' # correct
>>> u'%c' % '\x80'
u'\uff80' # broken
>>> u'%c' % '\xFF'
u'\uffff' # broken
This is what happens whit %s and 0x80:
>>> u'%s' % '\x80'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal
not in range(128)
>>> u'%c' % 0x80
u'\x80'
u'%c' % '\x80' should raise the same error raised by u'%s' % '\x80'.
|
msg97336 - (view) |
Author: Eric V. Smith (eric.smith) *  |
Date: 2010-01-07 01:41 |
This looks like a signed vs. unsigned char problem. What platform are you on?
|
msg97338 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-01-07 01:55 |
The problem is the cast char => Py_UNICODE in formatchar() function. char may be unsigned, but most of the time it's signed. signed => unsigned cast extend the sign. Example: (unsigned short)(signed char)0x80 gives 0xFF80.
Attached patch fixes the issue and includes an unit test.
|
msg97339 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-01-07 02:01 |
The problem is specific to Python 2.x. With Python3, "%c" expects one unicode character (eg. "a").
My patch fixes the char => Py_UNICODE conversion, but raising an error is maybe better to be consistent with u"%s" % "\x80" (and prepare the migration the Python3).
|
msg97340 - (view) |
Author: Eric V. Smith (eric.smith) *  |
Date: 2010-01-07 02:48 |
Shouldn't it work the same as it does for integers?
>>> u'%c' % 0x7f
u'\x7f'
>>> u'%c' % '\x7f'
u'\x7f'
>>> u'%c' % 0x80
u'\x80'
>>> u'%c' % '\x80'
u'\uff80'
That would imply to me it shouldn't be an error, it should just return u'\x80'.
|
msg97349 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-01-07 11:49 |
If we allow it to work on 2.7 the code will break:
1) when ported to Py3, where mixing bytes strings and unicode is not allowed;
2) when used on Py<2.7, where the behavior is broken;
3) when converted to str.format, where only ints are accepted.
If we raise an error, the user will have to either use unicode strings or ints and he will avoid further problems. Moreover, the fact that no one noticed this problems means that is not a common operation, so no one probably expects it to work anyway and raising an error could even help to find the problem if someone used %c in older versions.
|
msg97350 - (view) |
Author: Eric V. Smith (eric.smith) *  |
Date: 2010-01-07 11:55 |
There's no perfect answer. And since we've gotten this far without anyone caring, and 2.7 is basically the end of life for this issue, perhaps doing nothing is the best course. Any change we make will affect code that runs in both 2.6 and 2.7, doing nothing preserves compatibility on the 2.x side.
|
msg97353 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2010-01-07 13:02 |
Tested on 2.5...
~ $ python2.5
Python 2.5.2 (r252:60911, Jan 4 2009, 21:59:32)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'%c' % '\x80'
u'\Uffffff80'
>>>
On 2.7 (trunk) I get same behaviour as described on msg97334.
|
msg97354 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-01-07 13:35 |
First patch that makes u'%c' % '\x80' raise a UnicodeDecodeError.
I could reproduce the problem on Linux 32/64bit, Windows 32bit and Python from 2.4 to 2.7. The '\Uffffff80' is returned by wide builds.
|
msg98108 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-01-21 11:43 |
@Ezio: Your patch leaks a reference: PyUnicode_FromString(...) is not destroyed (Py_DECREF) on success.
|
msg99811 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-02-22 19:00 |
New patch (issue7649v2.diff) with refleak fixed and improved unittests.
|
msg99856 - (view) |
Author: Eric V. Smith (eric.smith) *  |
Date: 2010-02-22 22:20 |
Could you add a comment on why you're calling PyUnicode_FromString and then throwing away the result? I believe it's so you'll get the same error checking as PyUnicode_FromString, but it's sufficiently tricky that I think it deserves a comment.
Now that I think about it, having done the conversion in PyUnicode_FromString, why not use:
buf[0] = PyUnicode_AS_UNICODE(s)[0];
? This way you skip the cast and you know for a fact you're using the same interpretation as PyUnicode_FromString, having used its output.
Also, since you know the size you may as well call PyUnicode_FromStringAndSize. It's trivially faster and makes the intent clearer, I think.
|
msg99900 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-02-23 06:18 |
Indeed some comments would be helpful, I'll add them.
I also tried already to reuse 's' and extract the first char using unicode_getitem, but it returns a PyObject and anyway it's probably more expensive/complicated than calling PyString_AS_STRING again. I'll try with [0] and/or with PyUnicode_FromStringAndSize and make a new patch this afternoon.
|
msg99912 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-02-23 13:45 |
Here's a new patch that uses PyUnicode_FromStringAndSize and PyUnicode_AS_UNICODE(s)[0]. I also added some comments and tested it and it seems to work fine.
|
msg99914 - (view) |
Author: Eric V. Smith (eric.smith) *  |
Date: 2010-02-23 13:51 |
The patch looks good to me, in that it implements your desired functionality well.
I haven't been following the issue closely enough to know whether or not this new functionality is the right way to go. (I'm not saying it's not, just that I don't have an opinion on it.)
|
msg99973 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-02-23 23:23 |
Commited: r78392 (r78394).
|
msg100010 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-02-24 08:15 |
> --- python/trunk/Objects/unicodeobject.c (original)
> > +++ python/trunk/Objects/unicodeobject.c Wed Feb 24 00:16:07 2010
> > @@ -8170,6 +8170,7 @@
> > size_t buflen,
> > PyObject *v)
> > {
> > + PyObject *s;
> > /* presume that the buffer is at least 2 characters long */
> > if (PyUnicode_Check(v)) {
> > if (PyUnicode_GET_SIZE(v) != 1)
> > @@ -8180,7 +8181,14 @@
> > else if (PyString_Check(v)) {
> > if (PyString_GET_SIZE(v) != 1)
> > goto onError;
> > - buf[0] = (Py_UNICODE)PyString_AS_STRING(v)[0];
> > + /* #7649: if the char is a non-ascii (i.e. in range(0x80,0x100)) byte
> > + string, "u'%c' % char" should fail with a UnicodeDecodeError */
> > + s = PyUnicode_FromStringAndSize(PyString_AS_STRING(v), 1);
> > + /* if the char is not decodable return -1 */
> > + if (s == NULL)
> > + return -1;
> > + buf[0] = PyUnicode_AS_UNICODE(s)[0];
> > + Py_DECREF(s);
That's a *very* inefficient way of doing this.
Could you please check for chars above 0x7f first and then use
PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize()
API (this API should not have been backported from the Python 3.x
in Python 2.6, but it's too late to change that now) !
Thanks.
|
msg100018 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2010-02-24 09:44 |
> Could you please check for chars above 0x7f first and then use
> PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API
I concur: PyUnicode_FromStringAndSize() decodes with utf-8 whereas the expected conversion char->unicode should use the default encoding (ascii).
But why is it necessary to check for chars above 0x7f?
> (this API should not have been backported from the Python 3.x
> in Python 2.6,
This function is still useful when the chars come from a C string literal in the source code (btw there should be something about the encoding used in C files). But it's not always correctly used even in 3.x, in posixmodule.c for example.
|
msg100021 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-02-24 10:02 |
Amaury Forgeot d'Arc wrote:
>
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
>> Could you please check for chars above 0x7f first and then use
>> PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API
>
> I concur: PyUnicode_FromStringAndSize() decodes with utf-8 whereas the expected conversion char->unicode should use the default encoding (ascii).
> But why is it necessary to check for chars above 0x7f?
The Python default encoding has to be ASCII compatible,
so it's better to use a short-cut for pure-ASCII characters
and avoid the complete round-trip via a temporary Unicode
object.
>> (this API should not have been backported from the Python 3.x
>> in Python 2.6,
> This function is still useful when the chars come from a C string literal in the source code (btw there should be something about the encoding used in C files). But it's not always correctly used even in 3.x, in posixmodule.c for example.
The function is a really just yet another interface to the
PyUnicode_DecodeUTF8() API and it's name is misleading in that:
Python 2.x uses the default encoding for converting strings without
known encoding to Unicode, the docs for the API say that
it decodes Latin-1 (!) and the interface makes it looks like
a drop-in replacement for PyString_FromStringAndSize() which
it isn't for Python 2.x.
For Python 3.x, the default encoding is fixed to UTF-8, so the
situation is different (though the docs are still wrong),
however I don't see the advantage of using a less explicit
name over the direct use of PyUnicode_DecodeUTF8().
|
msg100023 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2010-02-24 10:39 |
> > But why is it necessary to check for chars above 0x7f?
> The Python default encoding has to be ASCII compatible,
Yes, but it is not necessarily as strict.
for example, after I manage to set the default encoding to latin-1,
u"%s" % chr(0x80) works; I suppose %c should do the same.
|
msg100024 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-02-24 10:50 |
Amaury Forgeot d'Arc wrote:
>
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
>>> But why is it necessary to check for chars above 0x7f?
>> The Python default encoding has to be ASCII compatible,
> Yes, but it is not necessarily as strict.
> for example, after I manage to set the default encoding to latin-1,
> u"%s" % chr(0x80) works; I suppose %c should do the same.
Right and it will.
I only meant the check as fast-path for characters <0x80. All
other characters would need to go through PyUnicode_Decode().
We can use such a fast-path, since the default encoding has to
be ASCII compatible.
|
msg100025 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-02-24 10:59 |
I was trying to decode mainly to get a UnicodeDecodeError automatically.
If I check if the char is not in the ASCII range (i.e. >0x7F) I think I'd have to set the error message for the UnicodeDecodeError manually and possibly duplicate it (unless we use a different error message that says that %c accepts only ASCII chars).
Also I agree that if u'%s' % chr(0x80) works when the default encoding is not ASCII, then %c should work as well. Trying to decode it with the default encoding and possibly let the UnicodeDecodeError propagate seems a good solution to me (and performance shouldn't be a problem here, since apparently no one uses u'%c' with non-ASCII byte strings).
I will try to make another patch.
|
msg100027 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2010-02-24 11:03 |
Marc-André's remark was that if char<0x80 (the vast majority), it's not necessary to call any decode function: just copy the byte. Other cases (error, or non-default encoding) may use a slower path.
|
msg100028 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-02-24 11:07 |
Ok, adding a fast path shouldn't make the code more complicated, so I'll include it in the patch, thanks for the feedback.
|
msg100059 - (view) |
Author: Eric V. Smith (eric.smith) *  |
Date: 2010-02-24 18:03 |
I'm working on a similar issue for int.__format__('c'). What's not clear to me is why this doesn't work the same as chr(i). That is, shouldn't chr(i) == ('%c' % i) hold for i in range(256)? And if that's so, why not just copy chr's implementation:
if (x < 0 || x >= 256) {
PyErr_SetString(PyExc_ValueError,
"chr() arg not in range(256)");
return NULL;
}
s[0] = (char)x;
return PyString_FromStringAndSize(s, 1);
|
msg100060 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2010-02-24 18:09 |
The operation here is different:
u'%c' % chr(i), which involves a str->unicode conversion.
|
msg100062 - (view) |
Author: Eric V. Smith (eric.smith) *  |
Date: 2010-02-24 18:17 |
Of course. Sorry about that.
But then why not copy unichr()? It calls PyUnicode_FromOrdinal.
I guess my real question is:
Should "'%c' % i" be identical to "chr(i)" and should "u'%c' % i" be identical to "unichr(i)"?
And by "identical" I mean return the same results and throw the same errors (with a possible difference in the error text).
|
msg100063 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-02-24 18:25 |
At least from point of view, the difference between ints and chars is:
* "u'%c' % 0xB5" means "create a Unicode char with codepoint 0xB5", i.e. the Unicode char µ U+00B5 MICRO SIGN;
* "u'%c' % '\xB5'" means "create a Unicode char converting the byte '\xB5' to Unicode", i.e. an Е U+0415 CYRILLIC CAPITAL LETTER IE if the byte string is encoded in iso-8859-5 or ต U+0E15 THAI CHARACTER TO TAO if it's encoded in iso-8859-11, and so on for every different encoding.
|
msg100065 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-02-24 18:47 |
Just to clarify:
"u'%c' % some_int" should behave like unichr(some_int);
"u'%c' % some_chr" should behave like "u'%s' % some_chr".
|
msg100070 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-02-24 21:40 |
Ezio Melotti wrote:
> Just to clarify:
> "u'%c' % some_int" should behave like unichr(some_int);
> "u'%c' % some_chr" should behave like "u'%s' % some_chr".
That's correct.
I guess that in practice "u'%c' % some_chr" is not all that common.
Otherwise, the fact that it doesn't raise an error if some_chr
is not compatible to the default encoding would have been noticed
earlier on.
|
msg100087 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-02-25 14:46 |
The latest patch (issue7649v4.diff) checks if the char is ASCII or non-ASCII and then, if the char is ASCII, it converts it directly to Unicode, otherwise it tries to decode it using the default encoding, raising a UnicodeDecodeError if the decoding fails.
I tested it setting iso-8859-1 and utf-8 as default encoding and the behavior was consistent with "%s", however the tests assume that the default encoding is always ASCII, so they failed (both the tests included in the patch and others in test_unicode). I'm not sure if in this case they should be changed/skipped or not.
(Also http://docs.python.org/c-api/unicode.html#built-in-codecs says that "Setting encoding to NULL causes the default encoding to be used which is ASCII.", but this is not always true. If you think it should be fixed I'll do it in a separate commit.)
|
msg100091 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-02-25 16:43 |
Ezio Melotti wrote:
>
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
>
> The latest patch (issue7649v4.diff) checks if the char is ASCII or non-ASCII and then, if the char is ASCII, it converts it directly to Unicode, otherwise it tries to decode it using the default encoding, raising a UnicodeDecodeError if the decoding fails.
Thanks. The patch looks good now... but doesn't apply cleanly anymore,
since your first version has already made it into trunk and the 2.6 branch.
> I tested it setting iso-8859-1 and utf-8 as default encoding and the behavior was consistent with "%s", however the tests assume that the default encoding is always ASCII, so they failed (both the tests included in the patch and others in test_unicode). I'm not sure if in this case they should be changed/skipped or not.
I think that's fine. While we do still allow setting the default
to something other than ASCII in 2.x, we don't support such tricks,
so there's no need to test for them.
> (Also http://docs.python.org/c-api/unicode.html#built-in-codecs says that "Setting encoding to NULL causes the default encoding to be used which is ASCII.", but this is not always true. If you think it should be fixed I'll do it in a separate commit.)
The last part of that sentence should be removed.
|
msg100092 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-02-25 17:06 |
The patch should apply cleanly on the latest revision of trunk (if you look at the diff it removes the code of the first version).
About the doc, I think it would be better to say that usually it's ASCII (or UTF-8 on Py3 -- Py3 doc should be fixed too), but it might be something different if it's changed using sys.setdefaultencoding().
|
msg100093 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-02-25 17:13 |
Ezio Melotti wrote:
>
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
>
> The patch should apply cleanly on the latest revision of trunk (if you look at the diff it removes the code of the first version).
Ah, sorry, my fault.
> About the doc, I think it would be better to say that usually it's ASCII (or UTF-8 on Py3 -- Py3 doc should be fixed too), but it might be something different if it's changed using sys.setdefaultencoding().
I guess it's better to open a separate ticket for that.
|
msg100095 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2010-02-25 17:58 |
issue7649v4.diff has been committed in r78449 (trunk) and r78450 (release26-maint).
|
|
Date |
User |
Action |
Args |
2022-04-11 14:56:56 | admin | set | github: 51898 |
2010-02-25 17:58:08 | ezio.melotti | set | status: open -> closed resolution: fixed messages:
+ msg100095
stage: patch review -> resolved |
2010-02-25 17:13:29 | lemburg | set | messages:
+ msg100093 |
2010-02-25 17:06:38 | ezio.melotti | set | messages:
+ msg100092 |
2010-02-25 16:43:15 | lemburg | set | messages:
+ msg100091 |
2010-02-25 14:46:19 | ezio.melotti | set | files:
+ issue7649v4.diff
messages:
+ msg100087 stage: resolved -> patch review |
2010-02-24 21:40:13 | lemburg | set | messages:
+ msg100070 |
2010-02-24 18:47:14 | ezio.melotti | set | messages:
+ msg100065 |
2010-02-24 18:25:59 | ezio.melotti | set | messages:
+ msg100063 |
2010-02-24 18:17:55 | eric.smith | set | messages:
+ msg100062 |
2010-02-24 18:09:27 | amaury.forgeotdarc | set | messages:
+ msg100060 |
2010-02-24 18:03:05 | eric.smith | set | messages:
+ msg100059 |
2010-02-24 11:07:07 | ezio.melotti | set | messages:
+ msg100028 |
2010-02-24 11:03:12 | amaury.forgeotdarc | set | messages:
+ msg100027 |
2010-02-24 10:59:33 | ezio.melotti | set | messages:
+ msg100025 |
2010-02-24 10:50:15 | lemburg | set | messages:
+ msg100024 |
2010-02-24 10:39:33 | amaury.forgeotdarc | set | messages:
+ msg100023 |
2010-02-24 10:02:42 | lemburg | set | messages:
+ msg100021 |
2010-02-24 09:44:00 | amaury.forgeotdarc | set | nosy:
+ amaury.forgeotdarc messages:
+ msg100018
|
2010-02-24 08:15:41 | lemburg | set | status: closed -> open
nosy:
+ lemburg messages:
+ msg100010
resolution: fixed -> (no value) |
2010-02-24 06:43:35 | ezio.melotti | set | stage: patch review -> resolved |
2010-02-23 23:23:38 | vstinner | set | status: open -> closed resolution: fixed messages:
+ msg99973
|
2010-02-23 13:51:36 | eric.smith | set | messages:
+ msg99914 |
2010-02-23 13:45:41 | ezio.melotti | set | files:
+ issue7649v3.diff
messages:
+ msg99912 |
2010-02-23 06:18:15 | ezio.melotti | set | messages:
+ msg99900 |
2010-02-22 22:20:56 | eric.smith | set | messages:
+ msg99856 |
2010-02-22 19:00:08 | ezio.melotti | set | keywords:
+ needs review files:
+ issue7649v2.diff messages:
+ msg99811
|
2010-01-21 11:43:48 | vstinner | set | messages:
+ msg98108 |
2010-01-07 13:35:56 | ezio.melotti | set | files:
+ issue7649.diff
messages:
+ msg97354 |
2010-01-07 13:02:37 | flox | set | nosy:
+ flox messages:
+ msg97353
|
2010-01-07 11:55:58 | eric.smith | set | messages:
+ msg97350 |
2010-01-07 11:49:16 | ezio.melotti | set | messages:
+ msg97349 |
2010-01-07 02:50:47 | eric.smith | set | priority: high -> normal keywords:
+ easy stage: test needed -> patch review |
2010-01-07 02:48:37 | eric.smith | set | messages:
+ msg97340 |
2010-01-07 02:01:52 | vstinner | set | messages:
+ msg97339 |
2010-01-07 01:55:45 | vstinner | set | files:
+ unicode_formatchar.patch
nosy:
+ vstinner messages:
+ msg97338
keywords:
+ patch |
2010-01-07 01:46:04 | eric.smith | set | nosy:
+ doerwalter
|
2010-01-07 01:41:57 | eric.smith | set | messages:
+ msg97336 |
2010-01-07 01:41:23 | eric.smith | set | nosy:
+ eric.smith
|
2010-01-07 01:07:59 | ezio.melotti | create | |