Issue 7649: "u'%c' % char" broken for chars in range '\x80'-'\xFF'

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/51898

classification

Title:	"u'%c' % char" broken for chars in range '\x80'-'\xFF'
Type:	behavior	Stage:	resolved
Components:	Interpreter Core, Unicode	Versions:	Python 2.7, Python 2.6

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	ezio.melotti	Nosy List:	amaury.forgeotdarc, doerwalter, eric.smith, ezio.melotti, flox, lemburg, vstinner
Priority:	normal	Keywords:	easy, needs review, patch

Created on 2010-01-07 01:07 by ezio.melotti, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
unicode_formatchar.patch	vstinner, 2010-01-07 01:55
issue7649.diff	ezio.melotti, 2010-01-07 13:35	Patch to raise an error + unittests
issue7649v2.diff	ezio.melotti, 2010-02-22 19:00	Patch to raise an error + unittests
issue7649v3.diff	ezio.melotti, 2010-02-23 13:45	Patch to raise an error + unittests
issue7649v4.diff	ezio.melotti, 2010-02-25 14:46	Patch to raise an error + unittests

Messages (35)
msg97334 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-01-07 01:07
On Py2.x u'%c' % char returns the wrong result if char is in range '\x80'-'\xFF': >>> u'%c' % '\x7f' u'\x7f' # correct >>> u'%c' % '\x80' u'\uff80' # broken >>> u'%c' % '\xFF' u'\uffff' # broken This is what happens whit %s and 0x80: >>> u'%s' % '\x80' Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128) >>> u'%c' % 0x80 u'\x80' u'%c' % '\x80' should raise the same error raised by u'%s' % '\x80'.
msg97336 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2010-01-07 01:41
This looks like a signed vs. unsigned char problem. What platform are you on?
msg97338 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-01-07 01:55
The problem is the cast char => Py_UNICODE in formatchar() function. char may be unsigned, but most of the time it's signed. signed => unsigned cast extend the sign. Example: (unsigned short)(signed char)0x80 gives 0xFF80. Attached patch fixes the issue and includes an unit test.
msg97339 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-01-07 02:01
The problem is specific to Python 2.x. With Python3, "%c" expects one unicode character (eg. "a"). My patch fixes the char => Py_UNICODE conversion, but raising an error is maybe better to be consistent with u"%s" % "\x80" (and prepare the migration the Python3).
msg97340 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2010-01-07 02:48
Shouldn't it work the same as it does for integers? >>> u'%c' % 0x7f u'\x7f' >>> u'%c' % '\x7f' u'\x7f' >>> u'%c' % 0x80 u'\x80' >>> u'%c' % '\x80' u'\uff80' That would imply to me it shouldn't be an error, it should just return u'\x80'.
msg97349 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-01-07 11:49
If we allow it to work on 2.7 the code will break: 1) when ported to Py3, where mixing bytes strings and unicode is not allowed; 2) when used on Py<2.7, where the behavior is broken; 3) when converted to str.format, where only ints are accepted. If we raise an error, the user will have to either use unicode strings or ints and he will avoid further problems. Moreover, the fact that no one noticed this problems means that is not a common operation, so no one probably expects it to work anyway and raising an error could even help to find the problem if someone used %c in older versions.
msg97350 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2010-01-07 11:55
There's no perfect answer. And since we've gotten this far without anyone caring, and 2.7 is basically the end of life for this issue, perhaps doing nothing is the best course. Any change we make will affect code that runs in both 2.6 and 2.7, doing nothing preserves compatibility on the 2.x side.
msg97353 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-01-07 13:02
Tested on 2.5... ~ $ python2.5 Python 2.5.2 (r252:60911, Jan 4 2009, 21:59:32) [GCC 4.3.2] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> u'%c' % '\x80' u'\Uffffff80' >>> On 2.7 (trunk) I get same behaviour as described on msg97334.
msg97354 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-01-07 13:35
First patch that makes u'%c' % '\x80' raise a UnicodeDecodeError. I could reproduce the problem on Linux 32/64bit, Windows 32bit and Python from 2.4 to 2.7. The '\Uffffff80' is returned by wide builds.
msg98108 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-01-21 11:43
@Ezio: Your patch leaks a reference: PyUnicode_FromString(...) is not destroyed (Py_DECREF) on success.
msg99811 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-02-22 19:00
New patch (issue7649v2.diff) with refleak fixed and improved unittests.
msg99856 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2010-02-22 22:20
Could you add a comment on why you're calling PyUnicode_FromString and then throwing away the result? I believe it's so you'll get the same error checking as PyUnicode_FromString, but it's sufficiently tricky that I think it deserves a comment. Now that I think about it, having done the conversion in PyUnicode_FromString, why not use: buf[0] = PyUnicode_AS_UNICODE(s)[0]; ? This way you skip the cast and you know for a fact you're using the same interpretation as PyUnicode_FromString, having used its output. Also, since you know the size you may as well call PyUnicode_FromStringAndSize. It's trivially faster and makes the intent clearer, I think.
msg99900 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-02-23 06:18
Indeed some comments would be helpful, I'll add them. I also tried already to reuse 's' and extract the first char using unicode_getitem, but it returns a PyObject and anyway it's probably more expensive/complicated than calling PyString_AS_STRING again. I'll try with [0] and/or with PyUnicode_FromStringAndSize and make a new patch this afternoon.
msg99912 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-02-23 13:45
Here's a new patch that uses PyUnicode_FromStringAndSize and PyUnicode_AS_UNICODE(s)[0]. I also added some comments and tested it and it seems to work fine.
msg99914 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2010-02-23 13:51
The patch looks good to me, in that it implements your desired functionality well. I haven't been following the issue closely enough to know whether or not this new functionality is the right way to go. (I'm not saying it's not, just that I don't have an opinion on it.)
msg99973 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-02-23 23:23
Commited: r78392 (r78394).
msg100010 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-02-24 08:15
> --- python/trunk/Objects/unicodeobject.c (original) > > +++ python/trunk/Objects/unicodeobject.c Wed Feb 24 00:16:07 2010 > > @@ -8170,6 +8170,7 @@ > > size_t buflen, > > PyObject v) > > { > > + PyObject s; > > /* presume that the buffer is at least 2 characters long / > > if (PyUnicode_Check(v)) { > > if (PyUnicode_GET_SIZE(v) != 1) > > @@ -8180,7 +8181,14 @@ > > else if (PyString_Check(v)) { > > if (PyString_GET_SIZE(v) != 1) > > goto onError; > > - buf[0] = (Py_UNICODE)PyString_AS_STRING(v)[0]; > > + / #7649: if the char is a non-ascii (i.e. in range(0x80,0x100)) byte > > + string, "u'%c' % char" should fail with a UnicodeDecodeError / > > + s = PyUnicode_FromStringAndSize(PyString_AS_STRING(v), 1); > > + / if the char is not decodable return -1 / > > + if (s == NULL) > > + return -1; > > + buf[0] = PyUnicode_AS_UNICODE(s)[0]; > > + Py_DECREF(s); That's a very* inefficient way of doing this. Could you please check for chars above 0x7f first and then use PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API (this API should not have been backported from the Python 3.x in Python 2.6, but it's too late to change that now) ! Thanks.
msg100018 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-02-24 09:44
> Could you please check for chars above 0x7f first and then use > PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API I concur: PyUnicode_FromStringAndSize() decodes with utf-8 whereas the expected conversion char->unicode should use the default encoding (ascii). But why is it necessary to check for chars above 0x7f? > (this API should not have been backported from the Python 3.x > in Python 2.6, This function is still useful when the chars come from a C string literal in the source code (btw there should be something about the encoding used in C files). But it's not always correctly used even in 3.x, in posixmodule.c for example.
msg100021 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-02-24 10:02
Amaury Forgeot d'Arc wrote: > > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment: > >> Could you please check for chars above 0x7f first and then use >> PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API > > I concur: PyUnicode_FromStringAndSize() decodes with utf-8 whereas the expected conversion char->unicode should use the default encoding (ascii). > But why is it necessary to check for chars above 0x7f? The Python default encoding has to be ASCII compatible, so it's better to use a short-cut for pure-ASCII characters and avoid the complete round-trip via a temporary Unicode object. >> (this API should not have been backported from the Python 3.x >> in Python 2.6, > This function is still useful when the chars come from a C string literal in the source code (btw there should be something about the encoding used in C files). But it's not always correctly used even in 3.x, in posixmodule.c for example. The function is a really just yet another interface to the PyUnicode_DecodeUTF8() API and it's name is misleading in that: Python 2.x uses the default encoding for converting strings without known encoding to Unicode, the docs for the API say that it decodes Latin-1 (!) and the interface makes it looks like a drop-in replacement for PyString_FromStringAndSize() which it isn't for Python 2.x. For Python 3.x, the default encoding is fixed to UTF-8, so the situation is different (though the docs are still wrong), however I don't see the advantage of using a less explicit name over the direct use of PyUnicode_DecodeUTF8().
msg100023 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-02-24 10:39
> > But why is it necessary to check for chars above 0x7f? > The Python default encoding has to be ASCII compatible, Yes, but it is not necessarily as strict. for example, after I manage to set the default encoding to latin-1, u"%s" % chr(0x80) works; I suppose %c should do the same.
msg100024 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-02-24 10:50
Amaury Forgeot d'Arc wrote: > > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment: > >>> But why is it necessary to check for chars above 0x7f? >> The Python default encoding has to be ASCII compatible, > Yes, but it is not necessarily as strict. > for example, after I manage to set the default encoding to latin-1, > u"%s" % chr(0x80) works; I suppose %c should do the same. Right and it will. I only meant the check as fast-path for characters <0x80. All other characters would need to go through PyUnicode_Decode(). We can use such a fast-path, since the default encoding has to be ASCII compatible.
msg100025 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-02-24 10:59
I was trying to decode mainly to get a UnicodeDecodeError automatically. If I check if the char is not in the ASCII range (i.e. >0x7F) I think I'd have to set the error message for the UnicodeDecodeError manually and possibly duplicate it (unless we use a different error message that says that %c accepts only ASCII chars). Also I agree that if u'%s' % chr(0x80) works when the default encoding is not ASCII, then %c should work as well. Trying to decode it with the default encoding and possibly let the UnicodeDecodeError propagate seems a good solution to me (and performance shouldn't be a problem here, since apparently no one uses u'%c' with non-ASCII byte strings). I will try to make another patch.
msg100027 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-02-24 11:03
Marc-André's remark was that if char<0x80 (the vast majority), it's not necessary to call any decode function: just copy the byte. Other cases (error, or non-default encoding) may use a slower path.
msg100028 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-02-24 11:07
Ok, adding a fast path shouldn't make the code more complicated, so I'll include it in the patch, thanks for the feedback.
msg100059 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2010-02-24 18:03
I'm working on a similar issue for int.__format__('c'). What's not clear to me is why this doesn't work the same as chr(i). That is, shouldn't chr(i) == ('%c' % i) hold for i in range(256)? And if that's so, why not just copy chr's implementation: if (x < 0 \|\| x >= 256) { PyErr_SetString(PyExc_ValueError, "chr() arg not in range(256)"); return NULL; } s[0] = (char)x; return PyString_FromStringAndSize(s, 1);
msg100060 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-02-24 18:09
The operation here is different: u'%c' % chr(i), which involves a str->unicode conversion.
msg100062 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2010-02-24 18:17
Of course. Sorry about that. But then why not copy unichr()? It calls PyUnicode_FromOrdinal. I guess my real question is: Should "'%c' % i" be identical to "chr(i)" and should "u'%c' % i" be identical to "unichr(i)"? And by "identical" I mean return the same results and throw the same errors (with a possible difference in the error text).
msg100063 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-02-24 18:25
At least from point of view, the difference between ints and chars is: * "u'%c' % 0xB5" means "create a Unicode char with codepoint 0xB5", i.e. the Unicode char µ U+00B5 MICRO SIGN; * "u'%c' % '\xB5'" means "create a Unicode char converting the byte '\xB5' to Unicode", i.e. an Е U+0415 CYRILLIC CAPITAL LETTER IE if the byte string is encoded in iso-8859-5 or ต U+0E15 THAI CHARACTER TO TAO if it's encoded in iso-8859-11, and so on for every different encoding.
msg100065 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-02-24 18:47
Just to clarify: "u'%c' % some_int" should behave like unichr(some_int); "u'%c' % some_chr" should behave like "u'%s' % some_chr".
msg100070 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-02-24 21:40
Ezio Melotti wrote: > Just to clarify: > "u'%c' % some_int" should behave like unichr(some_int); > "u'%c' % some_chr" should behave like "u'%s' % some_chr". That's correct. I guess that in practice "u'%c' % some_chr" is not all that common. Otherwise, the fact that it doesn't raise an error if some_chr is not compatible to the default encoding would have been noticed earlier on.
msg100087 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-02-25 14:46
The latest patch (issue7649v4.diff) checks if the char is ASCII or non-ASCII and then, if the char is ASCII, it converts it directly to Unicode, otherwise it tries to decode it using the default encoding, raising a UnicodeDecodeError if the decoding fails. I tested it setting iso-8859-1 and utf-8 as default encoding and the behavior was consistent with "%s", however the tests assume that the default encoding is always ASCII, so they failed (both the tests included in the patch and others in test_unicode). I'm not sure if in this case they should be changed/skipped or not. (Also http://docs.python.org/c-api/unicode.html#built-in-codecs says that "Setting encoding to NULL causes the default encoding to be used which is ASCII.", but this is not always true. If you think it should be fixed I'll do it in a separate commit.)
msg100091 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-02-25 16:43
Ezio Melotti wrote: > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > > The latest patch (issue7649v4.diff) checks if the char is ASCII or non-ASCII and then, if the char is ASCII, it converts it directly to Unicode, otherwise it tries to decode it using the default encoding, raising a UnicodeDecodeError if the decoding fails. Thanks. The patch looks good now... but doesn't apply cleanly anymore, since your first version has already made it into trunk and the 2.6 branch. > I tested it setting iso-8859-1 and utf-8 as default encoding and the behavior was consistent with "%s", however the tests assume that the default encoding is always ASCII, so they failed (both the tests included in the patch and others in test_unicode). I'm not sure if in this case they should be changed/skipped or not. I think that's fine. While we do still allow setting the default to something other than ASCII in 2.x, we don't support such tricks, so there's no need to test for them. > (Also http://docs.python.org/c-api/unicode.html#built-in-codecs says that "Setting encoding to NULL causes the default encoding to be used which is ASCII.", but this is not always true. If you think it should be fixed I'll do it in a separate commit.) The last part of that sentence should be removed.
msg100092 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-02-25 17:06
The patch should apply cleanly on the latest revision of trunk (if you look at the diff it removes the code of the first version). About the doc, I think it would be better to say that usually it's ASCII (or UTF-8 on Py3 -- Py3 doc should be fixed too), but it might be something different if it's changed using sys.setdefaultencoding().
msg100093 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-02-25 17:13
Ezio Melotti wrote: > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > > The patch should apply cleanly on the latest revision of trunk (if you look at the diff it removes the code of the first version). Ah, sorry, my fault. > About the doc, I think it would be better to say that usually it's ASCII (or UTF-8 on Py3 -- Py3 doc should be fixed too), but it might be something different if it's changed using sys.setdefaultencoding(). I guess it's better to open a separate ticket for that.
msg100095 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-02-25 17:58
issue7649v4.diff has been committed in r78449 (trunk) and r78450 (release26-maint).

History
Date	User	Action	Args
2022-04-11 14:56:56	admin	set	github: 51898
2010-02-25 17:58:08	ezio.melotti	set	status: open -> closed resolution: fixed messages: + msg100095 stage: patch review -> resolved
2010-02-25 17:13:29	lemburg	set	messages: + msg100093
2010-02-25 17:06:38	ezio.melotti	set	messages: + msg100092
2010-02-25 16:43:15	lemburg	set	messages: + msg100091
2010-02-25 14:46:19	ezio.melotti	set	files: + issue7649v4.diff messages: + msg100087 stage: resolved -> patch review
2010-02-24 21:40:13	lemburg	set	messages: + msg100070
2010-02-24 18:47:14	ezio.melotti	set	messages: + msg100065
2010-02-24 18:25:59	ezio.melotti	set	messages: + msg100063
2010-02-24 18:17:55	eric.smith	set	messages: + msg100062
2010-02-24 18:09:27	amaury.forgeotdarc	set	messages: + msg100060
2010-02-24 18:03:05	eric.smith	set	messages: + msg100059
2010-02-24 11:07:07	ezio.melotti	set	messages: + msg100028
2010-02-24 11:03:12	amaury.forgeotdarc	set	messages: + msg100027
2010-02-24 10:59:33	ezio.melotti	set	messages: + msg100025
2010-02-24 10:50:15	lemburg	set	messages: + msg100024
2010-02-24 10:39:33	amaury.forgeotdarc	set	messages: + msg100023
2010-02-24 10:02:42	lemburg	set	messages: + msg100021
2010-02-24 09:44:00	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg100018
2010-02-24 08:15:41	lemburg	set	status: closed -> open nosy: + lemburg messages: + msg100010 resolution: fixed -> (no value)
2010-02-24 06:43:35	ezio.melotti	set	stage: patch review -> resolved
2010-02-23 23:23:38	vstinner	set	status: open -> closed resolution: fixed messages: + msg99973
2010-02-23 13:51:36	eric.smith	set	messages: + msg99914
2010-02-23 13:45:41	ezio.melotti	set	files: + issue7649v3.diff messages: + msg99912
2010-02-23 06:18:15	ezio.melotti	set	messages: + msg99900
2010-02-22 22:20:56	eric.smith	set	messages: + msg99856
2010-02-22 19:00:08	ezio.melotti	set	keywords: + needs review files: + issue7649v2.diff messages: + msg99811
2010-01-21 11:43:48	vstinner	set	messages: + msg98108
2010-01-07 13:35:56	ezio.melotti	set	files: + issue7649.diff messages: + msg97354
2010-01-07 13:02:37	flox	set	nosy: + flox messages: + msg97353
2010-01-07 11:55:58	eric.smith	set	messages: + msg97350
2010-01-07 11:49:16	ezio.melotti	set	messages: + msg97349
2010-01-07 02:50:47	eric.smith	set	priority: high -> normal keywords: + easy stage: test needed -> patch review
2010-01-07 02:48:37	eric.smith	set	messages: + msg97340
2010-01-07 02:01:52	vstinner	set	messages: + msg97339
2010-01-07 01:55:45	vstinner	set	files: + unicode_formatchar.patch nosy: + vstinner messages: + msg97338 keywords: + patch
2010-01-07 01:46:04	eric.smith	set	nosy: + doerwalter
2010-01-07 01:41:57	eric.smith	set	messages: + msg97336
2010-01-07 01:41:23	eric.smith	set	nosy: + eric.smith
2010-01-07 01:07:59	ezio.melotti	create