This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: "u'%c' % char" broken for chars in range '\x80'-'\xFF'
Type: behavior Stage: resolved
Components: Interpreter Core, Unicode Versions: Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: amaury.forgeotdarc, doerwalter, eric.smith, ezio.melotti, flox, lemburg, vstinner
Priority: normal Keywords: easy, needs review, patch

Created on 2010-01-07 01:07 by ezio.melotti, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
unicode_formatchar.patch vstinner, 2010-01-07 01:55
issue7649.diff ezio.melotti, 2010-01-07 13:35 Patch to raise an error + unittests
issue7649v2.diff ezio.melotti, 2010-02-22 19:00 Patch to raise an error + unittests
issue7649v3.diff ezio.melotti, 2010-02-23 13:45 Patch to raise an error + unittests
issue7649v4.diff ezio.melotti, 2010-02-25 14:46 Patch to raise an error + unittests
Messages (35)
msg97334 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-01-07 01:07
On Py2.x u'%c' % char returns the wrong result if char is in range '\x80'-'\xFF':
>>> u'%c' % '\x7f'
u'\x7f'  # correct
>>> u'%c' % '\x80'
u'\uff80'  # broken
>>> u'%c' % '\xFF'
u'\uffff'  # broken

This is what happens whit %s and 0x80:
>>> u'%s' % '\x80'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal
not in range(128)
>>> u'%c' % 0x80
u'\x80'

u'%c' % '\x80' should raise the same error raised by u'%s' % '\x80'.
msg97336 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2010-01-07 01:41
This looks like a signed vs. unsigned char problem. What platform are you on?
msg97338 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-01-07 01:55
The problem is the cast char => Py_UNICODE in formatchar() function. char may be unsigned, but most of the time it's signed. signed => unsigned cast extend the sign. Example: (unsigned short)(signed char)0x80 gives 0xFF80.

Attached patch fixes the issue and includes an unit test.
msg97339 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-01-07 02:01
The problem is specific to Python 2.x. With Python3, "%c" expects one unicode character (eg. "a").

My patch fixes the char => Py_UNICODE conversion, but raising an error is maybe better to be consistent with u"%s" % "\x80" (and prepare the migration the Python3).
msg97340 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2010-01-07 02:48
Shouldn't it work the same as it does for integers?

>>> u'%c' % 0x7f
u'\x7f'
>>> u'%c' % '\x7f'
u'\x7f'
>>> u'%c' % 0x80
u'\x80'
>>> u'%c' % '\x80'
u'\uff80'

That would imply to me it shouldn't be an error, it should just return u'\x80'.
msg97349 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-01-07 11:49
If we allow it to work on 2.7 the code will break:
1) when ported to Py3, where mixing bytes strings and unicode is not allowed;
2) when used on Py<2.7, where the behavior is broken;
3) when converted to str.format, where only ints are accepted.

If we raise an error, the user will have to either use unicode strings or ints and he will avoid further problems. Moreover, the fact that no one noticed this problems means that is not a common operation, so no one probably expects it to work anyway and raising an error could even help to find the problem if someone used %c in older versions.
msg97350 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2010-01-07 11:55
There's no perfect answer. And since we've gotten this far without anyone caring, and 2.7 is basically the end of life for this issue, perhaps doing nothing is the best course. Any change we make will affect code that runs in both 2.6 and 2.7, doing nothing preserves compatibility on the 2.x side.
msg97353 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-07 13:02
Tested on 2.5...

~ $ python2.5
Python 2.5.2 (r252:60911, Jan  4 2009, 21:59:32) 
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'%c' % '\x80'
u'\Uffffff80'
>>> 

On 2.7 (trunk) I get same behaviour as described on msg97334.
msg97354 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-01-07 13:35
First patch that makes u'%c' % '\x80' raise a UnicodeDecodeError.

I could reproduce the problem on Linux 32/64bit, Windows 32bit and Python from 2.4 to 2.7. The '\Uffffff80' is returned by wide builds.
msg98108 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-01-21 11:43
@Ezio: Your patch leaks a reference: PyUnicode_FromString(...) is not destroyed (Py_DECREF) on success.
msg99811 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-02-22 19:00
New patch (issue7649v2.diff) with refleak fixed and improved unittests.
msg99856 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2010-02-22 22:20
Could you add a comment on why you're calling PyUnicode_FromString and then throwing away the result? I believe it's so you'll get the same error checking as PyUnicode_FromString, but it's sufficiently tricky that I think it deserves a comment.

Now that I think about it, having done the conversion in PyUnicode_FromString, why not use:
buf[0] = PyUnicode_AS_UNICODE(s)[0];
? This way you skip the cast and you know for a fact you're using the same interpretation as PyUnicode_FromString, having used its output.

Also, since you know the size you may as well call PyUnicode_FromStringAndSize. It's trivially faster and makes the intent clearer, I think.
msg99900 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-02-23 06:18
Indeed some comments would be helpful, I'll add them.
I also tried already to reuse 's' and extract the first char using unicode_getitem, but it returns a PyObject and anyway it's probably more expensive/complicated than calling PyString_AS_STRING again. I'll try with [0] and/or with PyUnicode_FromStringAndSize and make a new patch this afternoon.
msg99912 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-02-23 13:45
Here's a new patch that uses PyUnicode_FromStringAndSize and PyUnicode_AS_UNICODE(s)[0]. I also added some comments and tested it and it seems to work fine.
msg99914 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2010-02-23 13:51
The patch looks good to me, in that it implements your desired functionality well.

I haven't been following the issue closely enough to know whether or not this new functionality is the right way to go. (I'm not saying it's not, just that I don't have an opinion on it.)
msg99973 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-02-23 23:23
Commited: r78392 (r78394).
msg100010 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-02-24 08:15
> --- python/trunk/Objects/unicodeobject.c	(original)
> > +++ python/trunk/Objects/unicodeobject.c	Wed Feb 24 00:16:07 2010
> > @@ -8170,6 +8170,7 @@
> >             size_t buflen,
> >             PyObject *v)
> >  {
> > +    PyObject *s;
> >      /* presume that the buffer is at least 2 characters long */
> >      if (PyUnicode_Check(v)) {
> >          if (PyUnicode_GET_SIZE(v) != 1)
> > @@ -8180,7 +8181,14 @@
> >      else if (PyString_Check(v)) {
> >          if (PyString_GET_SIZE(v) != 1)
> >              goto onError;
> > -        buf[0] = (Py_UNICODE)PyString_AS_STRING(v)[0];
> > +        /* #7649: if the char is a non-ascii (i.e. in range(0x80,0x100)) byte
> > +           string, "u'%c' % char" should fail with a UnicodeDecodeError */
> > +        s = PyUnicode_FromStringAndSize(PyString_AS_STRING(v), 1);
> > +        /* if the char is not decodable return -1 */
> > +        if (s == NULL)
> > +            return -1;
> > +        buf[0] = PyUnicode_AS_UNICODE(s)[0];
> > +        Py_DECREF(s);

That's a *very* inefficient way of doing this.

Could you please check for chars above 0x7f first and then use
PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize()
API (this API should not have been backported from the Python 3.x
in Python 2.6, but it's too late to change that now) !

Thanks.
msg100018 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-02-24 09:44
> Could you please check for chars above 0x7f first and then use
> PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API

I concur: PyUnicode_FromStringAndSize() decodes with utf-8 whereas the expected conversion char->unicode should use the default encoding (ascii).
But why is it necessary to check for chars above 0x7f?

> (this API should not have been backported from the Python 3.x
> in Python 2.6,
This function is still useful when the chars come from a C string literal in the source code (btw there should be something about the encoding used in C files). But it's not always correctly used even in 3.x, in posixmodule.c for example.
msg100021 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-02-24 10:02
Amaury Forgeot d'Arc wrote:
> 
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
> 
>> Could you please check for chars above 0x7f first and then use
>> PyUnicode_Decode() instead of the PyUnicode_FromStringAndSize() API
> 
> I concur: PyUnicode_FromStringAndSize() decodes with utf-8 whereas the expected conversion char->unicode should use the default encoding (ascii).
> But why is it necessary to check for chars above 0x7f?

The Python default encoding has to be ASCII compatible,
so it's better to use a short-cut for pure-ASCII characters
and avoid the complete round-trip via a temporary Unicode
object.

>> (this API should not have been backported from the Python 3.x
>> in Python 2.6,
> This function is still useful when the chars come from a C string literal in the source code (btw there should be something about the encoding used in C files). But it's not always correctly used even in 3.x, in posixmodule.c for example.

The function is a really just yet another interface to the
PyUnicode_DecodeUTF8() API and it's name is misleading in that:

Python 2.x uses the default encoding for converting strings without
known encoding to Unicode, the docs for the API say that
it decodes Latin-1 (!) and the interface makes it looks like
a drop-in replacement for PyString_FromStringAndSize() which
it isn't for Python 2.x.

For Python 3.x, the default encoding is fixed to UTF-8, so the
situation is different (though the docs are still wrong),
however I don't see the advantage of using a less explicit
name over the direct use of PyUnicode_DecodeUTF8().
msg100023 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-02-24 10:39
> > But why is it necessary to check for chars above 0x7f?
> The Python default encoding has to be ASCII compatible,
Yes, but it is not necessarily as strict.
for example, after I manage to set the default encoding to latin-1,
u"%s" % chr(0x80) works; I suppose %c should do the same.
msg100024 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-02-24 10:50
Amaury Forgeot d'Arc wrote:
> 
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
> 
>>> But why is it necessary to check for chars above 0x7f?
>> The Python default encoding has to be ASCII compatible,
> Yes, but it is not necessarily as strict.
> for example, after I manage to set the default encoding to latin-1,
> u"%s" % chr(0x80) works; I suppose %c should do the same.

Right and it will.

I only meant the check as fast-path for characters <0x80. All
other characters would need to go through PyUnicode_Decode().

We can use such a fast-path, since the default encoding has to
be ASCII compatible.
msg100025 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-02-24 10:59
I was trying to decode mainly to get a UnicodeDecodeError automatically.
If I check if the char is not in the ASCII range (i.e. >0x7F) I think I'd have to set the error message for the UnicodeDecodeError manually and possibly duplicate it (unless we use a different error message that says that %c accepts only ASCII chars).
Also I agree that if u'%s' % chr(0x80) works when the default encoding is not ASCII, then %c should work as well. Trying to decode it with the default encoding and possibly let the UnicodeDecodeError propagate seems a good solution to me (and performance shouldn't be a problem here, since apparently no one uses u'%c' with non-ASCII byte strings).
I will try to make another patch.
msg100027 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-02-24 11:03
Marc-André's remark was that if char<0x80 (the vast majority), it's not necessary to call any decode function: just copy the byte. Other cases (error, or non-default encoding) may use a slower path.
msg100028 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-02-24 11:07
Ok, adding a fast path shouldn't make the code more complicated, so I'll include it in the patch, thanks for the feedback.
msg100059 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2010-02-24 18:03
I'm working on a similar issue for int.__format__('c'). What's not clear to me is why this doesn't work the same as chr(i).  That is, shouldn't chr(i) == ('%c' % i) hold for i in range(256)? And if that's so, why not just copy chr's implementation:
	if (x < 0 || x >= 256) {
		PyErr_SetString(PyExc_ValueError,
				"chr() arg not in range(256)");
		return NULL;
	}
	s[0] = (char)x;
	return PyString_FromStringAndSize(s, 1);
msg100060 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-02-24 18:09
The operation here is different:
u'%c' % chr(i), which involves a str->unicode conversion.
msg100062 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2010-02-24 18:17
Of course. Sorry about that.

But then why not copy unichr()? It calls PyUnicode_FromOrdinal.

I guess my real question is:

Should "'%c' % i" be identical to "chr(i)" and should "u'%c' % i" be identical to "unichr(i)"?

And by "identical" I mean return the same results and throw the same errors (with a possible difference in the error text).
msg100063 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-02-24 18:25
At least from point of view, the difference between ints and chars is:
* "u'%c' % 0xB5" means "create a Unicode char with codepoint 0xB5", i.e. the Unicode char µ U+00B5 MICRO SIGN;
* "u'%c' % '\xB5'" means "create a Unicode char converting the byte '\xB5' to Unicode", i.e. an Е U+0415 CYRILLIC CAPITAL LETTER IE if the byte string is encoded in iso-8859-5 or ต U+0E15 THAI CHARACTER TO TAO if it's encoded in iso-8859-11, and so on for every different encoding.
msg100065 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-02-24 18:47
Just to clarify:
"u'%c' % some_int" should behave like unichr(some_int);
"u'%c' % some_chr" should behave like "u'%s' % some_chr".
msg100070 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-02-24 21:40
Ezio Melotti wrote:
> Just to clarify:
> "u'%c' % some_int" should behave like unichr(some_int);
> "u'%c' % some_chr" should behave like "u'%s' % some_chr".

That's correct.

I guess that in practice "u'%c' % some_chr" is not all that common.
Otherwise, the fact that it doesn't raise an error if some_chr
is not compatible to the default encoding would have been noticed
earlier on.
msg100087 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-02-25 14:46
The latest patch (issue7649v4.diff) checks if the char is ASCII or non-ASCII and then, if the char is ASCII, it converts it directly to Unicode, otherwise it tries to decode it using the default encoding, raising a UnicodeDecodeError if the decoding fails.

I tested it setting iso-8859-1 and utf-8 as default encoding and the behavior was consistent with "%s", however the tests assume that the default encoding is always ASCII, so they failed (both the tests included in the patch and others in test_unicode). I'm not sure if in this case they should be changed/skipped or not.

(Also http://docs.python.org/c-api/unicode.html#built-in-codecs says that "Setting encoding to NULL causes the default encoding to be used which is ASCII.", but this is not always true. If you think it should be fixed I'll do it in a separate commit.)
msg100091 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-02-25 16:43
Ezio Melotti wrote:
> 
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
> 
> The latest patch (issue7649v4.diff) checks if the char is ASCII or non-ASCII and then, if the char is ASCII, it converts it directly to Unicode, otherwise it tries to decode it using the default encoding, raising a UnicodeDecodeError if the decoding fails.

Thanks. The patch looks good now... but doesn't apply cleanly anymore,
since your first version has already made it into trunk and the 2.6 branch.

> I tested it setting iso-8859-1 and utf-8 as default encoding and the behavior was consistent with "%s", however the tests assume that the default encoding is always ASCII, so they failed (both the tests included in the patch and others in test_unicode). I'm not sure if in this case they should be changed/skipped or not.

I think that's fine. While we do still allow setting the default
to something other than ASCII in 2.x, we don't support such tricks,
so there's no need to test for them.

> (Also http://docs.python.org/c-api/unicode.html#built-in-codecs says that "Setting encoding to NULL causes the default encoding to be used which is ASCII.", but this is not always true. If you think it should be fixed I'll do it in a separate commit.)

The last part of that sentence should be removed.
msg100092 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-02-25 17:06
The patch should apply cleanly on the latest revision of trunk (if you look at the diff it removes the code of the first version).

About the doc, I think it would be better to say that usually it's ASCII (or UTF-8 on Py3 -- Py3 doc should be fixed too), but it might be something different if it's changed using sys.setdefaultencoding().
msg100093 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-02-25 17:13
Ezio Melotti wrote:
> 
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
> 
> The patch should apply cleanly on the latest revision of trunk (if you look at the diff it removes the code of the first version).

Ah, sorry, my fault.

> About the doc, I think it would be better to say that usually it's ASCII (or UTF-8 on Py3 -- Py3 doc should be fixed too), but it might be something different if it's changed using sys.setdefaultencoding().

I guess it's better to open a separate ticket for that.
msg100095 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-02-25 17:58
issue7649v4.diff has been committed in r78449 (trunk) and r78450 (release26-maint).
History
Date User Action Args
2022-04-11 14:56:56adminsetgithub: 51898
2010-02-25 17:58:08ezio.melottisetstatus: open -> closed
resolution: fixed
messages: + msg100095

stage: patch review -> resolved
2010-02-25 17:13:29lemburgsetmessages: + msg100093
2010-02-25 17:06:38ezio.melottisetmessages: + msg100092
2010-02-25 16:43:15lemburgsetmessages: + msg100091
2010-02-25 14:46:19ezio.melottisetfiles: + issue7649v4.diff

messages: + msg100087
stage: resolved -> patch review
2010-02-24 21:40:13lemburgsetmessages: + msg100070
2010-02-24 18:47:14ezio.melottisetmessages: + msg100065
2010-02-24 18:25:59ezio.melottisetmessages: + msg100063
2010-02-24 18:17:55eric.smithsetmessages: + msg100062
2010-02-24 18:09:27amaury.forgeotdarcsetmessages: + msg100060
2010-02-24 18:03:05eric.smithsetmessages: + msg100059
2010-02-24 11:07:07ezio.melottisetmessages: + msg100028
2010-02-24 11:03:12amaury.forgeotdarcsetmessages: + msg100027
2010-02-24 10:59:33ezio.melottisetmessages: + msg100025
2010-02-24 10:50:15lemburgsetmessages: + msg100024
2010-02-24 10:39:33amaury.forgeotdarcsetmessages: + msg100023
2010-02-24 10:02:42lemburgsetmessages: + msg100021
2010-02-24 09:44:00amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg100018
2010-02-24 08:15:41lemburgsetstatus: closed -> open

nosy: + lemburg
messages: + msg100010

resolution: fixed -> (no value)
2010-02-24 06:43:35ezio.melottisetstage: patch review -> resolved
2010-02-23 23:23:38vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg99973
2010-02-23 13:51:36eric.smithsetmessages: + msg99914
2010-02-23 13:45:41ezio.melottisetfiles: + issue7649v3.diff

messages: + msg99912
2010-02-23 06:18:15ezio.melottisetmessages: + msg99900
2010-02-22 22:20:56eric.smithsetmessages: + msg99856
2010-02-22 19:00:08ezio.melottisetkeywords: + needs review
files: + issue7649v2.diff
messages: + msg99811
2010-01-21 11:43:48vstinnersetmessages: + msg98108
2010-01-07 13:35:56ezio.melottisetfiles: + issue7649.diff

messages: + msg97354
2010-01-07 13:02:37floxsetnosy: + flox
messages: + msg97353
2010-01-07 11:55:58eric.smithsetmessages: + msg97350
2010-01-07 11:49:16ezio.melottisetmessages: + msg97349
2010-01-07 02:50:47eric.smithsetpriority: high -> normal
keywords: + easy
stage: test needed -> patch review
2010-01-07 02:48:37eric.smithsetmessages: + msg97340
2010-01-07 02:01:52vstinnersetmessages: + msg97339
2010-01-07 01:55:45vstinnersetfiles: + unicode_formatchar.patch

nosy: + vstinner
messages: + msg97338

keywords: + patch
2010-01-07 01:46:04eric.smithsetnosy: + doerwalter
2010-01-07 01:41:57eric.smithsetmessages: + msg97336
2010-01-07 01:41:23eric.smithsetnosy: + eric.smith
2010-01-07 01:07:59ezio.melotticreate