Message 115542 - Python tracker



Author	vstinner
Recipients	vstinner
Date	2010-09-03.23:52:59
Message-id	<1283557981.28.0.85027629308.issue9769@psf.upfronthosting.co.za>
In-reply-to

Content

I'm trying to document the encoding of all bytes arguments of the C API: see #9738. I tried to understand which encoding is used by PyUnicode_FromFormat*() (and by PyErr_Format(), which calls PyUnicode_FromFormatV()). It looks like ISO-8859-1, see unicodeobject.c near line 1106:

    for (f = format; *f; f++) {
        if (*f == '%') {
            ...
        } else
            *s++ = *f; /* <~~~~ here */
    }

... oh wait, it doesn't work for non-ASCII text! Test in gdb:

(gdb) print _PyObject_Dump(PyUnicodeUCS2_FromFormat("iso-8859-1:\xd0\xff"))
object  : 'iso-8859-1:\uffd0\uffff'
type    : str
refcount: 1
address : 0x83d5d80

b'\xd0\xff' is decoded as '\uffd0\uffff' :-( It's a bug.

--

PyUnicode_FromFormatV() should either raise an error on a non-ASCII format character, or decode it correctly as... ISO-8859-1 or something else. Multibyte encodings (like UTF-8) are difficult to support; ISO-8859-1 is fine. If we choose to raise an error, how can the user format a non-ASCII string? Using its_unicode_format.format(...arguments...) or its_unicode_format % arguments? Is it easy to call these methods in C?

History
Date	User	Action	Args
2010-09-03 23:53:01	vstinner	set	recipients: + vstinner
2010-09-03 23:53:01	vstinner	set	messageid: <1283557981.28.0.85027629308.issue9769@psf.upfronthosting.co.za>
2010-09-03 23:52:59	vstinner	link	issue9769 messages
2010-09-03 23:52:59	vstinner	create