classification
Title: PyUnicode_FromFormatV() doesn't handle non-ascii text correctly
Type: enhancement Stage:
Components: Interpreter Core, Unicode Versions: Python 3.5
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, belopolsky, ezio.melotti, reingart
Priority: low Keywords: patch

Created on 2010-09-03 23:52 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
pyunicode_fromformat_ascii.patch vstinner, 2010-09-08 18:17
pyunicode_fromformat_utf8.patch reingart, 2012-10-28 21:22 PyUnicode_FromFormatV patch to use UTF-8
Messages (15)
msg115542 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-03 23:52
I'm trying to document the encoding of every bytes argument in the C API: see #9738. I tried to understand which encoding is used by PyUnicode_FromFormat*() (and by PyErr_Format(), which calls PyUnicode_FromFormatV()). It looks like ISO-8859-1, see unicodeobject.c near line 1106:

    for (f = format; *f; f++) {
        if (*f == '%') {
            ...
        } else
            *s++ = *f; <~~~~ here
    }

... oh wait, it doesn't work for non-ascii text! Test in gdb:

(gdb) print _PyObject_Dump(PyUnicodeUCS2_FromFormat("iso-8859-1:\xd0\xff"))
object  : 'iso-8859-1:\uffd0\uffff'
type    : str
refcount: 1
address : 0x83d5d80

b'\xd0\xff' is decoded as '\uffd0\uffff' :-( It's a bug.
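
A minimal, self-contained sketch of the widening seen in the gdb session above. It assumes that plain char is a signed 8-bit type and that Py_UNICODE is a 16-bit unsigned type (as in a UCS2 build); uint16_t stands in for Py_UNICODE, and both helper names are hypothetical, not CPython functions. The sign extension is written out arithmetically so that the result does not depend on the platform's char signedness.

```c
#include <stdint.h>

/* What "*s++ = *f" effectively does when plain char is signed:
   the byte is first read as a negative value, then converted to the
   16-bit unsigned target, which sign-extends it.
   Hypothetical helper; uint16_t stands in for a UCS2 Py_UNICODE. */
static uint16_t widen_sign_extended(unsigned char byte)
{
    /* Value the byte would have as a signed 8-bit char. */
    int as_signed_char = (byte >= 0x80) ? (int)byte - 256 : (int)byte;
    return (uint16_t)as_signed_char;
}

/* What an ISO-8859-1 decode would do instead: go through
   unsigned char, so bytes 0x80..0xFF map to U+0080..U+00FF. */
static uint16_t widen_latin1(unsigned char byte)
{
    return (uint16_t)byte;
}
```

With these helpers, the byte 0xd0 becomes 0xffd0 when sign-extended but 0x00d0 when decoded as ISO-8859-1, and ASCII bytes are unchanged by both.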

--

PyUnicode_FromFormatV() should either raise an error on a non-ASCII character in the format string, or decode it correctly as... ISO-8859-1 or something else. Supporting multibyte encodings (like UTF-8) is difficult; ISO-8859-1 is fine. If we choose to raise an error, how can the user format a non-ASCII string? Using its_unicode_format.format(...arguments...) or its_unicode_format % arguments? Is it easy to call these methods in C?
msg115609 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-09-04 19:26
2 remarks: 
- PyUnicode_FromFormat("%s", text) expects a utf-8 buffer.
- Very recently (r84472, r84485), some C files of CPython source code were converted to utf-8.  And most of the time, the format given to PyUnicode_FromFormat is a string literal.

So it would make sense for PyUnicode_FromFormat to consider the format string as encoded in utf-8.  This is worth asking on python-dev though.
msg115820 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-07 23:21
> PyUnicode_FromFormat("%s", text) expects a utf-8 buffer.

Really? I don't see how "*s++ = *f;" (where s is Py_UNICODE* and f is char*) can decode utf-8. It looks more like ISO-8859-1.

> Very recently (r84472, r84485), some C files of CPython source code
> were converted to utf-8

Python source code (C and Python) is written in ASCII except maybe some headers or some tests written in Python with #coding:xxx header (or without the header, but in utf-8, for Python3). I don't think that a C file calls PyErr_Format() or PyUnicode_FromFormat(V)() with a non-ascii format string.
msg115825 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-09-07 23:54
> > PyUnicode_FromFormat("%s", text) expects a utf-8 buffer.
> Really?

The *format* looks more like latin-1, right. But the payload of a "%s" item is decoded as utf-8.

> I don't think that a C file calls PyErr_Format() or
> PyUnicode_FromFormat(V)() with a non-ascii format string.

At the moment, that's true. My point is that UTF-8 tends to be applied to all kinds of files; if someone one day decides that non-ASCII characters are allowed in (some) string constants, they will be stored in UTF-8.
msg115889 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-08 18:17
> My point is that UTF-8 tends to be applied to all kinds of files;
> if someone one day decides that non-ASCII characters are allowed in (some)
> string constants, they will be stored in UTF-8.

In this case, it is better to raise an error on a non-ASCII byte (character) in the format string than to interpret UTF-8 as ISO-8859-1 (mojibake!). Since nobody noticed this bug (PyUnicode_FromFormat/PyErr_Format expect ISO-8859-1), I suppose that nobody uses non-ASCII format strings: the format string is always ASCII.

Python builtin errors are not localised. If an application uses gettext, I suppose that the error will be raised in the Python code, not in the C API.

The attached patch changes PyUnicode_FromFormatV (and so PyUnicode_FromFormat and PyErr_Format) to reject any non-ASCII byte (character) in the format string. I added a test and documented the format string encoding (which is now ASCII). See also #9738 for the documentation about function argument encoding.
msg116045 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 21:37
@amaury: Do you agree to reject non-ascii bytes?

TODO: document format encoding in Doc/c-api/*.rst.
msg116046 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-09-10 21:52
Yes, let's be conservative and reject non-ascii bytes in the format string.
msg116071 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-11 00:55
Fixed by r84704 in Python 3.2.
msg121561 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-19 19:42
I don't understand Victor's argument in msg115889.  According to UTF-8 RFC, <http://www.ietf.org/rfc/rfc2279.txt>:

   -  US-ASCII values do not appear otherwise in a UTF-8 encoded
      character stream.  This provides compatibility with file systems
      or other software (e.g. the printf() function in C libraries) that
      parse based on US-ASCII values but are transparent to other
      values.

This means that printf-like formatters should not care whether the format string is in UTF-8, Latin1, or any other ASCII-compatible 8-bit encoding.  (Passing in multibyte encoding pretending to be bytes would of course lead to havoc, but C type system will protect you from that.)
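
The RFC property quoted above can be illustrated with a bytes-in, bytes-out scan. This is a sketch only: count_percent_bytes is a hypothetical helper, not a CPython or libc function, and it simply counts '%' bytes rather than parsing full directives. Because every byte of a multibyte UTF-8 sequence is 0x80 or higher, none of them can be mistaken for the ASCII '%' (0x25).

```c
#include <stddef.h>

/* Count the '%' bytes in a NUL-terminated format string.
   Hypothetical helper for illustration: a byte-oriented scanner like
   this never needs to know whether the non-ASCII bytes are UTF-8,
   Latin-1 or another ASCII-compatible 8-bit encoding. */
static size_t count_percent_bytes(const char *fmt)
{
    size_t count = 0;

    for (; *fmt != '\0'; fmt++) {
        if (*fmt == '%')
            count++;
    }
    return count;
}
```

For a UTF-8 format such as "caf\xc3\xa9: %s (%d)" the scan finds exactly the two directives; the bytes 0xc3 and 0xa9 are passed over untouched.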

It is also fairly simple to sanity-check for UTF-8 if necessary, but in the case of PyUnicode_FromFormat, the resulting string will be decoded as UTF-8, so all characters in the format string will be checked anyway.

Am I missing something?
msg121563 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-19 20:06
On Friday 19 November 2010 20:42:53 you wrote:
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> I don't understand Victor's argument in msg115889.  According to UTF-8 RFC,
> <http://www.ietf.org/rfc/rfc2279.txt>:
> 
>    -  US-ASCII values do not appear otherwise in a UTF-8 encoded
>       character stream.  This provides compatibility with file systems
>       or other software (e.g. the printf() function in C libraries) that
>       parse based on US-ASCII values but are transparent to other
>       values.

Most C functions, including printf, work on multi*byte* strings, not on (wide) 
character strings, whereas PyUnicode_FromFormatV() converts the format string 
(bytes) to unicode (characters). If you would like a comparison in C, it's 
like printf()+mbstowcs() in the same function.

> This means that printf-like formatters should not care whether the format
> string is in UTF-8, Latin1, or any other ASCII-compatible 8-bit encoding. 

It may be true with bytes input and bytes output (e.g. PyString_FromFormatV() 
in Python 2), but it's no longer true with bytes input and str output (e.g. 
PyUnicode_FromFormatV() in Python 3).

> It is also fairly simple to sanity-check for UTF-8 if necessary, but in
> case of PyUnicode_FromFormat, the resulting string will be decoded as
> UTF-8, so all characters in the format string will be checked anyways.

I chose ASCII instead of UTF-8 because a UTF-8 decoder is long (210 
lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas an ASCII decoder 
is just "unicode_char = (Py_UNICODE)byte;" preceded by an if that checks 
0 <= byte <= 127.
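
A rough sketch of the strict ASCII decode described above, written as standalone C rather than as the actual patch. It assumes a 16-bit Py_UNICODE (uint16_t stands in for it); decode_ascii_strict and its return convention are hypothetical, and the real code raises a Python exception instead of returning -1.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy a NUL-terminated byte string into a buffer of 16-bit code
   units, rejecting any byte outside 0..127.
   Returns the number of code units written, or -1 if a non-ASCII
   byte is found or the destination buffer is too small.
   Hypothetical helper; not the CPython implementation. */
static ptrdiff_t decode_ascii_strict(const char *src, uint16_t *dst,
                                     size_t dst_len)
{
    size_t i;

    for (i = 0; src[i] != '\0'; i++) {
        /* Go through unsigned char so the range check is meaningful
           even when plain char is signed. */
        unsigned char byte = (unsigned char)src[i];

        if (byte > 127 || i >= dst_len)
            return -1;
        dst[i] = (uint16_t)byte;
    }
    return (ptrdiff_t)i;
}
```

Called on "abc" this writes three code units; called on the "iso-8859-1:\xd0\xff" string from the original report it stops at the first non-ASCII byte and reports failure.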

Nobody noticed my change just because the whole Python code base only uses 
ASCII argument for the format argument of PyUnicode_FromFormatV().

Victor
msg121568 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-19 20:58
On Fri, Nov 19, 2010 at 3:06 PM, STINNER Victor <report@bugs.python.org> wrote:
> .. Whereas PyUnicode_FromFormatV() converts the format string
> (bytes) to unicode (characters). If you would like a comparison in C, it's
> like printf()+mbstowcs() in the same function.
>

I see.  So it is really the

        else
            *s++ = *f;

that surreptitiously widens the characters.

..
> I chose ASCII instead of UTF-8 because a UTF-8 decoder is long (210
> lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas an ASCII decoder
> is just "unicode_char = (Py_UNICODE)byte;" preceded by an if that checks
> 0 <= byte <= 127.

I don't think we need 210 lines to replace "*s++ = *f" with proper
UTF-8 logic.  Even if we do, the code can be shared with
PyUnicode_DecodeUTF8 and a UTF-8 iterator may be a welcome addition to
Python C API.
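
For a sense of scale, here is one way a compact "decode one code point" step could look. It is a sketch under stated assumptions, not the code from either attached patch and not an existing C API function: utf8_next is a hypothetical name, and it applies the stricter limits of the later UTF-8 definition (at most 4 bytes, nothing above U+10FFFF, no surrogates, no overlong forms) rather than the 6-byte forms allowed by RFC 2279.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the n bytes at s.
   On success, stores the code point in *cp and returns the number of
   bytes consumed (1..4). Returns 0 if the input is empty, truncated
   or malformed. Hypothetical helper for illustration only. */
static size_t utf8_next(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs `len` bytes;
       anything below it is an overlong encoding. */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                  /* ASCII fast path */
        *cp = s[0];
        return 1;
    }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else
        return 0;                       /* stray continuation or invalid lead byte */

    if (n < len)
        return 0;                       /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min_cp[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                       /* overlong, out of range, or surrogate */
    *cp = c;
    return len;
}
```

A formatter loop could call such a helper for each literal (non-'%') position in the format string and advance by the returned length, treating a return of 0 as an encoding error.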
msg121582 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-20 00:15
On Friday 19 November 2010 21:58:25 you wrote:
> > I chose ASCII instead of UTF-8 because a UTF-8 decoder is long
> > (210 lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas
> > an ASCII decoder is just "unicode_char = (Py_UNICODE)byte;" preceded
> > by an if that checks 0 <= byte <= 127.
> 
> I don't think we need 210 lines to replace "*s++ = *f" with proper
> UTF-8 logic.  Even if we do, the code can be shared with
> PyUnicode_DecodeUTF8 and a UTF-8 iterator may be a welcome addition to
> Python C API.

Why should we do that? ASCII format is just fine. Remember that 
PyUnicode_FromFormatV() is part of the C API. I don't think that anyone would 
use non-ASCII format in C. If someone does that, (s)he should open a new issue 
for that :-) But I don't think that we should make the code more complex if 
it's just useless.

Victor
msg121693 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-20 17:38
On Fri, Nov 19, 2010 at 7:15 PM, STINNER Victor <report@bugs.python.org> wrote:
..
>
> Why should we do that? ASCII format is just fine. Remember that
> PyUnicode_FromFormatV() is part of the C API. I don't think that anyone would
> use non-ASCII format in C.

Why not?  The gettext manual is full of examples with internationalized format strings.

> If someone does that, (s)he should open a new issue
> for that :-)

Why new issue?  The title of this issue fits perfectly and IMO it is
hard to argue that to "handle non-ascii text correctly" means to raise
an error when non-ascii text is encountered.
msg129838 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-02 00:25
I still consider that ASCII format strings should be enough for everyone.

> > If someone does that, (s)he should open a new issue for that :-)
>
> Why new issue?

Ok, so I just remove myself from the nosy list.
msg174080 - (view) Author: Mariano Reingart (reingart) Date: 2012-10-28 21:22
(moved from issue #16343)

While working on an internationalization proposal <http://python.org.ar/pyar/TracebackInternationalizationProposal> (issue #16344),
I got stuck on this problem (#9769): multibyte encodings (like UTF-8) are not supported by PyUnicode_FromFormatV().

Besides my proposal, I think UTF-8 should be supported for consistency with the other unicode functions, like PyUnicode_FromString() or even unicode_fromformat_arg().

Attached is a patch that:
- enhanced the iterator to detect multibyte sequences, with sanity checks about start & continuation bytes
- replaced unicode_write_cstr with PyUnicode_DecodeUTF8Stateful
- tests

I hope it helps. This is my first patch for CPython and my C skills are a bit rusty, so excuse me if there is any newbie glitch.
History
Date User Action Args
2022-04-11 14:57:06 admin set github: 53978
2015-10-02 21:12:14 vstinner set status: open -> closed
resolution: out of date
2014-06-29 23:59:58 belopolsky set assignee: belopolsky ->
versions: + Python 3.5, - Python 3.4
2012-10-28 21:22:43 reingart set files: + pyunicode_fromformat_utf8.patch
nosy: + reingart
messages: + msg174080
2012-10-28 20:18:19 chris.jerdonek set versions: + Python 3.4, - Python 3.3
2012-10-28 20:14:48 chris.jerdonek link issue16343 superseder
2011-03-02 01:08:58 vstinner set nosy: - vstinner
2011-03-02 00:25:43 vstinner set nosy: amaury.forgeotdarc, belopolsky, vstinner, ezio.melotti
messages: + msg129838
2011-03-02 00:22:11 belopolsky set priority: normal -> low
nosy: amaury.forgeotdarc, belopolsky, vstinner, ezio.melotti
2011-03-02 00:21:23 belopolsky set assignee: belopolsky
type: enhancement
resolution: fixed -> (no value)
nosy: amaury.forgeotdarc, belopolsky, vstinner, ezio.melotti
versions: + Python 3.3, - Python 3.2
2010-11-20 17:38:45 belopolsky set messages: + msg121693
2010-11-20 00:15:55 vstinner set messages: + msg121582
2010-11-19 20:58:22 belopolsky set messages: + msg121568
2010-11-19 20:06:13 vstinner set messages: + msg121563
2010-11-19 19:55:32 ezio.melotti set nosy: + ezio.melotti
2010-11-19 19:42:51 belopolsky set status: closed -> open
nosy: + belopolsky
messages: + msg121561
2010-09-11 00:55:11 vstinner set status: open -> closed
resolution: fixed
messages: + msg116071
2010-09-10 21:52:12 amaury.forgeotdarc set messages: + msg116046
2010-09-10 21:37:58 vstinner set messages: + msg116045
2010-09-08 18:17:16 vstinner set files: + pyunicode_fromformat_ascii.patch
keywords: + patch
messages: + msg115889
2010-09-07 23:54:09 amaury.forgeotdarc set messages: + msg115825
2010-09-07 23:21:46 vstinner set messages: + msg115820
2010-09-04 19:26:13 amaury.forgeotdarc set nosy: + amaury.forgeotdarc
messages: + msg115609
2010-09-03 23:52:59 vstinner create