msg115542 |
Author: STINNER Victor (vstinner) *  |
Date: 2010-09-03 23:52 |
I'm trying to document the encoding of all bytes arguments of the C API: see #9738. I tried to understand which encoding is used by PyUnicode_FromFormat*() (and PyErr_Format(), which calls PyUnicode_FromFormatV()). It looks like ISO-8859-1, see unicodeobject.c near line 1106:
for (f = format; *f; f++) {
    if (*f == '%') {
        ...
    } else
        *s++ = *f;   /* <~~~~ here */
}
... oh wait, it doesn't work for non-ascii text! Test in gdb:
(gdb) print _PyObject_Dump(PyUnicodeUCS2_FromFormat("iso-8859-1:\xd0\xff"))
object : 'iso-8859-1:\uffd0\uffff'
type : str
refcount: 1
address : 0x83d5d80
b'\xd0\xff' is decoded as '\uffd0\uffff' :-( It's a bug.
--
PyUnicode_FromFormatV() should either raise an error on a non-ascii format character, or decode it correctly as... ISO-8859-1 or something else. It's difficult to support multi-byte encodings (like utf-8); ISO-8859-1 is fine. If we choose to raise an error, how can the user format a non-ascii string? Using its_unicode_format.format(...arguments...) or its_unicode_format % arguments? Is it easy to call these methods in C?
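The "raise an error" option amounts to a pre-scan of the format string. A minimal standalone sketch (the helper name is hypothetical, not part of the CPython API):

```c
#include <stddef.h>

/* Hypothetical helper, not CPython API: return the offset of the first
 * non-ASCII byte in the format string, or -1 if it is pure ASCII.
 * A guard in PyUnicode_FromFormatV() could call this and raise an
 * error instead of silently widening the bytes. */
static ptrdiff_t first_non_ascii(const char *format)
{
    const unsigned char *p = (const unsigned char *)format;
    for (; *p; p++) {
        if (*p > 127)
            return p - (const unsigned char *)format;
    }
    return -1;
}
```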
|
msg115609 |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2010-09-04 19:26 |
2 remarks:
- PyUnicode_FromFormat("%s", text) expects a utf-8 buffer.
- Very recently (r84472, r84485), some C files of CPython source code were converted to utf-8. And most of the time, the format given to PyUnicode_FromFormat is a string literal.
So it would make sense for PyUnicode_FromFormat to consider the format string as encoded in utf-8. This is worth asking on python-dev though.
|
msg115820 |
Author: STINNER Victor (vstinner) *  |
Date: 2010-09-07 23:21 |
> PyUnicode_FromFormat("%s", text) expects a utf-8 buffer.
Really? I don't see how "*s++ = *f;" (where s is Py_UNICODE* and f is char*) can decode utf-8. It looks more like ISO-8859-1.
> Very recently (r84472, r84485), some C files of CPython source code
> were converted to utf-8
Python source code (C and Python) is written in ASCII, except maybe some headers or some tests written in Python with a #coding:xxx header (or without the header but in utf-8, for Python 3). I don't think that any C file calls PyErr_Format() or PyUnicode_FromFormat(V)() with a non-ascii format string.
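A standalone sketch of why "*s++ = *f;" behaves like an ISO-8859-1 decode: each byte is widened to one code point, so the UTF-8 encoding of U+00E9 (0xc3 0xa9) comes out as the two code points U+00C3 and U+00A9. The function name is made up for illustration:

```c
/* Naive widening, as in "*s++ = *f;" in unicodeobject.c: each byte
 * becomes one code point, which is exactly an ISO-8859-1 decode.
 * The cast to unsigned char avoids the sign-extension seen in the
 * gdb dump above ('\uffd0'). */
static void widen(const char *f, unsigned int *s)
{
    for (; *f; f++)
        *s++ = (unsigned char)*f;
    *s = 0;
}
```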
|
msg115825 |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2010-09-07 23:54 |
> > PyUnicode_FromFormat("%s", text) expects a utf-8 buffer.
> Really?
The *format* looks more like latin-1, right. But the payload of a "%s" item is decoded as utf-8.
> I don't think that a C file calls PyErr_Format() or
> PyUnicode_FromFormat(V)() with a non-ascii format string.
At the moment, it's true. My remark is that utf-8 tends to be applied to all kinds of files; if someone decides one day that non-ascii chars are allowed in (some) string constants, they will be stored in utf-8.
|
msg115889 |
Author: STINNER Victor (vstinner) *  |
Date: 2010-09-08 18:17 |
> My remark is that utf-8 tend to be applied to all kind of files;
> if someone once decide that non-ascii chars are allowed in (some)
> string constants, they will be stored in utf-8.
In this case, it would be better to raise an error on a non-ascii byte (character) in the format string. It's better to raise an error than to interpret utf-8 as iso-8859-1 (mojibake!). Since nobody noticed this bug (PyUnicode_FromFormat/PyErr_Format expects ISO-8859-1), I suppose that nobody uses a non-ASCII format string: the format string is always ascii.
Python builtin errors are not localised. If an application uses gettext, I suppose that the error will be raised in the Python code, not in the C API.
Attached patch changes PyUnicode_FromFormatV() (and so PyUnicode_FromFormat() and PyErr_Format()) to reject non-ascii bytes (characters) in the format string. I added a test and documented the format string encoding (which is now ASCII). See also #9738 for the documentation about function argument encoding.
|
msg116045 |
Author: STINNER Victor (vstinner) *  |
Date: 2010-09-10 21:37 |
@amaury: Do you agree to reject non-ascii bytes?
TODO: document format encoding in Doc/c-api/*.rst.
|
msg116046 |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2010-09-10 21:52 |
Yes, let's be conservative and reject non-ascii bytes in the format string.
|
msg116071 |
Author: STINNER Victor (vstinner) *  |
Date: 2010-09-11 00:55 |
Fixed by r84704 in Python 3.2.
|
msg121561 |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2010-11-19 19:42 |
I don't understand Victor's argument in msg115889. According to UTF-8 RFC, <http://www.ietf.org/rfc/rfc2279.txt>:
- US-ASCII values do not appear otherwise in a UTF-8 encoded
character stream. This provides compatibility with file systems
or other software (e.g. the printf() function in C libraries) that
parse based on US-ASCII values but are transparent to other
values.
This means that printf-like formatters should not care whether the format string is in UTF-8, Latin1, or any other ASCII-compatible 8-bit encoding. (Passing in multibyte encoding pretending to be bytes would of course lead to havoc, but C type system will protect you from that.)
It is also fairly simple to sanity-check for UTF-8 if necessary, but in the case of PyUnicode_FromFormat, the resulting string will be decoded as UTF-8, so all characters in the format string will be checked anyway.
Am I missing something?
|
msg121563 |
Author: STINNER Victor (vstinner) *  |
Date: 2010-11-19 20:06 |
On Friday 19 November 2010 20:42:53 you wrote:
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> I don't understand Victor's argument in msg115889. According to UTF-8 RFC,
> <http://www.ietf.org/rfc/rfc2279.txt>:
>
> - US-ASCII values do not appear otherwise in a UTF-8 encoded
> character stream. This provides compatibility with file systems
> or other software (e.g. the printf() function in C libraries) that
> parse based on US-ASCII values but are transparent to other
> values.
Most C functions, including printf(), work on multi*byte* strings, not on (wide)
character strings. Whereas PyUnicode_FromFormatV() converts the format string
(bytes) to unicode (characters). If you would like a comparison in C, it's
like printf()+mbstowcs() in the same function.
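The printf()+mbstowcs() comparison can be made concrete as two explicit steps (a standalone illustration with a hypothetical helper, not CPython code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Sketch of the printf()+mbstowcs() analogy: format into a byte
 * buffer first, then convert the bytes to wide characters in a
 * second step.  PyUnicode_FromFormatV() does both at once, which is
 * why it has to pick a decoding for the format bytes. */
static size_t format_to_wide(wchar_t *out, size_t n,
                             const char *fmt, int value)
{
    char buf[128];
    snprintf(buf, sizeof buf, fmt, value);  /* step 1: bytes */
    return mbstowcs(out, buf, n);           /* step 2: decode per locale */
}
```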
> This means that printf-like formatters should not care whether the format
> string is in UTF-8, Latin1, or any other ASCII-compatible 8-bit encoding.
It's maybe true with bytes input and bytes output (e.g. PyString_FromFormatV()
of Python 2), but it's no longer true with bytes input and str output (e.g.
PyUnicode_FromFormatV() of Python 3).
> It is also fairly simple to sanity-check for UTF-8 if necessary, but in
> case of PyUnicode_FromFormat, the resulting string will be decoded as
> UTF-8, so all characters in the format string will be checked anyways.
I chose to use ASCII instead of UTF-8, because a UTF-8 decoder is long (210
lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas the ASCII
decode is just "unicode_char = (Py_UNICODE)byte;" plus a check that
0 <= byte <= 127.
Nobody noticed my change just because the whole Python code base only uses
ASCII strings for the format argument of PyUnicode_FromFormatV().
Victor
|
msg121568 |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2010-11-19 20:58 |
On Fri, Nov 19, 2010 at 3:06 PM, STINNER Victor <report@bugs.python.org> wrote:
> .. Whereas PyUnicode_FromFormatV() converts the format string
> (bytes) to unicode (characters). If you would like a comparaison in C, it's
> like printf()+mbstowcs() in the same function.
>
I see. So it is really the
else
*s++ = *f;
that surreptitiously widens the characters.
..
> I chose to use ASCII instead of UTF-8, because a UTF-8 decoder is long (210
> lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas the ASCII
> decode is just "unicode_char = (Py_UNICODE)byte;" plus a check that
> 0 <= byte <= 127.
I don't think we need 210 lines to replace "*s++ = *f" with proper
UTF-8 logic. Even if we do, the code can be shared with
PyUnicode_DecodeUTF8 and a UTF-8 iterator may be a welcome addition to
Python C API.
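A minimal iterator of the kind suggested here can indeed be sketched in far fewer than 210 lines. The function name is hypothetical, and the sketch deliberately skips 4-byte sequences and overlong/surrogate rejection, which is where much of the extra complexity of a full decoder such as PyUnicode_DecodeUTF8Stateful() comes from:

```c
/* Hypothetical minimal UTF-8 iterator (not the CPython API): decode
 * one code point starting at *p, advance *p past it, and return the
 * code point, or -1 on a malformed sequence.  Handles 1- to 3-byte
 * sequences only. */
static long utf8_next(const unsigned char **p)
{
    const unsigned char *s = *p;
    long cp;
    int extra;

    if (s[0] < 0x80)                { cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else
        return -1;                  /* 4-byte lead or stray byte */

    for (int i = 1; i <= extra; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;              /* bad continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *p = s + 1 + extra;
    return cp;
}
```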
|
msg121582 |
Author: STINNER Victor (vstinner) *  |
Date: 2010-11-20 00:15 |
On Friday 19 November 2010 21:58:25 you wrote:
> > I chose to use ASCII instead of UTF-8, because a UTF-8 decoder is long
> > (210 lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas
> > the ASCII decode is just "unicode_char = (Py_UNICODE)byte;" plus a check
> > that 0 <= byte <= 127.
>
> I don't think we need 210 lines to replace "*s++ = *f" with proper
> UTF-8 logic. Even if we do, the code can be shared with
> PyUnicode_DecodeUTF8 and a UTF-8 iterator may be a welcome addition to
> Python C API.
Why should we do that? An ASCII format is just fine. Remember that
PyUnicode_FromFormatV() is part of the C API. I don't think that anyone would
use a non-ASCII format in C. If someone does that, (s)he should open a new
issue for it :-) But I don't think that we should make the code more complex
if it's just useless.
Victor
|
msg121693 |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2010-11-20 17:38 |
On Fri, Nov 19, 2010 at 7:15 PM, STINNER Victor <report@bugs.python.org> wrote:
..
>
> Why should we do that? ASCII format is just fine. Remember that
> PyUnicode_FromFormatV() is part of the C API. I don't think that anyone would
> use non-ASCII format in C.
Why not? The Gettext manual is full of examples with internationalized format strings.
> If someone does that, (s)he should open a new issue
> for that :-)
Why a new issue? The title of this issue fits perfectly, and IMO it is
hard to argue that to "handle non-ascii text correctly" means to raise
an error when non-ascii text is encountered.
|
msg129838 |
Author: STINNER Victor (vstinner) *  |
Date: 2011-03-02 00:25 |
I still consider that ASCII format strings should be enough for everyone.
> > If someone does that, (s)he should open a new issue for that :-)
>
> Why new issue?
Ok, so I just remove myself from the nosy list.
|
msg174080 |
Author: Mariano Reingart (reingart) |
Date: 2012-10-28 21:22 |
(moved from issue #16343)
Working in an internationalization proposal <http://python.org.ar/pyar/TracebackInternationalizationProposal> (issue #16344)
I've stopped at this problem (#9769), where multi-byte encodings (like utf-8) are not supported by PyUnicode_FromFormatV().
Besides my proposal, I think utf-8 should be supported for consistency with the other unicode functions, like PyUnicode_FromString() or even unicode_fromformat_arg().
Attached is a patch that:
- enhanced the iterator to detect multi-byte sequences, with sanity checks on start & continuation bytes
- replaced unicode_write_cstr() with PyUnicode_DecodeUTF8Stateful()
- added tests
Hope it helps, this is my first patch for cpython and my C skills are a bit rusty, so excuse me if there is any newbie glitch
|
|
Date | User | Action | Args |
2022-04-11 14:57:06 | admin | set | github: 53978 |
2015-10-02 21:12:14 | vstinner | set | status: open -> closed resolution: out of date |
2014-06-29 23:59:58 | belopolsky | set | assignee: belopolsky -> versions:
+ Python 3.5, - Python 3.4 |
2012-10-28 21:22:43 | reingart | set | files:
+ pyunicode_fromformat_utf8.patch nosy:
+ reingart messages:
+ msg174080
|
2012-10-28 20:18:19 | chris.jerdonek | set | versions:
+ Python 3.4, - Python 3.3 |
2012-10-28 20:14:48 | chris.jerdonek | link | issue16343 superseder |
2011-03-02 01:08:58 | vstinner | set | nosy:
- vstinner
|
2011-03-02 00:25:43 | vstinner | set | nosy:
amaury.forgeotdarc, belopolsky, vstinner, ezio.melotti messages:
+ msg129838 |
2011-03-02 00:22:11 | belopolsky | set | priority: normal -> low nosy:
amaury.forgeotdarc, belopolsky, vstinner, ezio.melotti |
2011-03-02 00:21:23 | belopolsky | set | assignee: belopolsky type: enhancement resolution: fixed -> (no value) nosy:
amaury.forgeotdarc, belopolsky, vstinner, ezio.melotti versions:
+ Python 3.3, - Python 3.2 |
2010-11-20 17:38:45 | belopolsky | set | messages:
+ msg121693 |
2010-11-20 00:15:55 | vstinner | set | messages:
+ msg121582 |
2010-11-19 20:58:22 | belopolsky | set | messages:
+ msg121568 |
2010-11-19 20:06:13 | vstinner | set | messages:
+ msg121563 |
2010-11-19 19:55:32 | ezio.melotti | set | nosy:
+ ezio.melotti
|
2010-11-19 19:42:51 | belopolsky | set | status: closed -> open nosy:
+ belopolsky messages:
+ msg121561
|
2010-09-11 00:55:11 | vstinner | set | status: open -> closed resolution: fixed messages:
+ msg116071
|
2010-09-10 21:52:12 | amaury.forgeotdarc | set | messages:
+ msg116046 |
2010-09-10 21:37:58 | vstinner | set | messages:
+ msg116045 |
2010-09-08 18:17:16 | vstinner | set | files:
+ pyunicode_fromformat_ascii.patch keywords:
+ patch messages:
+ msg115889
|
2010-09-07 23:54:09 | amaury.forgeotdarc | set | messages:
+ msg115825 |
2010-09-07 23:21:46 | vstinner | set | messages:
+ msg115820 |
2010-09-04 19:26:13 | amaury.forgeotdarc | set | nosy:
+ amaury.forgeotdarc messages:
+ msg115609
|
2010-09-03 23:52:59 | vstinner | create | |