Message121563
On Friday 19 November 2010 20:42:53 you wrote:
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> I don't understand Victor's argument in msg115889. According to UTF-8 RFC,
> <http://www.ietf.org/rfc/rfc2279.txt>:
>
> - US-ASCII values do not appear otherwise in a UTF-8 encoded
> character stream. This provides compatibility with file systems
> or other software (e.g. the printf() function in C libraries) that
> parse based on US-ASCII values but are transparent to other
> values.
Most C functions, including printf(), work on multi*byte* strings, not on (wide)
character strings, whereas PyUnicode_FromFormatV() converts the format string
(bytes) to unicode (characters). If you would like a comparison in C, it's
like printf()+mbstowcs() in the same function.
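To make the comparison concrete, here is a rough sketch of "printf()+mbstowcs() in the same function". The helper name format_wide() and the fixed "%s=%d" format are hypothetical and only for illustration; this is not how PyUnicode_FromFormatV() is implemented. Note that mbstowcs() decodes according to the current locale, so the result depends on the bytes being valid in that locale's encoding.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (illustration only): format into a byte buffer with
   snprintf(), then decode those bytes to wide characters with mbstowcs().
   Returns the number of wide characters written (excluding the terminator),
   or (size_t)-1 on a formatting, truncation, or decoding error. */
size_t
format_wide(wchar_t *dst, size_t dstlen, const char *name, int value)
{
    char bytes[128];

    assert(dst != NULL && name != NULL);

    /* Step 1: printf-style formatting, bytes in and bytes out. */
    int n = snprintf(bytes, sizeof bytes, "%s=%d", name, value);
    if (n < 0 || (size_t)n >= sizeof bytes)
        return (size_t)-1;

    /* Step 2: bytes -> characters, using the current locale's encoding. */
    return mbstowcs(dst, bytes, dstlen);
}
```

Calling format_wide(buf, 32, "x", 42) would fill buf with the wide string L"x=42". The second step is where the encoding of the bytes starts to matter, which is the difference from a pure bytes-to-bytes printf().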
> This means that printf-like formatters should not care whether the format
> string is in UTF-8, Latin1, or any other ASCII-compatible 8-bit encoding.
That may be true with bytes input and bytes output (e.g. PyString_FromFormatV()
in Python 2), but it's no longer true with bytes input and str output (e.g.
PyUnicode_FromFormatV() in Python 3).
> It is also fairly simple to sanity-check for UTF-8 if necessary, but in
> case of PyUnicode_FromFormat, the resulting string will be decoded as
> UTF-8, so all characters in the format string will be checked anyways.
I chose to use ASCII instead of UTF-8 because a UTF-8 decoder is long (210
lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas an ASCII decoder
is just "unicode_char = (Py_UNICODE)byte;" preceded by an if to check that
0 <= byte <= 127.
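As a minimal sketch of how small such an ASCII decoder is, assuming a hypothetical helper ascii_decode() and uint32_t standing in for Py_UNICODE (neither is taken from the CPython sources):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only, not CPython code): decode `len`
   bytes of ASCII into code points. `out` must have room for `len` entries.
   Returns the number of code points written, or -1 if a byte is outside
   the range 0..127. */
ptrdiff_t
ascii_decode(const unsigned char *in, size_t len, uint32_t *out)
{
    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < len; i++) {
        if (in[i] > 127)
            return -1;              /* non-ASCII byte: reject the input */
        out[i] = (uint32_t)in[i];   /* an ASCII byte is its own code point */
    }
    return (ptrdiff_t)len;
}
```

With this sketch, the bytes "abc" decode to three code points, while a UTF-8 sequence such as "\xc3\xa9" is rejected at its first byte.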
Nobody noticed my change simply because the whole Python code base only passes
ASCII strings as the format argument of PyUnicode_FromFormatV().
Victor

Date | User | Action | Args
2010-11-19 20:06:23 | vstinner | set | recipients: + vstinner, amaury.forgeotdarc, belopolsky, ezio.melotti
2010-11-19 20:06:13 | vstinner | link | issue9769 messages
2010-11-19 20:06:13 | vstinner | create |