Message 121563 - Python tracker

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

Author	vstinner
Recipients	amaury.forgeotdarc, belopolsky, ezio.melotti, vstinner
Date	2010-11-19.20:06:13
SpamBayes Score	3.330669e-16
Marked as misclassified	No
Message-id	<201011192106.06983.victor.stinner@haypocalc.com>
In-reply-to	<1290195773.07.0.831312765946.issue9769@psf.upfronthosting.co.za>

Content

On Friday 19 November 2010 20:42:53 you wrote:
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> I don't understand Victor's argument in msg115889.  According to UTF-8 RFC,
> <http://www.ietf.org/rfc/rfc2279.txt>:
> 
>    -  US-ASCII values do not appear otherwise in a UTF-8 encoded
>       character stream.  This provides compatibility with file systems
>       or other software (e.g. the printf() function in C libraries) that
>       parse based on US-ASCII values but are transparent to other
>       values.

Most C functions, including printf(), work on multi*byte* strings, not on (wide)
character strings, whereas PyUnicode_FromFormatV() converts the format string
(bytes) to unicode (characters). If you would like a comparison in C, it's
like doing printf() + mbstowcs() in the same function.
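
To make the comparison concrete, here is a minimal sketch of "printf() + mbstowcs() in the same function". The helper name format_to_wide() and the fixed 256-byte scratch buffer are assumptions made for illustration only; this is not how PyUnicode_FromFormatV() is implemented, and the result of mbstowcs() depends on the current locale.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: format into a byte buffer first (the printf() step),
 * then decode those bytes to wide characters (the mbstowcs() step).
 * Returns the number of wide characters written, excluding the terminator,
 * or (size_t)-1 on error or truncation. */
size_t format_to_wide(wchar_t *out, size_t out_len, const char *format, ...)
{
    char bytes[256];   /* assumed scratch size, for illustration only */
    va_list ap;
    int n;
    size_t written;

    if (out_len == 0)
        return (size_t)-1;

    /* Step 1: bytes in, bytes out. */
    va_start(ap, format);
    n = vsnprintf(bytes, sizeof(bytes), format, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= sizeof(bytes))
        return (size_t)-1;

    /* Step 2: bytes in, characters out (locale-dependent decoding). */
    written = mbstowcs(out, bytes, out_len);
    if (written == (size_t)-1 || written >= out_len)
        return (size_t)-1;   /* invalid sequence, or no room for the terminator */
    return written;
}
```

With a wchar_t buf[32], calling format_to_wide(buf, 32, "%s=%d", "x", 42) should leave L"x=42" in buf. The point of the sketch is that the second step has to know the encoding of the bytes, which plain printf() never needs to.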

> This means that printf-like formatters should not care whether the format
> string is in UTF-8, Latin1, or any other ASCII-compatible 8-bit encoding. 

That may be true for bytes input and bytes output (e.g. PyString_FromFormatV()
in Python 2), but it is no longer true for bytes input and str output (e.g.
PyUnicode_FromFormatV() in Python 3).
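
As a rough illustration of the bytes-in, bytes-out case, the sketch below expands only "%s" and copies every other byte through unchanged, so it behaves the same whether the format string is UTF-8 or Latin-1. The function expand_percent_s() is a hypothetical helper written for this example, not part of any Python or C library API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bytes-to-bytes formatter: expands only "%s" and copies every
 * other byte unchanged, so non-ASCII bytes are never interpreted.
 * Returns 0 on success, -1 if the output buffer is too small. */
int expand_percent_s(const char *fmt, const char *arg, char *out, size_t out_len)
{
    size_t pos = 0;
    size_t arg_len = strlen(arg);

    while (*fmt != '\0') {
        if (fmt[0] == '%' && fmt[1] == 's') {
            /* Need room for the argument plus the final terminator. */
            if (arg_len >= out_len - pos)
                return -1;
            memcpy(out + pos, arg, arg_len);
            pos += arg_len;
            fmt += 2;
        } else {
            /* Any other byte, ASCII or not, is passed through as-is. */
            if (pos + 1 >= out_len)
                return -1;
            out[pos++] = *fmt++;
        }
    }
    if (pos >= out_len)
        return -1;
    out[pos] = '\0';
    return 0;
}
```

A formatter that produces characters instead of bytes cannot pass non-ASCII bytes through like this: it has to decide how to decode them, which is where the choice of encoding matters.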

> It is also fairly simple to sanity-check for UTF-8 if necessary, but in
> case of PyUnicode_FromFormat, the resulting string will be decoded as
> UTF-8, so all characters in the format string will be checked anyways.

I chose ASCII instead of UTF-8 because a UTF-8 decoder is long (210 lines)
and complex (see PyUnicode_DecodeUTF8Stateful()), whereas an ASCII decoder
is just "unicode_char = (Py_UNICODE)byte;" preceded by an if that checks
that 0 <= byte <= 127.
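
The ASCII decoding step described above can be sketched as follows. This is only an illustration: ascii_decode() is a hypothetical name, and uint32_t stands in for Py_UNICODE so that the example compiles without Python.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-alone ASCII decoder. uint32_t is an assumed stand-in
 * for Py_UNICODE. Returns the number of code points written, or -1 as soon
 * as a byte outside 0..127 is found. */
long ascii_decode(const unsigned char *bytes, size_t len, uint32_t *out)
{
    size_t i;

    for (i = 0; i < len; i++) {
        /* The single range check: unsigned bytes are always >= 0. */
        if (bytes[i] > 127)
            return -1;
        /* For ASCII, the code point equals the byte value. */
        out[i] = (uint32_t)bytes[i];
    }
    return (long)len;
}
```

Decoding the three bytes of "abc" yields three code points, while any input containing a byte above 127 (for example the UTF-8 encoding of a non-ASCII character) is rejected instead of being decoded.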

Nobody noticed my change simply because the whole Python code base only
passes ASCII format strings to PyUnicode_FromFormatV().

Victor

History
Date	User	Action	Args
2010-11-19 20:06:23	vstinner	set	recipients: + vstinner, amaury.forgeotdarc, belopolsky, ezio.melotti
2010-11-19 20:06:13	vstinner	link	issue9769 messages
2010-11-19 20:06:13	vstinner	create