Message 121568 - Python tracker


Author	belopolsky
Recipients	amaury.forgeotdarc, belopolsky, ezio.melotti, vstinner
Date	2010-11-19.20:58:22
Message-id	<AANLkTimBUsS3VQixbG-NLHP_PrDvK7jxHZQUfmqBCvam@mail.gmail.com>
In-reply-to	<201011192106.06983.victor.stinner@haypocalc.com>

Content
On Fri, Nov 19, 2010 at 3:06 PM, STINNER Victor <report@bugs.python.org> wrote:
> .. Whereas PyUnicode_FromFormatV() converts the format string
> (bytes) to unicode (characters). If you would like a comparison in C, it's
> like printf()+mbstowcs() in the same function.
>

I see.  So it is really the

        else
            *s++ = *f;

that surreptitiously widens the characters.
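
To illustrate, here is a minimal sketch of that kind of byte-by-byte
widening.  It is not the actual PyUnicode_FromFormatV() code: wide_char is
a hypothetical stand-in for Py_UNICODE, widen_bytes() is a made-up helper,
and the cast through unsigned char is my own choice for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Py_UNICODE; the real type's width depends on
   how Python was built. */
typedef uint32_t wide_char;

/* Copy the NUL-terminated bytes at f into s, widening each byte to one
   wide_char, in the spirit of the "*s++ = *f" branch quoted above.
   Returns the number of wide characters written (no terminator added). */
static size_t widen_bytes(const char *f, wide_char *s)
{
    wide_char *start = s;

    assert(f != NULL && s != NULL);
    for (; *f != '\0'; f++) {
        /* Cast through unsigned char so bytes >= 0x80 do not sign-extend
           on platforms where plain char is signed. */
        *s++ = (unsigned char)*f;
    }
    return (size_t)(s - start);
}
```

With this copy, the two UTF-8 bytes of "é" (0xC3 0xA9) come out as two
separate characters, U+00C3 and U+00A9, rather than the single character
U+00E9.  That is harmless only as long as every byte is ASCII.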

..
> I chose to use ASCII instead of UTF-8, because a UTF-8 decoder is long (210
> lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas an ASCII
> decoder is just "unicode_char = (Py_UNICODE)byte;" (plus an if before it to
> check that 0 <= byte <= 127).

I don't think we need 210 lines to replace "*s++ = *f" with proper
UTF-8 logic.  Even if we do, the code can be shared with
PyUnicode_DecodeUTF8, and a UTF-8 iterator may be a welcome addition to
the Python C API.
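
As a rough indication of the size involved, here is a sketch of a single
"decode the next code point" step.  utf8_next() is a hypothetical helper,
not an existing Python C API function, and it makes no attempt to match the
error handling or statefulness of PyUnicode_DecodeUTF8Stateful(); it simply
rejects malformed input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.
   Decode one code point from the UTF-8 bytes at *p; end points one past
   the last available byte.  On success, store the code point in *out,
   advance *p past the sequence, and return 0.  On malformed or truncated
   input, return -1 and leave *p unchanged. */
static int utf8_next(const unsigned char **p, const unsigned char *end,
                     uint32_t *out)
{
    const unsigned char *s;
    uint32_t ch, min;
    int extra, i;

    assert(p != NULL && *p != NULL && end != NULL && out != NULL);
    s = *p;
    if (s >= end)
        return -1;                      /* no input left */

    ch = *s++;
    if (ch < 0x80) {
        extra = 0; min = 0;             /* ASCII */
    } else if ((ch & 0xE0) == 0xC0) {
        extra = 1; min = 0x80; ch &= 0x1F;
    } else if ((ch & 0xF0) == 0xE0) {
        extra = 2; min = 0x800; ch &= 0x0F;
    } else if ((ch & 0xF8) == 0xF0) {
        extra = 3; min = 0x10000; ch &= 0x07;
    } else {
        return -1;                      /* stray continuation or bad lead byte */
    }

    if (end - s < extra)
        return -1;                      /* truncated sequence */
    for (i = 0; i < extra; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;                  /* not a continuation byte */
        ch = (ch << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
    if (ch < min || (ch >= 0xD800 && ch <= 0xDFFF) || ch > 0x10FFFF)
        return -1;

    *out = ch;
    *p = s + extra;
    return 0;
}
```

A caller would invoke it in a loop until *p reaches end, storing each
decoded value where the current code stores the widened byte.  How errors
should be reported from inside the format routine is left open here.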

History
Date	User	Action	Args
2010-11-19 20:58:25	belopolsky	set	recipients: + belopolsky, amaury.forgeotdarc, vstinner, ezio.melotti
2010-11-19 20:58:22	belopolsky	link	issue9769 messages
2010-11-19 20:58:22	belopolsky	create