
Author belopolsky
Recipients amaury.forgeotdarc, belopolsky, ezio.melotti, vstinner
Date 2010-11-19.20:58:22
Message-id <AANLkTimBUsS3VQixbG-NLHP_PrDvK7jxHZQUfmqBCvam@mail.gmail.com>
In-reply-to <201011192106.06983.victor.stinner@haypocalc.com>
Content
On Fri, Nov 19, 2010 at 3:06 PM, STINNER Victor <report@bugs.python.org> wrote:
> .. Whereas PyUnicode_FromFormatV() converts the format string
> (bytes) to unicode (characters). If you would like a comparison in C, it's
> like printf()+mbstowcs() in the same function.
>

I see.  So it is really the

        else
            *s++ = *f;

that surreptitiously widens the characters.
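
For illustration, here is a minimal sketch of that kind of widening copy. It uses wchar_t as a stand-in for Py_UNICODE, the name widen_bytes is hypothetical, and it is not the actual PyUnicode_FromFormatV() code:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for the copy loop under discussion: every byte
   of the format string is assigned to one wide character, with no
   decoding step in between.  Returns the number of characters written,
   not counting the terminator. */
static size_t widen_bytes(const char *f, wchar_t *s)
{
    size_t n = 0;

    assert(f != NULL && s != NULL);
    while (*f) {
        /* The cast is an addition for this sketch: it avoids sign
           extension on platforms where plain char is signed. */
        *s++ = (unsigned char)*f++;
        n++;
    }
    *s = L'\0';
    return n;
}
```

With this loop, the two UTF-8 bytes 0xC3 0xA9 (the encoding of U+00E9) come out as the two characters U+00C3 and U+00A9, i.e. the bytes are silently treated as Latin-1.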

..
> I chose to use ASCII instead of UTF-8, because a UTF-8 decoder is long (210
> lines) and complex (see PyUnicode_DecodeUTF8Stateful()), whereas ASCII decoding
> is just "unicode_char = (Py_UNICODE)byte;" plus an if beforehand to check that
> 0 <= byte <= 127.

I don't think we need 210 lines to replace "*s++ = *f" with proper
UTF-8 logic.  Even if we do, the code can be shared with
PyUnicode_DecodeUTF8, and a UTF-8 iterator may be a welcome addition to
the Python C API.
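
To give a rough idea of the size involved, here is a minimal sketch of such an iterator in C. The name utf8_next and its signature are hypothetical (nothing like it exists in the C API), it assumes a NUL-terminated input, and it leaves out the error handlers and incremental decoding that PyUnicode_DecodeUTF8Stateful() supports:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.
   Decodes one code point from the NUL-terminated UTF-8 string at *p and
   advances *p past it.  Returns the code point, or -1 on malformed
   input, in which case *p is left unchanged.  The caller is expected to
   stop when **p == 0. */
static int32_t utf8_next(const unsigned char **p)
{
    const unsigned char *s;
    int32_t ch, min;
    int extra, i;

    assert(p != NULL && *p != NULL);
    s = *p;

    /* The lead byte gives the sequence length and the payload bits;
       min is the smallest code point allowed for that length. */
    if (s[0] < 0x80)                { ch = s[0];        extra = 0; min = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { ch = s[0] & 0x1F; extra = 1; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { ch = s[0] & 0x0F; extra = 2; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { ch = s[0] & 0x07; extra = 3; min = 0x10000; }
    else
        return -1;      /* stray continuation byte or invalid lead byte */

    for (i = 1; i <= extra; i++) {
        /* A NUL terminator fails this test too, so we never read past it. */
        if ((s[i] & 0xC0) != 0x80)
            return -1;
        ch = (ch << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (ch < min || ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
        return -1;

    *p = s + extra + 1;
    return ch;
}
```

A caller would loop with something like "while (*f) { ch = utf8_next(&f); ... }" and treat -1 as a decoding error. Surrogate pairs for narrow builds are not handled here either; that would be up to the caller.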
History
Date User Action Args
2010-11-19 20:58:25 belopolsky set recipients: + belopolsky, amaury.forgeotdarc, vstinner, ezio.melotti
2010-11-19 20:58:22 belopolsky link issue9769 messages
2010-11-19 20:58:22 belopolsky create