Message 241460 - Python tracker


Author	vstinner
Recipients	ezio.melotti, ncoghlan, r.david.murray, vstinner
Date	2015-04-18.22:29:23
Message-id	<1429396163.9.0.133621367784.issue23993@psf.upfronthosting.co.za>
In-reply-to

Content

> "if you are using the C locale you or the OS are broken anyway, so we'll just pass the bytes through"

Exactly. Even if you use Unicode (the Python 3 str type), you still store the text as raw bytes (in a custom format, as surrogate characters).
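
To illustrate the "custom format", here is a minimal sketch, assuming the ASCII codec combined with the surrogateescape error handler (roughly what Python 3 applies to OS data under the C locale). The sample bytes are hypothetical.

```python
# Hypothetical input: "café" encoded as Latin-1, so 0xE9 is not valid ASCII.
raw = b"caf\xe9"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("ascii", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes unchanged.
roundtrip = text.encode("ascii", errors="surrogateescape")
print(roundtrip == raw)
```

The str object is only a carrier here: it holds the byte 0xE9 as U+DCE9 and knows nothing about which character that byte was meant to be.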

> I'm not entirely convinced this won't cause issues, but I suppose it might not cause any more issues than having things break due to the C locale does.

The most obvious issue is the comeback of mojibake. Since you manipulate raw bytes, it's easy to concatenate two byte strings encoded in two different encodings.
https://unicodebook.readthedocs.org/definitions.html#mojibake
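
A small sketch of how that can happen, assuming two hypothetical data sources that encode the same character differently (one in UTF-8, one in Latin-1) and that both are read through ASCII with surrogateescape:

```python
# The same character "é" from two hypothetical sources.
utf8_part = "\xe9".encode("utf-8")      # two bytes: 0xC3 0xA9
latin1_part = "\xe9".encode("latin-1")  # one byte: 0xE9

# Both decode without error, because undecodable bytes are smuggled as surrogates.
a = utf8_part.decode("ascii", errors="surrogateescape")
b = latin1_part.decode("ascii", errors="surrogateescape")

# Concatenation succeeds silently, and the mixed bytes are written out as-is.
mixed = (a + b).encode("ascii", errors="surrogateescape")

# No single encoding decodes the result correctly.
as_latin1 = mixed.decode("latin-1")  # the UTF-8 half turns into two wrong characters
try:
    mixed.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # the Latin-1 half is not valid UTF-8

print(ascii(as_latin1), utf8_ok)
```

Nothing fails at the point where the two strings are joined; the damage only shows up later, when someone tries to display the data.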

The question is not how bad it is to manipulate text as bytes. The problem is that a working application written for Python 2 starts to randomly fail (on non-ASCII characters) on Python 3 when the LC_CTYPE locale is the POSIX locale ("C"). The first question is: should I keep Python 2 or write my application in a language which doesn't force me to understand Unicode?
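
The "random" failure can be reproduced without touching the real locale. This is only a sketch: it assumes a text stream using the ASCII codec in strict mode, which stands in for what a Python 3 program gets when LC_CTYPE is "C". The helper name write_like_c_locale is made up for the example.

```python
import io


def write_like_c_locale(text):
    """Write text to an ASCII, strict-mode stream; return the bytes or None on failure."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="ascii", errors="strict")
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        # Only non-ASCII characters trigger this path.
        return None
    return buffer.getvalue()


print(write_like_c_locale("cafe"))     # ASCII-only data goes through
print(write_like_c_locale("caf\xe9"))  # the first non-ASCII character fails
```

The application works as long as the data happens to be ASCII and breaks on the first non-ASCII character, which is why the failure looks random from the user's side.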
