Author lemburg
Recipients ezio.melotti, gvanrossum, kennyluck, lemburg, loewis, pitrou, serhiy.storchaka, tchrist, vstinner
Date 2013-10-08.10:03:06
Message-id <5253D843.40506@egenix.com>
In-reply-to <116765461.59095722.1381224798470.JavaMail.root@zimbra10-e2.priv.proxad.net>
Content
On 08.10.2013 11:33, Antoine Pitrou wrote:
> 
> Antoine Pitrou added the comment:
> 
>> MS Notepad and MS Office save Unicode text files in UTF-16-LE,
>> unless you explicitly specify UTF-8, just like many other Windows
>> applications that support Unicode text files:
> 
> I'd be curious to know if people actually edit *text files* using
> Microsoft Word (rather than Word documents).
> Same for Notepad, which is much too limited to edit anything other
> than a 10-line configuration file.

The question is not so much which program they use for editing.
The format "Unicode text file" is defined as UTF-16-LE on
Windows (see the links I posted).

>> You are forgetting that wchar_t is UTF-16 on Windows, so UTF-16
>> is all around you when working on Windows, not only in the OS APIs,
>> but also in most other Unicode APIs you find on Windows:
> 
> Still, unless those APIs get passed rather large strings, the performance
> difference should be irrelevant IMHO. We're talking about using those APIs
> from Python, not from a raw optimized C program.

Antoine, I'm just pointing out that your statement that UTF-16
is not widely used may apply to the Unix world, but
it doesn't apply to Windows. Java also uses UTF-16
internally and makes this available via JNI as jchar*.

The APIs on those platforms are used from Python (the interpreter
and also by extensions) and do use the UTF-16 Python codec to
convert the data to Python Unicode/string objects, so the fact
that UTF-16 is used widely on some of the more popular
platforms does matter.
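
As a rough illustration of that conversion step (a minimal sketch: the
`platform_bytes` buffer is a hypothetical stand-in built by hand, not the
result of a real Windows or JNI call):

```python
# Hypothetical stand-in for a buffer handed back by a UTF-16 platform API
# (e.g. a wchar_t* on Windows or a jchar* via JNI). We build the bytes
# ourselves here instead of calling any real API.
platform_bytes = "Python \U0001F40D".encode("utf-16-le")

# The 7 BMP characters take 2 bytes each; the non-BMP character is stored
# as a surrogate pair, i.e. 4 bytes.
print(len(platform_bytes))

# The UTF-16 codec turns the platform data into a Python string object.
text = platform_bytes.decode("utf-16-le")
print(len(text), ascii(text))
```

Every string crossing such an API boundary goes through a decode or encode
call like this one, which is why codec speed adds up.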

The UTF-8, UTF-16 and UTF-32 codecs need to be as fast as possible
in Python to avoid performance problems when converting
between platform Unicode data and the internal formats
used in Python.
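
One way to get a feel for the relative decode cost is a quick timing run.
This is only a sketch: the sample text, its size and the repeat count are
arbitrary assumptions, and the numbers will vary by machine and Python
version, so no results are claimed here.

```python
import timeit

# Arbitrary sample text (an assumption for illustration only).
sample = "platform text \u20ac " * 10_000


def time_decode(encoding, number=20):
    """Return the seconds needed to decode the sample `number` times."""
    data = sample.encode(encoding)
    return timeit.timeit(lambda: data.decode(encoding), number=number)


timings = {
    enc: time_decode(enc) for enc in ("utf-8", "utf-16-le", "utf-32-le")
}
for enc, seconds in timings.items():
    print(f"{enc:10s} {seconds:.4f}s")
```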

The real question is: Can the UTF-16/32 codecs be made fast
while still detecting lone surrogates? Not whether UTF-16
is widely used or not.
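
To make "detecting lone surrogates" concrete, here is a small sketch. It
assumes a Python version in which the UTF-16 codec rejects lone surrogates
under the default strict error handler and accepts the `surrogatepass`
handler (CPython 3.4 and later, as far as I know); older versions may
behave differently.

```python
lone = "\ud800"  # a high surrogate with no matching low surrogate

# Encoding: the strict codec is expected to refuse the lone surrogate.
try:
    lone.encode("utf-16-le")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True

# Opting out of the check explicitly with the surrogatepass handler.
passed = lone.encode("utf-16-le", "surrogatepass")

# Decoding: a high surrogate followed by an ordinary character ("A")
# is not valid UTF-16 and should be rejected as well.
try:
    b"\x00\xd8\x41\x00".decode("utf-16-le")
    decode_rejects = False
except UnicodeDecodeError:
    decode_rejects = True

print(strict_rejects, passed, decode_rejects)
```

The check itself is cheap per code unit, but it has to run for every code
unit in the input, which is where the performance question comes from.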
History
Date User Action Args
2013-10-08 10:03:06 lemburg set recipients: + lemburg, gvanrossum, loewis, pitrou, vstinner, ezio.melotti, tchrist, kennyluck, serhiy.storchaka
2013-10-08 10:03:06 lemburg link issue12892 messages
2013-10-08 10:03:06 lemburg create