
Author: lemburg
Recipients: ezio.melotti, gvanrossum, kennyluck, lemburg, loewis, pitrou, serhiy.storchaka, tchrist, vstinner
Date: 2013-10-08.10:03:06
Message-id: <5253D843.40506@egenix.com>
In-reply-to: <116765461.59095722.1381224798470.JavaMail.root@zimbra10-e2.priv.proxad.net>
Content:
On 08.10.2013 11:33, Antoine Pitrou wrote:
> 
> Antoine Pitrou added the comment:
> 
>> MS Notepad and MS Office save Unicode text files in UTF-16-LE,
>> unless you explicitly specify UTF-8, just like many other Windows
>> applications that support Unicode text files:
> 
> I'd be curious to know if people actually edit *text files* using
> Microsoft Word (rather than Word documents).
> Same for Notepad, which is much too limited to edit anything other
> than a 10-line configuration file.

The question is not so much which program people use for editing.
The point is that the "Unicode text" file format is defined as
UTF-16-LE on Windows (see the links I posted).
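
For illustration, a minimal sketch of reading such a file (the
filename is hypothetical); the "utf-16" codec consumes the BOM
that Notepad writes and picks the byte order from it:

    # "Unicode text" as saved by Notepad: UTF-16-LE with a BOM.
    with open("notes.txt", "rb") as f:
        raw = f.read()
    assert raw.startswith(b"\xff\xfe")          # little-endian BOM
    text = raw.decode("utf-16")                 # BOM-aware decode
    assert text == raw[2:].decode("utf-16-le")  # same result, BOM skipped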

>> You are forgetting that wchar_t is UTF-16 on Windows, so UTF-16
>> is all around you when working on Windows, not only in the OS APIs,
>> but also in most other Unicode APIs you find on Windows:
> 
> Still, unless those APIs get passed rather large strings, the performance
> difference should be irrelevant IMHO. We're talking about using those APIs
> from Python, not from a raw optimized C program.

Antoine, I'm just pointing out that your statement that UTF-16
is not widely used may apply to the Unix world, but
it doesn't apply to Windows. Java also uses UTF-16
internally and makes this available via JNI as jchar*.

The APIs on those platforms are used from Python (by the
interpreter itself and by extensions), and they rely on the
UTF-16 Python codec to convert the data to Python Unicode/string
objects. So the fact that UTF-16 is widely used on some of the
more popular platforms does matter.
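
Here's a rough sketch of that conversion path (the buffer below
is just a stand-in for data an extension would receive from a
wide-char Windows API):

    # A wchar_t* buffer on Windows holds UTF-16-LE code units;
    # extensions convert it to a str via the UTF-16 codec.
    wide_buf = "C:\\Temp\\résumé.txt".encode("utf-16-le")
    py_str = wide_buf.decode("utf-16-le")
    assert py_str == "C:\\Temp\\résumé.txt"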

The UTF-8, UTF-16 and UTF-32 codecs need to be as fast as
possible so that converting between platform Unicode data and
Python's internal string representation does not become a
performance problem.
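
As a rough way to gauge that cost (a quick micro-benchmark
sketch, nothing rigorous):

    import timeit
    # Decode ~2 MB of UTF-16-LE data 100 times.
    data = ("é" * 1000000).encode("utf-16-le")
    print(timeit.timeit(lambda: data.decode("utf-16-le"), number=100))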

The real question is whether the UTF-16/32 codecs can be made
fast while still detecting lone surrogates, not whether UTF-16
is widely used.
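
For reference, this is the kind of input a strict codec has to
catch; the rejection shown is the behavior once lone-surrogate
detection is in place (as of Python 3.4), and "surrogatepass" is
the error handler that skips the check:

    # 0xD800 is a high surrogate with no low surrogate following:
    # invalid UTF-16.
    bad = b"\x00\xd8A\x00"          # lone U+D800, then "A" (little-endian)
    try:
        bad.decode("utf-16-le")     # strict mode rejects it
    except UnicodeDecodeError as e:
        print(e)
    # "surrogatepass" (Python 3.4+) skips the validation:
    assert bad.decode("utf-16-le", "surrogatepass") == "\ud800A"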