classification
Title: Cannot write source code in UTF16
Type: enhancement Stage: test needed
Components: Documentation, Interpreter Core, Unicode Versions: Python 3.1, Python 2.7
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: christian.heimes, georg.brandl, loewis, tungwaiyip, vstinner
Priority: low Keywords: patch

Created on 2006-06-10 00:38 by tungwaiyip, last changed 2010-03-22 13:00 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
u16_bom.py tungwaiyip, 2006-06-10 00:38 Sample code in UTF16 with BOM
u16_hybrid.py tungwaiyip, 2006-06-10 00:39 Sample code in hybrid of ASCII and UTF16
tokenizer_bom.patch vstinner, 2009-03-24 22:00
Messages (11)
msg28764 - (view) Author: Wai Yip Tung (tungwaiyip) Date: 2006-06-10 00:38
I intend to create some source code in UTF16. I start 
the file with the encoding declaration line:

----------------------------------------------
# -*- coding: UTF-16LE -*-
print "Hello world"
----------------------------------------------

Unfortunately Python does not decode it as UTF16 as
expected. I have found some language in PEP 0263 that
says "It does not include encodings which use two or
more bytes for all characters like e.g. UTF-16." While
I am disappointed, I accept that this limitation is
necessary to keep the parser simple. So my first
complaint is that this fact should be documented in

http://www.python.org/doc/ref/encodings.html

Then I tried to save the source code with a BOM. I think
there is no excuse not to decode it as UTF16 in
that case. Unfortunately Python does not support this
either.

Indeed, the only way to get it to work is to write the
encoding declaration line in ASCII and the rest of the
file in UTF16 (see u16_hybrid.py). Obviously most text
editors would not support this.

I bring this up because Microsoft has adopted UTF16 in
various places.
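
A minimal sketch of why a UTF-16 source is rejected, written for Python 3 rather than the Python 2 of this report. The helper name `try_compile` is hypothetical. It assumes only that compile() refuses input containing nul bytes, which is the behaviour described later in this thread; the exact exception type differs between Python versions, so both are caught.

```python
import codecs

# The same one-line program, encoded two ways.
source = 'print("Hello world")\n'
utf8 = source.encode("utf-8")
utf16 = codecs.BOM_UTF16_LE + source.encode("utf-16-le")


def try_compile(data):
    """Return True if compile() accepts the raw bytes, False otherwise."""
    try:
        compile(data, "<sample>", "exec")
        return True
    except (ValueError, SyntaxError):
        # UTF-16 stores every ASCII character with a nul byte,
        # and compile() rejects sources that contain nul bytes.
        return False


utf8_ok = try_compile(utf8)
utf16_ok = try_compile(utf16)
```

Here `utf8_ok` should be true and `utf16_ok` false, since every ASCII character in the UTF-16 version is paired with a nul byte.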



msg28765 - (view) Author: Wai Yip Tung (tungwaiyip) Date: 2006-06-10 00:39

msg28766 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-06-10 12:27
Would you like to work on a patch?

There is, of course, a fairly obvious reason that this
doesn't work: nobody has put effort into making it work.

Personally, I suggest that you use some other encoding for
source code, e.g. UTF-8.
msg28767 - (view) Author: Wai Yip Tung (tungwaiyip) Date: 2006-06-13 16:27
That sounds good. It is probably a good time to try out the 
instructions to build Python on Windows.

http://groups.google.com/group/comp.lang.python/browse_thread/thread/f09c49f77bf0a578/3e076bfcafb064cd?hl=en#3e076bfcafb064cd

Can you point me to the relevant source code?


msg28768 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-06-13 17:06
The parser code is in the Parser subdirectory. It would be
good if you could follow the existing parsing strategy, i.e.
convert the input to UTF-8, and then proceed with the normal
parsing procedure.
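
The strategy described above (convert the input to UTF-8, then parse normally) can be sketched in Python 3 at the user level. This is only an illustration, not the tokenizer change being discussed. The function name `compile_utf16_source` is hypothetical, and the sketch assumes the input starts with a BOM so that the "utf-16" codec can pick the byte order.

```python
import codecs


def compile_utf16_source(data, filename="<utf16>"):
    """Transcode BOM-prefixed UTF-16 bytes to UTF-8, then compile them."""
    # The "utf-16" codec reads the BOM to choose the byte order and strips it.
    text = data.decode("utf-16")
    # compile() treats bytes without an encoding declaration as UTF-8.
    utf8 = text.encode("utf-8")
    return compile(utf8, filename, "exec")


# Usage: build a small UTF-16-LE source with a BOM and run it.
data = codecs.BOM_UTF16_LE + 'greeting = "Hello world"\n'.encode("utf-16-le")
namespace = {}
exec(compile_utf16_source(data), namespace)
```

After this runs, `namespace["greeting"]` holds the string assigned by the UTF-16 source.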
msg28769 - (view) Author: Wai Yip Tung (tungwaiyip) Date: 2006-06-23 01:31
Turns out the code is already written but disabled. Simply 
turning it on would work.

tokenizer.c(321):
#if 0
	/* Disable support for UTF-16 BOMs until a decision
	   is made whether this needs to be supported.  */
	} else if (ch == 0xFE) {
		ch = get_char(tok); if (ch != 0xFF) goto NON_BOM;
		if (!set_readline(tok, "utf-16-be")) return 0;
		tok->decoding_state = -1;
	} else if (ch == 0xFF) {
		ch = get_char(tok); if (ch != 0xFE) goto NON_BOM;
		if (!set_readline(tok, "utf-16-le")) return 0;
		tok->decoding_state = -1;
#endif


Executing a UTF-16 text file with a BOM would then work.
However, if I also include an encoding declaration in addition
to the BOM, like this

  # -*- coding: UTF-16le -*-


it results in this error, caused by some logic in the code
that I couldn't sort out (tokenizer.c(291)):


  g:\bin\py_repos\python-svn\PCbuild>python_d.exe test16le.py
    File "test16le.py", line 1
  SyntaxError: encoding problem: utf-8


If you need a justification for checking the UTF-16 BOM, it
is Microsoft. Because it adopted Unicode early, before UTF-8
became popular, there is some software that generates UTF-16
by default. Not a fatal issue, but I see no reason not
to support it either.
msg60175 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-19 14:42
I turned it into a feature request for 2.6 with low priority. Somebody
should either fix/implement the UTF-16 support or update the docs.
msg83934 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-21 10:53
Detecting UTF-16 and UTF-32 is complex. I think that we can first support 
UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE with a BOM. Most editors add a 
BOM (e.g. notepad.exe on Windows). I will try to fix this issue.
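
A rough Python 3 sketch of BOM-based detection for the four encodings mentioned above, using the BOM constants from the standard codecs module. The function name `sniff_bom` and the table `_BOMS` are hypothetical. One detail worth noting: the UTF-32-LE BOM (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE), so the longer BOMs have to be tested first. A UTF-16-LE file whose first character is U+0000 would be misdetected by this sketch.

```python
import codecs

# Longest BOMs first, so UTF-32-LE is not mistaken for UTF-16-LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Usage: detect the encoding, then decode the bytes that follow the BOM.
raw = codecs.BOM_UTF16_BE + "x = 1\n".encode("utf-16-be")
encoding, skip = sniff_bom(raw)
text = raw[skip:].decode(encoding)
```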
msg84115 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-24 22:00
The attached patch is a partial fix: it supports UTF-16-LE, UTF-16-BE and 
UTF-32-LE. Some remarks about my patch:
 * UTF-32-BE is not supported because I'm too lazy tonight 
   to finish the patch and because such a file begins with 0x00 0x00
   whereas the parser doesn't like nul bytes
 * I disabled the cookie check if the file starts with a BOM (the
   cookie is ignored) because the charset name is not normalized
   and so if the cookie is not exactly the same as the hardcoded
   charset name (e.g. "UTF-16LE"), the test will fail. 
   E.g. "utf-16le" != "UTF-16LE" :-(
 * compile() would require much more effort to support UTF-16-* 
   and UTF-32-* because compile() simply rejects any string with a 
   nul byte. It's because it uses functions like strlen() :-/ That's
   why I use subprocess([sys.executable, ...]) in the unit test and
   not simply compile()

Supporting UTF-{16,32}-{LE,BE} would be nice, but it requires hacking the 
parser (especially the compile() builtin function) to support nul bytes...
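
The cookie mismatch described in the second remark ("utf-16le" != "UTF-16LE") is a naming problem, and at the Python level it can be sidestepped by comparing normalized codec names. The sketch below is Python 3 and is not what the patch does. It relies on codecs.lookup(), which resolves aliases to a canonical name, and the helper `same_codec` is hypothetical.

```python
import codecs


def same_codec(name_a, name_b):
    """Return True if both names resolve to the same registered codec."""
    try:
        # CodecInfo.name is the canonical name, e.g. "utf-16-le".
        return codecs.lookup(name_a).name == codecs.lookup(name_b).name
    except LookupError:
        # An unknown encoding name cannot match anything.
        return False


# Usage: compare a coding cookie with the encoding implied by the BOM.
cookie_matches_bom = same_codec("utf-16le", "UTF-16LE")
```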
msg84823 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2009-03-31 16:18
Why am I assigned this issue?
msg101502 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-22 13:00
This feature was requested only once, 4 years ago, so I don't think that the feature is a "must-have" :-)

I think that a lot of code would have to be modified in the Python parser to support the UTF-16-* and UTF-32-* codecs. Since there is no working patch, I consider that I can close this issue.
History
Date User Action Args
2010-03-22 13:00:31 vstinner set status: open -> closed
resolution: wont fix
messages: + msg101502
2009-03-31 16:18:07 georg.brandl set assignee: georg.brandl ->
messages: + msg84823
2009-03-24 22:00:16 vstinner set files: + tokenizer_bom.patch
keywords: + patch
messages: + msg84115
2009-03-21 10:53:42 vstinner set messages: + msg83934
2009-03-21 03:45:14 ajaksu2 set versions: + Python 3.1, Python 2.7, - Python 2.6
nosy: + vstinner, georg.brandl
assignee: georg.brandl
components: + Unicode
stage: test needed
2008-01-19 14:42:40 christian.heimes set versions: + Python 2.6, - Python 2.4
nosy: + christian.heimes
messages: + msg60175
priority: normal -> low
components: + Documentation
type: enhancement
2006-06-10 00:38:33 tungwaiyip create