Issue 1503789: Cannot write source code in UTF16

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/43482

classification

Title:	Cannot write source code in UTF16
Type:	enhancement	Stage:	test needed
Components:	Documentation, Interpreter Core, Unicode	Versions:	Python 3.1, Python 2.7

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	christian.heimes, georg.brandl, loewis, tungwaiyip, vstinner
Priority:	low	Keywords:	patch

Created on 2006-06-10 00:38 by tungwaiyip, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
u16_bom.py	tungwaiyip, 2006-06-10 00:38	Sample code in UTF16 with BOM
u16_hybrid.py	tungwaiyip, 2006-06-10 00:39	Sample code in hybrid of ASCII and UTF16
tokenizer_bom.patch	vstinner, 2009-03-24 22:00

Messages (11)
msg28764 - (view)	Author: Wai Yip Tung (tungwaiyip)	Date: 2006-06-10 00:38
I intend to create some source code in UTF-16. I start the file with the encoding declaration line:

----------------------------------------------
# -*- coding: UTF-16LE -*-
print "Hello world"
----------------------------------------------

Unfortunately Python does not decode it as UTF-16 as expected. I found language in PEP 0263 that says "It does not include encodings which use two or more bytes for all characters like e.g. UTF-16." While I am disappointed, I accept that this limitation is necessary to keep the parser simple. So my first complaint is that this fact should be documented in http://www.python.org/doc/ref/encodings.html

Then I tried to save the source code with a BOM. I think there should be no excuse not to decode it as UTF-16 in that case. Unfortunately Python does not support this either.

Indeed, the only way to get it to work is to write the encoding declaration line in ASCII and the rest of the file in UTF-16 (see u16_hybrid.py). Obviously most text editors would not support this.

I came up with this because Microsoft has adopted UTF-16 in various places.
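A minimal sketch of why the declaration line is not recognized when the whole file is UTF-16. It applies the cookie pattern given in PEP 263 to raw bytes; the helper name `find_cookie` is hypothetical, and this illustrates the idea only, not CPython's actual tokenizer code.

```python
import re

# Cookie pattern from PEP 263, compiled as a bytes pattern. Scanning raw
# bytes like this is an assumption made for illustration.
COOKIE_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def find_cookie(first_line: bytes):
    """Return the declared encoding name, or None if no cookie is visible."""
    match = COOKIE_RE.match(first_line)
    return match.group(1).decode("ascii") if match else None


line = "# -*- coding: UTF-16LE -*-\n"
ascii_bytes = line.encode("ascii")      # one byte per character
utf16_bytes = line.encode("utf-16-le")  # every ASCII character is followed by a NUL byte

print(find_cookie(ascii_bytes))  # UTF-16LE
print(find_cookie(utf16_bytes))  # None: the bytes of "coding" are no longer adjacent
```

In the UTF-16 bytes each letter is separated by `\x00`, so a scanner that expects an ASCII-compatible first line never sees the word `coding`. This is consistent with the hybrid file being the only variant that worked.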
msg28765 - (view)	Author: Wai Yip Tung (tungwaiyip)	Date: 2006-06-10 00:39
msg28766 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2006-06-10 12:27
Would you like to work on a patch? There is, of course, a fairly obvious reason that this doesn't work: nobody has put effort into making it work. Personally, I suggest that you use some other encoding for source code, e.g. UTF-8.
msg28767 - (view)	Author: Wai Yip Tung (tungwaiyip)	Date: 2006-06-13 16:27
That sounds good. It is probably a good time to try out the instructions to build Python on Windows. http://groups.google.com/group/comp.lang.python/browse_thread/thread/f09c49f77bf0a578/3e076bfcafb064cd?hl=en#3e076bfcafb064cd Can you point me to the relevant source code?
msg28768 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2006-06-13 17:06
The parser code is in the Parser subdirectory. It would be good if you could follow the existing parsing strategy, i.e. convert the input to UTF-8, and then proceed with the normal parsing procedure.
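The strategy described here (convert the input to UTF-8, then parse normally) can be approximated from user code as a workaround. The sketch below assumes Python 3 and a source file that starts with a UTF-16 BOM; `compile_utf16_source` is a hypothetical helper, not an interpreter API.

```python
import codecs


def compile_utf16_source(raw: bytes, filename: str = "<utf16>"):
    """Decode UTF-16 bytes, re-encode as UTF-8, then compile as usual."""
    text = raw.decode("utf-16")         # the BOM selects the byte order and is consumed
    utf8_source = text.encode("utf-8")  # from here on, the normal UTF-8 path applies
    return compile(utf8_source, filename, "exec")


# Simulate a file saved as "UTF-16 with BOM" by an editor.
raw = codecs.BOM_UTF16_LE + 'greeting = "Hello world"\n'.encode("utf-16-le")

namespace = {}
exec(compile_utf16_source(raw), namespace)
print(namespace["greeting"])  # Hello world
```

Transcoding before compiling keeps NUL bytes away from the parser, which matters because of the `compile()` limitation discussed later in this thread.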
msg28769 - (view)	Author: Wai Yip Tung (tungwaiyip)	Date: 2006-06-23 01:31
Turns out the code is already written but disabled. Simply turning it on would work.

tokenizer.c(321):

    #if 0
        /* Disable support for UTF-16 BOMs until a decision
           is made whether this needs to be supported. */
        } else if (ch == 0xFE) {
            ch = get_char(tok);
            if (ch != 0xFF)
                goto NON_BOM;
            if (!set_readline(tok, "utf-16-be"))
                return 0;
            tok->decoding_state = -1;
        } else if (ch == 0xFF) {
            ch = get_char(tok);
            if (ch != 0xFE)
                goto NON_BOM;
            if (!set_readline(tok, "utf-16-le"))
                return 0;
            tok->decoding_state = -1;
    #endif

Executing a UTF-16 text file with a BOM would work. However, if I also include an encoding declaration plus BOM like this

    # -*- coding: UTF-16le -*-

it would result in this error, for some logic in the code that I couldn't sort out {tokenizer.c(291)}:

    g:\bin\py_repos\python-svn\PCbuild>python_d.exe test16le.py
      File "test16le.py", line 1
    SyntaxError: encoding problem: utf-8

If you need a justification for checking the UTF-16 BOM, it is Microsoft. As it was an early adopter of Unicode, before UTF-8 became popular, there is some software that generates UTF-16 by default. Not a fatal issue. But I see no reason not to support it either.
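For readers less familiar with the C tokenizer, the disabled branch can be mirrored in Python roughly as follows. This is an illustrative sketch; `sniff_utf16_bom` is a hypothetical name, and only the two byte sequences checked in the quoted C code are handled.

```python
import codecs


def sniff_utf16_bom(data: bytes):
    """Return (codec_name, bom_length), or (None, 0) if no UTF-16 BOM is present."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes 0xFE 0xFF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes 0xFF 0xFE
        return "utf-16-le", 2
    return None, 0


data = codecs.BOM_UTF16_LE + 'print("Hello world")\n'.encode("utf-16-le")
encoding, bom_length = sniff_utf16_bom(data)
print(encoding)                                # utf-16-le
print(data[bom_length:].decode(encoding), end="")
```

One caveat for any extension to UTF-32: the UTF-32-LE BOM (0xFF 0xFE 0x00 0x00) begins with the same two bytes as the UTF-16-LE BOM, so a sniffer covering both would have to test the longer sequence first.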
msg60175 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2008-01-19 14:42
I turned it into a feature request for 2.6 with low priority. Somebody should either fix/implement the UTF-16 support or update the docs.
msg83934 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-03-21 10:53
Detecting UTF-16 and UTF-32 is complex. I think that we can first support UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE with BOM. Most editors add a BOM (e.g. notepad.exe on Windows). I will try to fix this issue.
msg84115 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-03-24 22:00
Attached patch is a partial fix: support UTF-16-LE, UTF-16-BE and UTF-32-LE. Some remarks about my patch:

* UTF-32-BE is not supported because I'm too lazy tonight to finish the patch and because such a file begins with 0x00 0x00 whereas the parser doesn't like nul bytes
* I disabled the cookie check if the file starts with a BOM (the cookie is ignored) because the charset name is not normalized and so if the cookie is not exactly the same as the hardcoded charset name (e.g. "UTF-16LE"), the test will fail. E.g. "utf-16le" != "UTF-16LE" :-(
* compile() would require much more effort to support UTF-16-* and UTF-32-* because compile() simply rejects any string with a nul byte. It's because it uses functions like strlen() :-/ That's why I use subprocess([sys.executable, ...]) in the unit test and not simply compile()

Supporting UTF-{16,32}-{LE,BE} would be nice but it requires hacking the parser (especially the compile() builtin function) to support nul bytes...
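The two obstacles named in these remarks can be illustrated from Python. The sketch below is not part of the patch: `same_codec` is a hypothetical helper that compares charset names through the standard `codecs.lookup()` registry instead of comparing raw strings, and the last lines show where the nul bytes come from.

```python
import codecs


def same_codec(declared: str, detected: str) -> bool:
    """Compare two charset names after normalizing them via the codec registry."""
    try:
        return codecs.lookup(declared).name == codecs.lookup(detected).name
    except LookupError:
        # An unknown charset name can never match.
        return False


print("utf-16le" == "UTF-16LE")            # False: plain string comparison
print(same_codec("utf-16le", "UTF-16LE"))  # True: both resolve to the same codec
print(same_codec("utf-8", "UTF-16LE"))     # False

# Every ASCII character encoded as UTF-16 carries a nul byte, and the
# UTF-32-BE BOM itself starts with two of them.
print(b"\x00" in "print".encode("utf-16-le"))  # True
print(codecs.BOM_UTF32_BE)                     # b'\x00\x00\xfe\xff'
```

Normalizing names this way would allow a cookie check to stay enabled when a BOM is present, but it does nothing for the nul-byte problem, which is the reason given for leaving `compile()` unsupported.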
msg84823 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2009-03-31 16:18
Why am I assigned this issue?
msg101502 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-03-22 13:00
This feature was requested only once, 4 years ago, so I don't think that the feature is a "must-have" :-) I think that a lot of code would have to be modified in the Python parser to support the UTF-16-* and UTF-32-* codecs. Since there is no working patch, I consider that I can close this issue.

History
Date	User	Action	Args
2022-04-11 14:56:18	admin	set	github: 43482
2010-03-22 13:00:31	vstinner	set	status: open -> closed resolution: wont fix messages: + msg101502
2009-03-31 16:18:07	georg.brandl	set	assignee: georg.brandl -> messages: + msg84823
2009-03-24 22:00:16	vstinner	set	files: + tokenizer_bom.patch keywords: + patch messages: + msg84115
2009-03-21 10:53:42	vstinner	set	messages: + msg83934
2009-03-21 03:45:14	ajaksu2	set	versions: + Python 3.1, Python 2.7, - Python 2.6 nosy: + vstinner, georg.brandl assignee: georg.brandl components: + Unicode stage: test needed
2008-01-19 14:42:40	christian.heimes	set	versions: + Python 2.6, - Python 2.4 nosy: + christian.heimes messages: + msg60175 priority: normal -> low components: + Documentation type: enhancement
2006-06-10 00:38:33	tungwaiyip	create