classification
Title: Difference between utf8 and utf-8 when I define Python source code encoding.
Type: behavior
Stage: patch review
Components: Interpreter Core
Versions: Python 2.7

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Jim.Jewett, doerwalter, lemburg, serhiy.storchaka, terry.reedy, vstinner, 王杰
Priority: normal
Keywords: patch

Created on 2015-12-24 03:49 by 王杰, last changed 2016-02-11 08:16 by lemburg.

Files
utf8.patch (vstinner, 2015-12-26 10:59)
bad_iso8859_3.py (serhiy.storchaka, 2015-12-26 11:25)
bad_utf8.patch (serhiy.storchaka, 2016-02-10 07:54)
Messages (16)
msg256952 - (view) Author: 王杰 (王杰) Date: 2015-12-24 03:49
I use CentOS 7.0 and set LANG=gbk.

I have a file "gbk-utf-8.py" and its encoding is GBK.

# -*- coding:utf-8 -*-
import chardet
if __name__ == '__main__':
    s = '中文'
    print s, chardet.detect(s) 

I execute it and everything is OK. However, it raises a "SyntaxError" (as I expected) after I change "coding:utf-8" to "coding:utf8".

  File "gbk-utf8.py", line 2
SyntaxError: 'utf8' codec can't decode byte 0xd6 in position 0: invalid continuation byte

Is this OK? Or where am I wrong?
msg257005 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2015-12-25 18:03
What Python version?
msg257020 - (view) Author: 王杰 (王杰) Date: 2015-12-26 08:57
Python 2.7
msg257023 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-12-26 10:59
Here is a fix with a patch.
msg257024 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-12-26 11:00
> Here is a fix with a patch.

Oops, I meant 'with a unit test', sorry ;-)
msg257025 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-12-26 11:01
> I have a file "gbk-utf-8.py" and its encoding is GBK.

I don't understand why you use "# coding: utf-8" if the file is encoded in GBK. Why not use "# coding: gbk"?
msg257026 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-26 11:25
The problem is not that an error is raised with coding:utf8, but that it isn't raised with coding:utf-8.

Here is an example with bad iso8859-3. An error is raised as expected.
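For reference, the expected behavior can be reproduced with compile() on raw bytes (a Python 3 sketch; the byte value and file name here are illustrative, not taken from the attached bad_iso8859_3.py):

```python
# A source declared as iso8859-3 but containing a byte (0xA5) that is
# undefined in that charset: the decode step fails, and the compiler
# reports it as a SyntaxError, as expected.
source = b"# -*- coding: iso8859-3 -*-\ns = '\xa5'\n"
try:
    compile(source, "bad_iso8859_3.py", "exec")
except SyntaxError as exc:
    print(exc)
```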
msg257028 - (view) Author: 王杰 (王杰) Date: 2015-12-26 12:27
I'm learning about Python's encoding rules and I wrote this as a test case.
msg257047 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-12-26 20:35
Please fold these cases into one:

     if (strcmp(buf, "utf-8") == 0 ||
         strncmp(buf, "utf-8-", 6) == 0)
         return "utf-8";
     else if (strcmp(buf, "utf8") == 0 ||
         strncmp(buf, "utf8-", 6) == 0)
         return "utf-8";

->

     if (strcmp(buf, "utf-8") == 0 ||
         strncmp(buf, "utf-8-", 6) == 0 ||
         strcmp(buf, "utf8") == 0 ||
         strncmp(buf, "utf8-", 6) == 0)
         return "utf-8";

Also: I wonder why the regular utf_8.py codec doesn't complain about this case, since the above are only shortcuts for frequently used source code encodings.
msg257050 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-12-26 21:46
In Python, there are multiple implementations of the utf-8 codec with many
shortcuts. I'm not surprised to see bugs depending on the exact syntax of
the utf-8 codec name. Maybe we need to share even more code to normalize
and compare codec names. (I think that py3 is better than py2 on this part.)
msg257051 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-12-26 22:05
On 26.12.2015 22:46, STINNER Victor wrote:
> 
> In Python, there are multiple implementations of the utf-8 codec with many
> shortcuts. I'm not surprised to see bugs depending on the exact syntax of
> the utf-8 codec name. Maybe we need to share even more code to normalize
> and compare codec names. (I think that py3 is better than py2 on this part.)

There's only one implementation (the one in unicodeobject.c), which is used
directly or via the wrapper in the encodings package, but there
are a few shortcuts to bypass the codec registry scattered around
the code since UTF-8 is such a commonly used codec.

In the case in question, the codec registry should trigger decoding
via the encodings package (rather than going directly to C APIs),
so will eventually end up using the same code. I wonder why this does not
trigger the exception.
msg257060 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-27 01:05
> I wonder why this does not trigger the exception.

Because in the case of utf-8 and iso-8859-1, the decoding and encoding steps are omitted.

In the general case, the input is decoded from the specified encoding and then encoded to UTF-8 for the parser. But for the utf-8 and iso-8859-1 encodings, the parser gets the raw data.
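That general-case round trip can be sketched in a few lines (Python 3; recode_for_parser is a hypothetical helper name, not the actual tokenizer function). Note that the failing byte 0xd6 matches the traceback in the original report:

```python
def recode_for_parser(raw, declared_encoding):
    # General case: decode from the declared encoding, then re-encode
    # to UTF-8 for the parser; bad input fails at the decode step.
    return raw.decode(declared_encoding).encode('utf-8')

gbk_bytes = '中文'.encode('gbk')            # b'\xd6\xd0\xce\xc4'
recode_for_parser(gbk_bytes, 'gbk')         # round-trips fine
try:
    recode_for_parser(gbk_bytes, 'utf-8')   # wrong declared encoding
except UnicodeDecodeError as exc:
    print(exc)  # can't decode byte 0xd6 in position 0: invalid continuation byte
```

With the shortcut in place, this decode step never runs for sources declared as utf-8, so the bad GBK bytes reach the parser untouched.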
msg257074 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-12-27 12:33
On 27.12.2015 02:05, Serhiy Storchaka wrote:
> 
>> I wonder why this does not trigger the exception.
> 
> Because in the case of utf-8 and iso-8859-1, the decoding and encoding steps are omitted.
>
> In the general case, the input is decoded from the specified encoding and then encoded to UTF-8 for the parser. But for the utf-8 and iso-8859-1 encodings, the parser gets the raw data.

Right, but since the tokenizer doesn't know about "utf8" it
should reach out to the codec registry to get a properly encoded
version of the source code (even though this is an unnecessary
round-trip).

There are few other aliases for UTF-8 which would likely trigger
the same problem:

    # utf_8 codec
    'u8'                 : 'utf_8',
    'utf'                : 'utf_8',
    'utf8'               : 'utf_8',
    'utf8_ucs2'          : 'utf_8',
    'utf8_ucs4'          : 'utf_8',
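The codec registry itself treats these spellings identically; a quick check (using only aliases known to exist in both 2.7 and 3.x):

```python
import codecs

# Every alias resolves to the canonical name 'utf-8'; only the
# tokenizer's fast-path string comparison distinguishes them.
for alias in ('utf-8', 'utf8', 'u8', 'utf'):
    print(alias, '->', codecs.lookup(alias).name)
```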
msg259990 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-10 07:54
I think the correct way is not to add "utf8" as a special case, but to remove "utf-8". Here is a patch.
msg260054 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2016-02-10 22:57
Does (did?) the utf8 special case allow for a much faster startup time, by not requiring all of the codecs machinery?
msg260078 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-02-11 08:16
Serhiy: Removing the shortcut would slow down the tokenizer a lot since UTF-8 encoded source code is the norm, not the exception.

The "problem" here is that the tokenizer trusts the source code to be in the correct encoding when you use utf-8 or iso-8859-1, and then skips the usual "decode into unicode, then encode to utf-8" step.

From a purist point of view, you are right, Python should always pass through those steps to detect encoding errors, but from a practical point of view, I think the optimization is fine.
History
Date User Action Args
2016-12-06 12:14:59  serhiy.storchaka  link  issue28884 dependencies
2016-02-11 08:16:29  lemburg  set  messages: + msg260078
2016-02-10 22:57:34  Jim.Jewett  set  nosy: + Jim.Jewett; messages: + msg260054
2016-02-10 07:54:20  serhiy.storchaka  set  files: + bad_utf8.patch; messages: + msg259990; components: + Interpreter Core; type: behavior; stage: patch review
2015-12-27 12:33:05  lemburg  set  messages: + msg257074
2015-12-27 01:05:15  serhiy.storchaka  set  messages: + msg257060
2015-12-26 22:05:17  lemburg  set  messages: + msg257051
2015-12-26 21:46:17  vstinner  set  messages: + msg257050
2015-12-26 20:35:00  lemburg  set  messages: + msg257047
2015-12-26 12:27:32  王杰  set  messages: + msg257028
2015-12-26 11:25:37  serhiy.storchaka  set  files: + bad_iso8859_3.py; nosy: + serhiy.storchaka; messages: + msg257026
2015-12-26 11:01:34  vstinner  set  messages: + msg257025
2015-12-26 11:00:05  vstinner  set  messages: + msg257024
2015-12-26 10:59:42  vstinner  set  files: + utf8.patch; keywords: + patch; messages: + msg257023; versions: + Python 2.7
2015-12-26 08:57:27  王杰  set  messages: + msg257020
2015-12-25 18:03:50  terry.reedy  set  nosy: + terry.reedy; messages: + msg257005
2015-12-25 18:02:56  terry.reedy  set  nosy: + vstinner
2015-12-25 18:02:40  terry.reedy  set  nosy: + lemburg, doerwalter
2015-12-24 03:49:49  王杰  create