classification
Title: Difference between utf8 and utf-8 when I define Python source code encoding.
Type: behavior
Stage: patch review
Components: Interpreter Core
Versions: Python 2.7

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Jim.Jewett, doerwalter, lemburg, serhiy.storchaka, terry.reedy, vstinner, 王杰
Priority: normal
Keywords: patch

Created on 2015-12-24 03:49 by 王杰, last changed 2016-02-11 08:16 by lemburg.

Files
utf8.patch (vstinner, 2015-12-26 10:59)
bad_iso8859_3.py (serhiy.storchaka, 2015-12-26 11:25)
bad_utf8.patch (serhiy.storchaka, 2016-02-10 07:54)
Messages (16)
msg256952 - (view) Author: 王杰 (王杰) Date: 2015-12-24 03:49
I use CentOS 7.0 and set LANG=gbk.

I have a file "gbk-utf-8.py" and its encoding is GBK.

# -*- coding:utf-8 -*-
import chardet
if __name__ == '__main__':
    s = '中文'
    print s, chardet.detect(s) 

I execute it and everything is OK. However, it raises a "SyntaxError" (as I expected) after I change "coding:utf-8" to "coding:utf8".

  File "gbk-utf8.py", line 2
SyntaxError: 'utf8' codec can't decode byte 0xd6 in position 0: invalid continuation byte

Is this OK? Or where am I wrong?
msg257005 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2015-12-25 18:03
What Python version?
msg257020 - (view) Author: 王杰 (王杰) Date: 2015-12-26 08:57
Python 2.7
msg257023 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-12-26 10:59
Here is a fix with a patch.
msg257024 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-12-26 11:00
> Here is a fix with a patch.

Oops, I meant 'with a unit test', sorry ;-)
msg257025 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-12-26 11:01
> I have a file "gbk-utf-8.py" and its encoding is GBK.

I don't understand why you use "# coding: utf-8" if the file is encoded in GBK. Why not use "# coding: gbk"?
msg257026 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-26 11:25
The problem is not that an error is raised with coding:utf8, but that it isn't raised with coding:utf-8.

Here is an example with bad iso8859-3. An error is raised as expected.
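For reference, the expected behavior can be reproduced with compile() on raw bytes (a Python 3 sketch; the byte value and file name here are illustrative, not taken from the attached bad_iso8859_3.py):

```python
# A source declared as iso8859-3 but containing a byte (0xA5) that is
# undefined in that charset: the decode step fails, and the compiler
# reports it as a SyntaxError, as expected.
source = b"# -*- coding: iso8859-3 -*-\ns = '\xa5'\n"
try:
    compile(source, "bad_iso8859_3.py", "exec")
except SyntaxError as exc:
    print(exc)
```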
msg257028 - (view) Author: 王杰 (王杰) Date: 2015-12-26 12:27
I'm learning about Python's encoding rules and I wrote this as a test case.
msg257047 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-12-26 20:35
Please fold these cases into one:

     if (strcmp(buf, "utf-8") == 0 ||
         strncmp(buf, "utf-8-", 6) == 0)
         return "utf-8";
     else if (strcmp(buf, "utf8") == 0 ||
         strncmp(buf, "utf8-", 6) == 0)
         return "utf-8";

->

     if (strcmp(buf, "utf-8") == 0 ||
         strncmp(buf, "utf-8-", 6) == 0 ||
         strcmp(buf, "utf8") == 0 ||
         strncmp(buf, "utf8-", 6) == 0)
         return "utf-8";

Also: I wonder why the regular utf_8.py codec doesn't complain about this case, since the above are only shortcuts for frequently used source code encodings.
msg257050 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-12-26 21:46
In Python, there are multiple implementations of the utf-8 codec with many
shortcuts. I'm not surprised to see bugs depending on the exact syntax of
the utf-8 codec name. Maybe we need to share even more code to normalize
and compare codec names. (I think that py3 is better than py2 on this part.)
msg257051 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-12-26 22:05
On 26.12.2015 22:46, STINNER Victor wrote:
> 
> In Python, there are multiple implementations of the utf-8 codec with many
> shortcuts. I'm not surprised to see bugs depending on the exact syntax of
> the utf-8 codec name. Maybe we need to share even more code to normalize
> and compare codec names. (I think that py3 is better than py2 on this part.)

There's only one implementation (the one in unicodeobject.c), which is used
directly or via the wrapper in the encodings package, but there
are a few shortcuts to bypass the codec registry scattered around
the code since UTF-8 is such a commonly used codec.

In the case in question, the codec registry should trigger decoding
via the encodings package (rather than going directly to C APIs),
so will eventually end up using the same code. I wonder why this does not
trigger the exception.
msg257060 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-27 01:05
> I wonder why this does not trigger the exception.

Because in the case of utf-8 and iso-8859-1, the decoding and encoding steps are omitted.

In the general case, the input is decoded from the specified encoding and then encoded to UTF-8 for the parser. But for the utf-8 and iso-8859-1 encodings, the parser gets the raw data.
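That general-case round trip can be sketched in a few lines (Python 3; recode_for_parser is a hypothetical helper name, not the actual tokenizer function). Note that the failing byte 0xd6 matches the traceback in the original report:

```python
def recode_for_parser(raw, declared_encoding):
    # General case: decode from the declared encoding, then re-encode
    # to UTF-8 for the parser; bad input fails at the decode step.
    return raw.decode(declared_encoding).encode('utf-8')

gbk_bytes = '中文'.encode('gbk')            # b'\xd6\xd0\xce\xc4'
recode_for_parser(gbk_bytes, 'gbk')         # round-trips fine
try:
    recode_for_parser(gbk_bytes, 'utf-8')   # wrong declared encoding
except UnicodeDecodeError as exc:
    print(exc)  # can't decode byte 0xd6 in position 0: invalid continuation byte
```

With the shortcut in place, this decode step never runs for sources declared as utf-8, so the bad GBK bytes reach the parser untouched.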
msg257074 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-12-27 12:33
On 27.12.2015 02:05, Serhiy Storchaka wrote:
> 
>> I wonder why this does not trigger the exception.
> 
> Because in the case of utf-8 and iso-8859-1, the decoding and encoding steps are omitted.
>
> In the general case, the input is decoded from the specified encoding and then encoded to UTF-8 for the parser. But for the utf-8 and iso-8859-1 encodings, the parser gets the raw data.

Right, but since the tokenizer doesn't know about "utf8" it
should reach out to the codec registry to get a properly encoded
version of the source code (even though this is an unnecessary
round-trip).

There are few other aliases for UTF-8 which would likely trigger
the same problem:

    # utf_8 codec
    'u8'                 : 'utf_8',
    'utf'                : 'utf_8',
    'utf8'               : 'utf_8',
    'utf8_ucs2'          : 'utf_8',
    'utf8_ucs4'          : 'utf_8',
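The codec registry itself treats these spellings identically; a quick check (using only aliases known to exist in both 2.7 and 3.x):

```python
import codecs

# Every alias resolves to the canonical name 'utf-8'; only the
# tokenizer's fast-path string comparison distinguishes them.
for alias in ('utf-8', 'utf8', 'u8', 'utf'):
    print(alias, '->', codecs.lookup(alias).name)
```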
msg259990 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-10 07:54
I think the correct way is not to add "utf8" as a special case, but to remove "utf-8". Here is a patch.
msg260054 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2016-02-10 22:57
Does (did?) the utf8 special case allow for a much faster startup time, by not requiring all of the codecs machinery?
msg260078 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-02-11 08:16
Serhiy: Removing the shortcut would slow down the tokenizer a lot since UTF-8 encoded source code is the norm, not the exception.

The "problem" here is that the tokenizer trusts the source code to be in the correct encoding when you use utf-8 or iso-8859-1, and then skips the usual "decode into unicode, then encode to utf-8" step.

From a purist point of view, you are right, Python should always pass through those steps to detect encoding errors, but from a practical point of view, I think the optimization is fine.
History
Date User Action Args
2016-12-06 12:14:59  serhiy.storchaka  link  issue28884 dependencies
2016-02-11 08:16:29  lemburg  set  messages: + msg260078
2016-02-10 22:57:34  Jim.Jewett  set  nosy: + Jim.Jewett; messages: + msg260054
2016-02-10 07:54:20  serhiy.storchaka  set  files: + bad_utf8.patch; messages: + msg259990; components: + Interpreter Core; type: behavior; stage: patch review
2015-12-27 12:33:05  lemburg  set  messages: + msg257074
2015-12-27 01:05:15  serhiy.storchaka  set  messages: + msg257060
2015-12-26 22:05:17  lemburg  set  messages: + msg257051
2015-12-26 21:46:17  vstinner  set  messages: + msg257050
2015-12-26 20:35:00  lemburg  set  messages: + msg257047
2015-12-26 12:27:32  王杰  set  messages: + msg257028
2015-12-26 11:25:37  serhiy.storchaka  set  files: + bad_iso8859_3.py; nosy: + serhiy.storchaka; messages: + msg257026
2015-12-26 11:01:34  vstinner  set  messages: + msg257025
2015-12-26 11:00:05  vstinner  set  messages: + msg257024
2015-12-26 10:59:42  vstinner  set  files: + utf8.patch; keywords: + patch; messages: + msg257023; versions: + Python 2.7
2015-12-26 08:57:27  王杰  set  messages: + msg257020
2015-12-25 18:03:50  terry.reedy  set  nosy: + terry.reedy; messages: + msg257005
2015-12-25 18:02:56  terry.reedy  set  nosy: + vstinner
2015-12-25 18:02:40  terry.reedy  set  nosy: + lemburg, doerwalter
2015-12-24 03:49:49  王杰  create