msg256952 - (view) |
Author: 王杰 (王杰) |
Date: 2015-12-24 03:49 |
I use CentOS 7.0 and changed LANG to gbk.
I have a file "gbk-utf-8.py" and its encoding is GBK.
# -*- coding:utf-8 -*-
import chardet
if __name__ == '__main__':
    s = '中文'
    print s, chardet.detect(s)
I execute it and everything is OK. However, it raises "SyntaxError" (as I expected) after I change "coding:utf-8" to "coding:utf8".
File "gbk-utf8.py", line 2
SyntaxError: 'utf8' codec can't decode byte 0xd6 in position 0: invalid continuation byte
Is this OK? Or where am I wrong?
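The byte 0xd6 in the error message can be reproduced outside the tokenizer. The following is only an illustrative sketch in Python 3 syntax (the report itself concerns Python 2.7): it encodes the same two characters as GBK and then tries to decode the result as UTF-8.

```python
# U+4E2D U+6587 are the two characters used in the report.
text = "\u4e2d\u6587"
gbk_bytes = text.encode("gbk")  # b'\xd6\xd0\xce\xc4'

try:
    gbk_bytes.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # 0xd6 announces a two-byte UTF-8 sequence, but 0xd0 is not a
    # continuation byte, so decoding fails at position 0.
    error = exc

print(repr(gbk_bytes), error)
```

So the SyntaxError seen with "utf8" is what a strict UTF-8 decode of these bytes produces.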
|
msg257005 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2015-12-25 18:03 |
What Python version?
|
msg257020 - (view) |
Author: 王杰 (王杰) |
Date: 2015-12-26 08:57 |
Python 2.7
|
msg257023 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2015-12-26 10:59 |
Here is a fix with a patch.
|
msg257024 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2015-12-26 11:00 |
> Here is a fix with a patch.
Oops, I mean 'with a unit test', sorry ;-)
|
msg257025 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2015-12-26 11:01 |
> I has a file "gbk-utf-8.py" and it's encoding is GBK.
I don't understand why you use "# coding: utf-8" if the file is encoded in GBK. Why not use "# coding: gbk"?
|
msg257026 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2015-12-26 11:25 |
The problem is not that an error is raised with coding:utf8, but that it isn't raised with coding:utf-8.
Here is an example with bad iso8859-3. An error is raised as expected.
|
msg257028 - (view) |
Author: 王杰 (王杰) |
Date: 2015-12-26 12:27 |
I'm learning about Python's encoding rules and I wrote it as a test case.
|
msg257047 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2015-12-26 20:35 |
Please fold these cases into one:
    if (strcmp(buf, "utf-8") == 0 ||
        strncmp(buf, "utf-8-", 6) == 0)
        return "utf-8";
    else if (strcmp(buf, "utf8") == 0 ||
             strncmp(buf, "utf8-", 5) == 0)  /* "utf8-" is 5 characters */
        return "utf-8";
->
    if (strcmp(buf, "utf-8") == 0 ||
        strncmp(buf, "utf-8-", 6) == 0 ||
        strcmp(buf, "utf8") == 0 ||
        strncmp(buf, "utf8-", 5) == 0)  /* "utf8-" is 5 characters */
        return "utf-8";
Also: I wonder why the regular utf_8.py codec doesn't complain about this case, since the above are only shortcuts for frequently used source code encodings.
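For readers who prefer Python, the folded comparison proposed in this message can be sketched roughly as follows. This is not the tokenizer's code: the function name is hypothetical, and the normalization step (lower-casing, treating "_" as "-", looking at only the first 12 characters) is an assumption about how buf is prepared before the comparison.

```python
def normalize_source_encoding(name):
    """Hypothetical helper: map UTF-8 spellings to the canonical "utf-8"."""
    # Assumption: the name is compared case-insensitively, with "_"
    # treated like "-", and only a short prefix is inspected.
    buf = name[:12].lower().replace("_", "-")
    if (buf == "utf-8" or buf.startswith("utf-8-")
            or buf == "utf8" or buf.startswith("utf8-")):
        return "utf-8"
    # Anything else is returned unchanged and left to the codec registry.
    return name


for spelling in ("utf-8", "utf8", "UTF_8", "utf8-unix", "gbk"):
    print(spelling, "->", normalize_source_encoding(spelling))
```

With such a check, "utf8" and "utf-8" would take the same path, so the two declarations could no longer behave differently.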
|
msg257050 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2015-12-26 21:46 |
In Python, there are multiple implementations of the utf-8 codec with many
shortcuts. I'm not surprised to see bugs depending on the exact syntax of
the utf-8 codec name. Maybe we need to share even more code to normalize
and compare codec names. (I think that py3 is better than py2 on this part.)
|
msg257051 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2015-12-26 22:05 |
On 26.12.2015 22:46, STINNER Victor wrote:
>
> In Python, there are multiple implementations of the utf-8 codec with many
> shortcuts. I'm not surprised to see bugs depending on the exact syntax of
> the utf-8 codec name. Maybe we need to share even more code to normalize
> and compare codec names. (I think that py3 is better than py2 on this part.)
There's only one implementation (the one in unicodeobject.c), which is used
directly or via the wrapper in the encodings package, but there
are a few shortcuts to bypass the codec registry scattered around
the code since UTF-8 is such a commonly used codec.
In the case in question, the codec registry should trigger decoding
via the encodings package (rather than going directly to C APIs),
so will eventually end up using the same code. I wonder why this does not
trigger the exception.
|
msg257060 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2015-12-27 01:05 |
> I wonder why this does not trigger the exception.
Because in the case of utf-8 and iso-8859-1, the decoding and encoding steps are omitted.
In the general case, the input is decoded from the specified encoding and then encoded to UTF-8 for the parser. But for the utf-8 and iso-8859-1 encodings the parser gets the raw data.
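A toy model of these two paths, in Python 3 syntax, may make the difference concrete. It is a sketch only: prepare_source and SHORTCUT_ENCODINGS are hypothetical names, and the real logic lives in the C tokenizer.

```python
# Assumption: only these exact names take the shortcut described above.
SHORTCUT_ENCODINGS = {"utf-8", "iso-8859-1"}


def prepare_source(raw, declared):
    """Hypothetical model: return the bytes that would reach the parser."""
    if declared in SHORTCUT_ENCODINGS:
        # Shortcut: the raw bytes are trusted and never validated.
        return raw
    # General case: decode from the declared encoding, re-encode as UTF-8.
    return raw.decode(declared).encode("utf-8")


gbk_bytes = "\u4e2d\u6587".encode("gbk")

# Declared as "utf-8": the GBK bytes pass through without any error.
unchecked = prepare_source(gbk_bytes, "utf-8")

# Declared as "utf8": not a shortcut name, so the bytes really get decoded.
try:
    prepare_source(gbk_bytes, "utf8")
    utf8_alias_failed = False
except UnicodeDecodeError:
    utf8_alias_failed = True

print(unchecked == gbk_bytes, utf8_alias_failed)
```

In this model, declaring the correct encoding ("gbk") is the only variant that yields valid UTF-8 for the parser.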
|
msg257074 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2015-12-27 12:33 |
On 27.12.2015 02:05, Serhiy Storchaka wrote:
>
>> I wonder why this does not trigger the exception.
>
> Because in the case of utf-8 and iso-8859-1, the decoding and encoding steps are omitted.
>
> In the general case, the input is decoded from the specified encoding and then encoded to UTF-8 for the parser. But for the utf-8 and iso-8859-1 encodings the parser gets the raw data.
Right, but since the tokenizer doesn't know about "utf8" it
should reach out to the codec registry to get a properly encoded
version of the source code (even though this is an unnecessary
round-trip).
There are a few other aliases for UTF-8 which would likely trigger
the same problem:
    # utf_8 codec
    'u8'        : 'utf_8',
    'utf'       : 'utf_8',
    'utf8'      : 'utf_8',
    'utf8_ucs2' : 'utf_8',
    'utf8_ucs4' : 'utf_8',
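These aliases can be checked against the codec registry directly. A small sketch in Python 3 syntax, using only the standard codecs module and the encodings.aliases table; the variable names are just for illustration:

```python
import codecs
from encodings.aliases import aliases

# The five aliases listed above.
listed = ["u8", "utf", "utf8", "utf8_ucs2", "utf8_ucs4"]

# What the alias table maps each name to.
in_alias_table = {alias: aliases.get(alias) for alias in listed}

# What the codec registry finally resolves each name to.
resolved = {alias: codecs.lookup(alias).name for alias in listed}

for alias in listed:
    print(alias, in_alias_table[alias], resolved[alias])
```

Every one of them ends up at the same codec, which is why each is a candidate for the inconsistency discussed in this issue.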
|
msg259990 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2016-02-10 07:54 |
I think the correct way is not to add "utf8" to the special case, but to remove "utf-8". Here is a patch.
|
msg260054 - (view) |
Author: Jim Jewett (Jim.Jewett) * |
Date: 2016-02-10 22:57 |
Does (did?) the utf8 special case allow for a much faster startup time, by not requiring all of the codecs machinery?
|
msg260078 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2016-02-11 08:16 |
Serhiy: Removing the shortcut would slow down the tokenizer a lot since UTF-8 encoded source code is the norm, not the exception.
The "problem" here is that the tokenizer trusts the source code to be in the correct encoding when you use one of utf-8 or iso-8859-1, and then skips the usual "decode into unicode, then encode to utf-8" step.
From a purist point of view, you are right, Python should always pass through those steps to detect encoding errors, but from a practical point of view, I think the optimization is fine.
|
|
Date | User | Action | Args |
2022-04-11 14:58:25 | admin | set | github: 70125 |
2016-12-06 12:14:59 | serhiy.storchaka | link | issue28884 dependencies |
2016-02-11 08:16:29 | lemburg | set | messages: + msg260078 |
2016-02-10 22:57:34 | Jim.Jewett | set | nosy: + Jim.Jewett; messages: + msg260054 |
2016-02-10 07:54:20 | serhiy.storchaka | set | files: + bad_utf8.patch; messages: + msg259990; components: + Interpreter Core; type: behavior; stage: patch review |
2015-12-27 12:33:05 | lemburg | set | messages: + msg257074 |
2015-12-27 01:05:15 | serhiy.storchaka | set | messages: + msg257060 |
2015-12-26 22:05:17 | lemburg | set | messages: + msg257051 |
2015-12-26 21:46:17 | vstinner | set | messages: + msg257050 |
2015-12-26 20:35:00 | lemburg | set | messages: + msg257047 |
2015-12-26 12:27:32 | 王杰 | set | messages: + msg257028 |
2015-12-26 11:25:37 | serhiy.storchaka | set | files: + bad_iso8859_3.py; nosy: + serhiy.storchaka; messages: + msg257026 |
2015-12-26 11:01:34 | vstinner | set | messages: + msg257025 |
2015-12-26 11:00:05 | vstinner | set | messages: + msg257024 |
2015-12-26 10:59:42 | vstinner | set | files: + utf8.patch; keywords: + patch; messages: + msg257023; versions: + Python 2.7 |
2015-12-26 08:57:27 | 王杰 | set | messages: + msg257020 |
2015-12-25 18:03:50 | terry.reedy | set | nosy: + terry.reedy; messages: + msg257005 |
2015-12-25 18:02:56 | terry.reedy | set | nosy: + vstinner |
2015-12-25 18:02:40 | terry.reedy | set | nosy: + lemburg, doerwalter |
2015-12-24 03:49:49 | 王杰 | create | |