Author lemburg
Recipients doerwalter, lemburg, serhiy.storchaka, terry.reedy, vstinner, 王杰
Date 2015-12-27.12:33:05
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <567FDA7E.5020405@egenix.com>
In-reply-to <1451178315.58.0.0417010168097.issue25937@psf.upfronthosting.co.za>
Content
On 27.12.2015 02:05, Serhiy Storchaka wrote:
> 
>> I wonder why this does not trigger the exception.
> 
> Because in the case of utf-8 and iso-8859-1, the decoding and encoding steps are omitted.
>
> In the general case the input is decoded from the specified encoding and then encoded to UTF-8 for the parser. But for the utf-8 and iso-8859-1 encodings the parser gets the raw data.

Right, but since the tokenizer doesn't know about "utf8", it
should reach out to the codec registry to get a properly encoded
version of the source code (even though this is an unnecessary
round-trip).
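
For illustration only (this is not the actual C tokenizer code, and the
helper name is made up), the kind of registry-based round-trip meant here
can be sketched in pure Python:

    import codecs

    def recode_source_for_parser(raw, declared_encoding):
        """Hypothetical sketch: decode the raw source bytes with whatever
        codec the coding declaration names (resolved via the codec
        registry), then re-encode to UTF-8 for the parser."""
        codec = codecs.lookup(declared_encoding)   # 'utf8' -> utf-8 codec
        return codec.decode(raw)[0].encode("utf-8")

    # 'utf8' resolves to the same codec as 'utf-8' in the registry:
    print(codecs.lookup("utf8").name)   # -> 'utf-8'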

There are a few other aliases for UTF-8 which would likely trigger
the same problem:

    # utf_8 codec
    'u8'                 : 'utf_8',
    'utf'                : 'utf_8',
    'utf8'               : 'utf_8',
    'utf8_ucs2'          : 'utf_8',
    'utf8_ucs4'          : 'utf_8',
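
As a quick illustration, each of these aliases resolves to the utf-8
codec when looked up through the registry:

    import codecs

    for alias in ("u8", "utf", "utf8", "utf8_ucs2", "utf8_ucs4"):
        print(alias, "->", codecs.lookup(alias).name)
    # all five print '... -> utf-8'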
History
Date                 User     Action  Args
2015-12-27 12:33:05  lemburg  set     recipients: + lemburg, doerwalter, terry.reedy, vstinner, serhiy.storchaka, 王杰
2015-12-27 12:33:05  lemburg  link    issue25937 messages
2015-12-27 12:33:05  lemburg  create