Author lemburg
Recipients Jim.Jewett, doerwalter, lemburg, serhiy.storchaka, terry.reedy, vstinner, 王杰
Date 2016-02-11.08:16:28
Serhiy: Removing the shortcut would slow down the tokenizer a lot since UTF-8 encoded source code is the norm, not the exception.

The "problem" here is that the tokenizer trusts the source code to be correctly encoded when the declared encoding is utf-8 or iso-8859-1, and then skips the usual "decode into Unicode, then encode to UTF-8" step.
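
To illustrate the trade-off, here is a minimal Python sketch of the two paths. It is not the tokenizer's actual C code: the function names `to_utf8_checked` and `to_utf8_trusting` are made up for this example, and only the utf-8 shortcut is modelled.

```python
def to_utf8_checked(raw: bytes, encoding: str) -> bytes:
    """Full path: decode into str, then encode to UTF-8 (detects bad bytes)."""
    return raw.decode(encoding).encode("utf-8")


def to_utf8_trusting(raw: bytes, encoding: str) -> bytes:
    """Shortcut path (hypothetical model): trust a utf-8 declaration."""
    if encoding.lower().replace("_", "-") == "utf-8":
        # No decode/encode round trip, so invalid bytes go unnoticed.
        return raw
    return to_utf8_checked(raw, encoding)


good = "x = 'é'\n".encode("utf-8")
bad = b"x = '\xff'\n"  # 0xff can never appear in valid UTF-8

# Valid input: both paths agree, the shortcut just does less work.
assert to_utf8_trusting(good, "utf-8") == to_utf8_checked(good, "utf-8")

# Invalid input: the shortcut passes it through unchanged ...
assert to_utf8_trusting(bad, "utf-8") == bad

# ... while the full path reports the encoding error.
try:
    to_utf8_checked(bad, "utf-8")
except UnicodeDecodeError as exc:
    print("detected:", exc.reason)
```

The shortcut saves a full decode and re-encode of every source file, at the price of not noticing invalid byte sequences at this stage.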

From a purist point of view, you are right: Python should always go through those steps to detect encoding errors. From a practical point of view, though, I think the optimization is fine.
Date                 User     Action  Args
2016-02-11 08:16:29  lemburg  set     recipients: + lemburg, doerwalter, terry.reedy, vstinner, Jim.Jewett, serhiy.storchaka, 王杰
2016-02-11 08:16:29  lemburg  set     messageid: <>
2016-02-11 08:16:29  lemburg  link    issue25937 messages
2016-02-11 08:16:28  lemburg  create