Message 260078 - Python tracker

Author	lemburg
Recipients	Jim.Jewett, doerwalter, lemburg, serhiy.storchaka, terry.reedy, vstinner, 王杰
Date	2016-02-11.08:16:28
Message-id	<1455178589.31.0.0599435094671.issue25937@psf.upfronthosting.co.za>

Content
Serhiy: Removing the shortcut would slow down the tokenizer a lot since UTF-8 encoded source code is the norm, not the exception.

The "problem" here is that the tokenizer trusts the source code to be in the correct encoding when you use one of utf-8 or iso-8859-1, and then skips the usual "decode into unicode, then encode to utf-8" step.

From a purist point of view, you are right: Python should always pass through those steps to detect encoding errors. From a practical point of view, though, I think the optimization is fine.
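
To illustrate the trade-off, here is a minimal Python sketch of the two paths for the utf-8 case. It is only a model of the idea, not the tokenizer's actual C code: the function names `validating_path` and `trusting_path` are made up for this example, and the iso-8859-1 case is left out.

```python
import codecs


def validating_path(raw: bytes, declared: str) -> bytes:
    """Hypothetical full path: decode with the declared encoding,
    then re-encode as UTF-8. Invalid input raises UnicodeDecodeError."""
    return raw.decode(declared).encode("utf-8")


def trusting_path(raw: bytes, declared: str) -> bytes:
    """Hypothetical shortcut: if the declared encoding is UTF-8,
    hand the bytes through unchanged without checking them."""
    if codecs.lookup(declared).name == "utf-8":
        return raw  # no decode/encode round trip, so no validation
    return validating_path(raw, declared)


# A source line declared as UTF-8 but containing an invalid byte.
bad_source = b'x = "\xff"\n'

# The shortcut passes the invalid byte along unnoticed.
passed_through = trusting_path(bad_source, "utf-8")

# The full round trip detects the problem.
try:
    validating_path(bad_source, "utf-8")
    detected = False
except UnicodeDecodeError:
    detected = True
```

The shortcut saves a full decode and encode pass over every UTF-8 source file, at the cost of not noticing bytes that are invalid in the declared encoding.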

History
Date	User	Action	Args
2016-02-11 08:16:29	lemburg	set	recipients: + lemburg, doerwalter, terry.reedy, vstinner, Jim.Jewett, serhiy.storchaka, 王杰
2016-02-11 08:16:29	lemburg	set	messageid: <1455178589.31.0.0599435094671.issue25937@psf.upfronthosting.co.za>
2016-02-11 08:16:29	lemburg	link	issue25937 messages
2016-02-11 08:16:28	lemburg	create