Message 153190 - Python tracker


Author	ncoghlan
Recipients	cvrebert, docs@python, eli.bendersky, eric.araujo, ezio.melotti, ncoghlan, pitrou, vstinner
Date	2012-02-12.09:19:37
Message-id	<1329038378.58.0.474374234993.issue13997@psf.upfronthosting.co.za>

Content
Usually because the file may contain certain ASCII markers (or you're inserting such markers), but beyond that, you only care that it's in a consistent ASCII-compatible encoding.

Parsing log files from sources that aren't set up correctly often falls into this category: you know the markers are ASCII, but the actual message contents may not be properly encoded (e.g. they use a locale-dependent encoding, but not all the log files are from the same machine and not all machines have their locale set up properly). That said, errors="replace" can be a better option for such "read-only" use cases.
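
For the read-only case, a minimal sketch might look like the following. The line layout ("<timestamp> <LEVEL> <message>") and the function name are assumptions made purely for illustration; only the ASCII level marker is inspected, and any non-ASCII bytes in the message are replaced with U+FFFD on decoding.

```python
def count_error_lines(path):
    """Count lines whose second space-separated field is the ASCII marker ERROR.

    Hypothetical line layout: "<timestamp> <LEVEL> <message>".
    Non-ASCII message bytes are decoded to U+FFFD; that loss is harmless here
    because nothing is ever written back out.
    """
    count = 0
    with open(path, encoding="ascii", errors="replace") as log:
        for line in log:
            fields = line.split(" ", 2)
            if len(fields) >= 2 and fields[1] == "ERROR":
                count += 1
    return count
```

Since the replacement is lossy, this only makes sense when the decoded text is never encoded again.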

A use case where you really do need errors="surrogateescape" is when you're reformatting a log file and you want to preserve the encoding of the messages while manipulating the pure ASCII timestamps and message headers. In that case, surrogateescape is the right answer, because you can manipulate the ASCII bits freely while preserving the log message contents when you write the reformatted files back out. The reformatting script offers an API that says "put any ASCII-compatible encoding in, and you'll get that same encoding back out".
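
A rough sketch of such a reformatting script, assuming a hypothetical "<timestamp> <level> <message>" line layout and a made-up output format; the function name is illustrative too. The point is only that both files are opened with encoding="ascii" and errors="surrogateescape", so undecodable message bytes pass through as lone surrogates and are restored to the original bytes on output.

```python
def reformat_log(src_path, dst_path):
    """Rewrite the ASCII header fields of each line, leaving message bytes untouched.

    Hypothetical input layout:  "<timestamp> <level> <message>"
    Hypothetical output layout: "[<timestamp>] <LEVEL>: <message>"
    """
    # newline="" disables newline translation so line endings survive as-is.
    with open(src_path, encoding="ascii", errors="surrogateescape",
              newline="") as src, \
         open(dst_path, "w", encoding="ascii", errors="surrogateescape",
              newline="") as dst:
        for line in src:
            fields = line.split(" ", 2)
            if len(fields) != 3:
                # Not in the expected layout: copy the line through unchanged.
                dst.write(line)
                continue
            timestamp, level, message = fields
            dst.write(f"[{timestamp}] {level.upper()}: {message}")
```

Latin-1 and UTF-8 messages in the same file both come out byte-for-byte identical, since the script never needs to know which encoding each message used.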

You'll get weird behaviour (much as you do in Python 2) if the assumption of an ASCII-compatible encoding is ever violated, but that would be equally true if the script tried to process things at the raw bytes level.
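
To illustrate the failure mode, here is a small sketch (the helper name is hypothetical) that feeds UTF-16 data, which is not ASCII compatible, through the same ascii/surrogateescape decoding. The bytes still round-trip, but the ASCII marker is no longer found, because every character is interleaved with NUL bytes.

```python
def probe_marker(raw, marker="ERROR"):
    """Return (marker_found, round_trips) for raw bytes decoded as ASCII
    with surrogateescape. Hypothetical helper for illustration only."""
    text = raw.decode("ascii", errors="surrogateescape")
    round_trips = text.encode("ascii", errors="surrogateescape") == raw
    return marker in text, round_trips


# ASCII-compatible encoding: the marker is visible and the bytes round-trip.
latin1_result = probe_marker("ERROR: café".encode("latin-1"))

# UTF-16 is not ASCII compatible: the bytes round-trip, but the marker is missed.
utf16_result = probe_marker("ERROR: café".encode("utf-16-le"))
```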

The assumption of an ASCII-compatible text encoding is a useful one a lot of the time. The problem with Python 2 is that it makes that assumption implicitly, and makes it almost impossible to disable. Python 3, on the other hand, assumes very little by default (basically what is returned by sys.getfilesystemencoding() and locale.getpreferredencoding()), thus requiring that the programmer know how to state their assumptions explicitly.
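
As a small sketch of what "stating assumptions explicitly" can mean in practice (both function names are hypothetical): the first helper just reports the two defaults mentioned above, and the second refuses to rely on them by making the caller name the encoding and error handler.

```python
import locale
import sys


def describe_default_encodings():
    """Report the defaults Python 3 falls back on when none are given."""
    return {
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(),
    }


def read_text_explicitly(path, encoding, errors="strict"):
    """Read a file with the caller's stated encoding assumption,
    instead of silently using the locale default."""
    with open(path, encoding=encoding, errors=errors) as f:
        return f.read()
```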
