Author ncoghlan
Recipients cvrebert, eli.bendersky, eric.araujo, ncoghlan, pitrou
Date 2012-02-12.05:55:41
Message-id <1329026142.73.0.733005146129.issue13997@psf.upfronthosting.co.za>
Content
Pondering it further (and reading subsequent comments here and in the thread), I agree an open_ascii() builtin would be a step backwards, not forwards.

So, morphing this issue into a documentation one to work out:
- the bare minimum we think Python 3 users should be learning about Unicode
- where to document that (with a reference to the Unicode HOWTO for anyone who wants to know more)

Some ideas specifically in the context of text files (for readers already familiar with the basic concept of text encodings):

1. The world is moving towards standardising on UTF-8 as the binary encoding used to store text files. However, we're a long way from living in that world right now. Other encodings (many, but far from all, ASCII compatible) will be encountered quite often, either as the default encoding on a particular platform, or as the encoding of a particular text file. Dealing with these correctly requires additional work.

2. To maximise the chance of correct local interoperability, Python 3's default choice of encoding is actually taken from the underlying platform rather than being forced to UTF-8. While it is becoming more and more common for platforms to set their preferred encoding to UTF-8, this is not yet universal (notably, Windows still does not use UTF-8 as the default encoding for text files in order to preserve compatibility with various Unicode-unaware legacy applications).

To handle this correctly in cross-platform applications and libraries, it is often necessary to explicitly pass "encoding='utf-8'" when opening a UTF-8 encoded text file.
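As a rough illustration of the point above, a minimal sketch of writing and reading a text file with the encoding spelled out, so the result does not depend on the platform default. The helper name roundtrip_utf8 and the use of a temporary file are just assumptions for the example:

```python
import os
import tempfile


def roundtrip_utf8(text):
    """Write text to a temporary file as UTF-8, then read it back."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # Passing encoding explicitly avoids relying on the platform default.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)


sample = "caf\u00e9 \u65e5\u672c\u8a9e"
print(roundtrip_utf8(sample) == sample)
```

Without the explicit argument, the same code may write different bytes (or fail outright) depending on the platform's preferred encoding.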

The default encoding on a given platform can be checked by running "import locale; locale.getpreferredencoding()" at the interactive prompt.
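The same check can be done from a script. A small sketch (the variable names are illustrative only) that also normalises the reported name through the codecs module, since platforms spell UTF-8 in several ways:

```python
import codecs
import locale

# The encoding the platform prefers for text data.
preferred = locale.getpreferredencoding()

# Normalise the name so spellings like "UTF-8" and "utf8" compare equal.
normalized = codecs.lookup(preferred).name
is_utf8 = normalized == "utf-8"

print(preferred, "(UTF-8)" if is_utf8 else "(not UTF-8)")
```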

3. Currently, it is still fairly common to encounter text files that are known to be stored in an ASCII-compatible text encoding without knowing precisely *which* encoding is used. The Python 2 text model allowed such files to be processed naively, simply by assuming they were in an ASCII-compatible encoding and passing any non-ASCII data faithfully through to the result. This permissive behaviour can be requested explicitly in Python 3 by passing "encoding='ascii'" and "errors='surrogateescape'" when opening a text file.

This approach parallels the behaviour of Python 2 and works correctly so long as it is fed data solely in ASCII-compatible encodings (such as UTF-8 and latin-1). Behaviour when fed data that uses other encodings (such as UTF-16) is unpredictable - common symptoms include Unicode encoding and decoding errors at unexpected points in a program, as well as silent corruption of the output text.
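A minimal sketch of that pass-through behaviour, assuming a small latin-1 encoded file whose encoding the program does not know. The file contents and the function name passthrough_edit are made up for the example, and newline="" is passed only so the bytes are not altered by newline translation:

```python
import os
import tempfile


def passthrough_edit(raw, old, new):
    """Edit the ASCII parts of a file while passing other bytes through."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(raw)
        # Undecodable bytes become lone surrogates instead of raising.
        with open(path, "r", encoding="ascii",
                  errors="surrogateescape", newline="") as f:
            text = f.read()
        # Writing with the same settings turns the surrogates back into bytes.
        with open(path, "w", encoding="ascii",
                  errors="surrogateescape", newline="") as f:
            f.write(text.replace(old, new))
        with open(path, "rb") as f:
            return text, f.read()
    finally:
        os.remove(path)


raw = "caf\u00e9 au lait\n".encode("latin-1")  # b"caf\xe9 au lait\n"
text, out = passthrough_edit(raw, "lait", "LAIT")
print(out == b"caf\xe9 au LAIT\n")  # the 0xE9 byte survives untouched

# The escaped byte is held as the lone surrogate U+DCE9, which a strict
# codec rejects: one source of errors "at unexpected points".
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)
```

The non-ASCII byte is never interpreted, only carried along, which is why the approach breaks down for encodings where ASCII-looking bytes do not mean ASCII characters.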