classification
Title: Clearly explain the bare minimum Python 3 users should know about Unicode
Type: enhancement Stage: resolved
Components: Documentation, Unicode Versions: Python 3.4, Python 3.2, Python 3.3
process
Status: closed Resolution: duplicate
Dependencies: Superseder: Unicode HOWTO up to date?
View: 4153
Assigned To: docs@python Nosy List: Jim.Jewett, cvrebert, docs@python, ezio.melotti, flox, giampaolo.rodola, haypo, merwok, nadeem.vawda, ncoghlan, paul.moore, pitrou, terry.reedy, tshepang
Priority: normal Keywords:

Created on 2012-02-12 04:33 by ncoghlan, last changed 2013-01-28 02:20 by ezio.melotti. This issue is now closed.

Messages (27)
msg153164 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-12 04:33
(This proposes a new builtin, so may need to become a PEP)

A common programming task is "I want to process this text file, I know it's in an ASCII compatible encoding, I don't know which one specifically, but I'm only manipulating the ASCII parts so it doesn't matter".

In Python 2, you handle that task by doing:

    f = open(fname)

The non-ASCII parts are then carried along as 8-bit bytes and reproduced faithfully when written back out.

In Python 3, you handle it by doing:

    f = open(fname, encoding="ascii", errors="surrogateescape")

The non-ASCII parts are then carried along as lone surrogate code points (U+DC80 to U+DCFF) and reproduced faithfully when written back out.
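
A minimal sketch of that round trip, assuming a small latin-1 file stands in for "some ASCII-compatible encoding" (the file names and the `name=` content are made up for illustration). Passing `newline=""` is an extra choice here so that line endings are left alone on every platform:

```python
import os
import tempfile

# Bytes in an ASCII-compatible encoding that the script does not know about
# (latin-1 is only an assumption for this example).
original = "name=Zoë\n".encode("latin-1")

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")    # hypothetical input file
    dst = os.path.join(tmp, "out.txt")   # hypothetical output file
    with open(src, "wb") as f:
        f.write(original)

    # Read as text: each undecodable byte becomes one escape code point.
    with open(src, encoding="ascii", errors="surrogateescape", newline="") as f:
        text = f.read()

    # Only the ASCII part is touched.
    text = text.replace("name=", "NAME=")

    # Write back with the same settings: escape code points turn back into bytes.
    with open(dst, "w", encoding="ascii", errors="surrogateescape", newline="") as f:
        f.write(text)

    with open(dst, "rb") as f:
        result = f.read()

print(result)
```

The non-ASCII byte comes out exactly as it went in, even though the script never learned which encoding produced it.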

It would be significantly more friendly to beginners (and migrants from Python 2) if the following shorthand spelling was available out of the box:

    f = open_ascii(fname)
msg153167 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-02-12 04:38
Would not adding a new keyword arg to open() be less intrusive and more consistent?

I.e. open(fname, asciionly=True) or something similar.
msg153169 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-12 04:48
Hmm, what happened to "not every one-liner should be a builtin function"?

> I'm only manipulating the ASCII parts

How can you be sure about that?

> It would be significantly more friendly to beginners (and migrants from 
> Python 2)

To the point that many of them would stop thinking about the problem, and start producing unicode-incompatible code.
The idea that it may be presented as a porting recipe is IMO a good reason to oppose introducing this new function.
msg153170 - (view) Author: Chris Rebert (cvrebert) * Date: 2012-02-12 04:51
@Bendersky:

Unlike open()'s other arguments, that one wouldn't be orthogonal though. It would be possible to write e.g.:

    f = open(fname, encoding="big5", errors="replace", ascii_only=True)

which seems disturbing, IMO. It would be nicer to rule out such impossible combinations categorically.
msg153171 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-12 04:51
No point to adding a new keyword arg - if people are going to do something like that, they may as well learn to use "errors" and "encoding" properly.

Adding open_ascii() would be an acknowledgement that "basically ASCII, but maybe with a few other bytes that just need to round-trip correctly" is a common enough use case to special case (in particular, it's convenient to have algorithms that can operate on both utf-8 and all 8-bit extended ASCII variants, including latin-1).

The downside to using surrogateescape is that if you ever *do* feed it a file in a non-ASCII compatible encoding and then perform ASCII-based manipulations, you'll get mojibake instead of an early UnicodeDecodeError. (i.e. exactly the same problem this kind of thing can cause in Python 2)
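
A small sketch of that failure mode, assuming UTF-16 (little endian, no BOM) as the non-ASCII-compatible encoding and a `# ` prefix as the "ASCII-based manipulation"; both are choices made only for this illustration:

```python
# Data in an encoding that is NOT ASCII compatible.
raw = "héllo".encode("utf-16-le")

# A strict ASCII decode fails early; this is the safety net being traded away.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With surrogateescape nothing is raised: the NUL bytes count as ASCII and
# the remaining byte is escaped.
text = raw.decode("ascii", errors="surrogateescape")

# An innocent-looking ASCII edit: prefix a comment marker.
edited = "# " + text
out = edited.encode("ascii", errors="surrogateescape")

# Read back as UTF-16, the two marker bytes fuse into one unrelated character.
roundtrip = out.decode("utf-16-le")
print(ascii(roundtrip))
```

No exception is raised anywhere after the decode, which is exactly why the corruption is easy to miss.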
msg153179 - (view) Author: Éric Araujo (merwok) * (Python committer) Date: 2012-02-12 05:32
IMO it is a fact that the characters used by human languages are stored as bytes by computers, so a programmer needs to know the basics about text handling and encoding.  I don’t like the idea of a built-in function helping people to put their hands on their ears and sing “la-la-la everything is ASCII”.  I know that it is hard to find oneself confronted with a UnicodeDecodeError, I’ve been there, but then I learned.  My hope is that good explanations in the FAQ, howto and library ref (with good PageRank, so that people googling error messages find us) can help people understand how to work with text.

That said, I’m going to re-read Armin’s post about Python 3 and Nick’s reply to it to understand your position better.

can’t-put-my-damn-name-in-so-many-websites-in-2012-grumble’ly yours
msg153182 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-12 05:55
Pondering it further (and reading subsequent comments here and in the thread), I agree an open_ascii() builtin would be a step backwards, not forwards.

So, morphing this issue into a documentation one to work out:
- the bare minimum we think Python 3 users should be learning about Unicode
- deciding where to document that (with a reference to the Unicode HOWTO for anyone that wants to know more)

Some ideas specifically in the context of text files (for readers already familiar with the basic concept of text encodings):

1. The world is moving towards standardising on UTF-8 as the binary encoding used to store text files. However, we're a long way from living in that world right now. Other encodings (many, but far from all, ASCII compatible) will be encountered quite often, either as the default encoding on a particular platform, or as the encoding of a particular text file. Dealing with these correctly requires additional work.

2. To maximise the chance of correct local interoperability, Python 3's default choice of encoding is actually taken from the underlying platform rather than being forced to UTF-8. While it is becoming more and more common for platforms to set their preferred encoding to UTF-8, this is not yet universal (notably, Windows still does not use UTF-8 as the default encoding for text files in order to preserve compatibility with various Unicode-unaware legacy applications).

To handle this correctly in cross-platform applications and libraries, it is often necessary to explicitly pass "encoding='utf-8'" when opening a UTF-8 encoded text file.

The default encoding on a given platform can be checked by running "import locale; locale.getpreferredencoding()" at the interactive prompt.

3. Currently, it is still fairly common to encounter text files that are known to be stored in an ASCII-compatible text encoding without knowing precisely *which* encoding is used. The Python 2 text model allowed such files to be processed naively simply by assuming they were in an ASCII-compatible encoding and passing any non-ASCII characters faithfully through to the result. This permissive behaviour can be requested explicitly in Python 3 by passing "encoding='ascii'" and "errors='surrogateescape'" when opening a text file.

This approach parallels the behaviour of Python 2 and works correctly so long as it is fed data solely in ASCII compatible encodings (such as UTF-8 and latin-1). Behaviour when fed data that uses other encodings is unpredictable - common symptoms include Unicode encoding and decoding errors at unexpected points in a program, as well as silent corruption of the output text.
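
As a rough illustration of point 2, the sketch below prints the platform default and then writes and reads a file with the encoding stated explicitly. The file name and sample text are assumptions for the example; only `open()` and `locale.getpreferredencoding()` come from the discussion above:

```python
import locale
import os
import tempfile

# The platform default; the value printed differs between machines.
print(locale.getpreferredencoding())

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")  # hypothetical file name

    # State the encoding explicitly on both write and read, so the result
    # does not depend on the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write("naïve café\n")
    with open(path, encoding="utf-8") as f:
        content = f.read()

    # The bytes on disk are UTF-8 regardless of the locale.
    with open(path, "rb") as f:
        stored = f.read()
```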
msg153183 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-02-12 06:00
If the concept is accepted, I see no better place for this than the
Unicode HOWTO. If it's too long, then a TL;DR; section should be added
in the beginning detailing "the bare minimum". No need to scatter such
information in bits and pieces around the documentation. That's what
the Unicode HOWTO is for.
msg153189 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-02-12 08:58
> A common programming task is "I want to process this text file,
> I know it's in an ASCII compatible encoding, I don't know which
> one specifically, but I'm only manipulating the ASCII parts
> so it doesn't matter".

Can you give more detail about this use case? Why would you ignore non-ASCII characters?
msg153190 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-12 09:19
Usually because the file may contain certain ASCII markers (or you're inserting such markers), but beyond that, you only care that it's in a consistent ASCII compatible encoding.

Parsing log files from sources that aren't set up correctly often falls into this category - you know the markers are ASCII, but the actual message contents may not be properly encoded (e.g. they use a locale dependent encoding, but not all the log files are from the same machine and not all machines have their locale set up properly). That said, errors="replace" can be a better option for such "read-only" use cases.

A use case where you really do need "errors='surrogateescape'" is when you're reformatting a log file and you want to preserve the encoding for the messages while manipulating the pure ASCII timestamps and message headers. In that case, surrogateescape is the right answer, because you can manipulate the ASCII bits freely while preserving the log message contents when you write the reformatted files back out. The reformatting script offers an API that says "put any ASCII compatible encoding in, and you'll get that same encoding back out".
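
A sketch of such a reformatting step, assuming a made-up log layout of `YYYY-MM-DD HH:MM:SS LEVEL message` and two lines written in different encodings. The `reformat` helper and the target layout are hypothetical:

```python
import re

# Two log lines whose messages use different encodings (an assumed example).
raw_log = (
    "2012-02-12 09:19:38 INFO café ouvert\n".encode("utf-8")
    + "2012-02-12 09:20:01 WARN señal débil\n".encode("latin-1")
)

STAMP = re.compile(r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) ")

def reformat(line):
    # Rewrite only the ASCII header:
    # "YYYY-MM-DD HH:MM:SS LEVEL " becomes "[LEVEL DD/MM/YYYY HH:MM:SS] ".
    return STAMP.sub(r"[\5 \3/\2/\1 \4] ", line)

text = raw_log.decode("ascii", errors="surrogateescape")
new_text = "".join(reformat(line) for line in text.splitlines(keepends=True))
new_raw = new_text.encode("ascii", errors="surrogateescape")

# For comparison: errors="replace" is fine for read-only use, but it cannot
# reproduce the original message bytes.
lossy = raw_log.decode("ascii", errors="replace").encode("ascii", errors="replace")

print(new_raw)
```

Each message keeps whatever encoding it arrived in, while the `errors="replace"` variant has already discarded the non-ASCII bytes.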

You'll get weird behaviour (i.e. as you do in Python 2) if the assumption of an ASCII compatible encoding is ever violated, but that would be equally true if the script tried to process things at the raw bytes level.

The assumption of an ASCII compatible text encoding is a useful one a lot of the time. The problem with Python 2 is it makes that assumption implicitly, and makes it almost impossible to disable it. Python 3, on the other hand, assumes very little by default (basically what it returns from sys.getfilesystemencoding() and locale.getpreferredencoding()), thus requiring that the programmer know how to state their assumptions explicitly.
msg153191 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-02-12 09:55
Why do you use Unicode with the ugly surrogateescape error handler in
this case? Bytes are just fine for such a use case.

The surrogateescape error handler produces unusual characters in the range
U+DC80-U+DCFF which cannot be printed to a console because sys.stdout
uses the strict error handler, and sys.stderr uses the
backslashreplace error handler. If I remember correctly, only the UTF-7
encoder allows lone surrogate characters.
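
A short sketch of the behaviour described here, using explicit `encode()` calls in place of the console streams. Treating these calls as a stand-in for what sys.stdout and sys.stderr do is an assumption of the example:

```python
# One undecodable byte, escaped on the way in.
escaped = b"caf\xe9".decode("ascii", errors="surrogateescape")
print(ascii(escaped))  # ascii() keeps the output printable on any console

# A strict encoder rejects the lone surrogate.
try:
    escaped.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# backslashreplace emits an escape sequence instead of failing.
shown = escaped.encode("utf-8", errors="backslashreplace")

# Only the matching surrogateescape encoder restores the original byte.
restored = escaped.encode("ascii", errors="surrogateescape")
```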
msg153198 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-12 11:18
If such use cases are indeed better handled as bytes, then that's what should be documented. However, there are some text processing assumptions that no longer hold when using bytes instead of strings (such as "x[0:1] == x[0]"). You also can't safely pass such byte sequences to various other APIs (e.g. urllib.parse will happily process surrogate escaped text without corrupting it, but will throw UnicodeDecodeError for bytes sequences that aren't pure 7-bit ASCII).
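
Both points can be checked with a few lines. The URL below is a made-up example, and `urlsplit` is used here as one representative urllib.parse function:

```python
from urllib.parse import urlsplit

s = "abc"
b = b"abc"

str_slice_equals_index = s[0:1] == s[0]     # both sides are the str "a"
bytes_slice_equals_index = b[0:1] == b[0]   # b"a" compared with the int 97

# Surrogate-escaped text passes through urllib.parse unchanged ...
escaped_url = b"http://example.com/caf\xe9".decode("ascii", errors="surrogateescape")
path = urlsplit(escaped_url).path

# ... while the equivalent non-ASCII bytes are rejected.
try:
    urlsplit(b"http://example.com/caf\xe9")
    bytes_accepted = True
except UnicodeDecodeError:
    bytes_accepted = False

print(str_slice_equals_index, bytes_slice_equals_index, ascii(path), bytes_accepted)
```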

Using surrogateescape instead means that you're only going to have problems if you go to encode the data to an encoding other than the source one. That's basically the way things work in Python 2 with 8-bit strings.
msg153202 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2012-02-12 12:16
A better example in terms of "intended to be text" might be ChangeLog files. These are clearly text files, but of sufficiently standard format that they can be manipulated programmatically.

Consider a program to get a list of all authors who changed a particular file. Scan the file for date lines, then scan the block of text below for the filename you care about. Extract the author from the date line, put into a set, sort and print.

All of this can be done assuming the file is ASCII-compatible, but requires non-trivial text processing that would be a pain to do on bytes. But author names are quite likely to be non-ASCII, especially if it's an international project. And the changelog file is manually edited by people on different machines, so the possibility of inconsistent encodings is definitely there. (I have seen this happen - it's not theoretical!)

For my code, all I care about is that the names round-trip, so that I'm not damaging people's names any more than has already happened.

encoding="ascii",errors="surrogateescape" sounds like precisely the right answer here.
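
A sketch of that program, assuming a GNU-style ChangeLog layout (a date line of the form `YYYY-MM-DD  Name  <email>` followed by indented `* filename:` entries). The layout, the sample entries and the `authors_for_file` helper are all assumptions for illustration:

```python
import re

# Date line: "YYYY-MM-DD  Author Name  <email>" (assumed layout).
DATE_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.*?)\s+<[^>]*>\s*$")

def authors_for_file(raw, filename):
    """Return the sorted author names whose entries mention filename."""
    text = raw.decode("ascii", errors="surrogateescape")
    authors = set()
    current = None
    for line in text.splitlines():
        match = DATE_LINE.match(line)
        if match:
            current = match.group(2)
        elif current is not None and line.lstrip().startswith("* " + filename):
            authors.add(current)
    return sorted(authors)

# Entries edited on machines with different encodings.
raw = (
    "2012-02-10  José García  <jose@example.org>\n\t* parser.c: Fix overflow.\n\n".encode("utf-8")
    + "2012-02-11  Zoë Müller  <zoe@example.org>\n\t* lexer.c: Tidy up.\n\n".encode("latin-1")
    + "2012-02-12  Zoë Müller  <zoe@example.org>\n\t* parser.c: Add test.\n".encode("latin-1")
)

authors = authors_for_file(raw, "parser.c")

# Encoding with the same handler gives each name back in its original bytes.
names_as_bytes = [a.encode("ascii", errors="surrogateescape") for a in authors]
print(names_as_bytes)
```

The names are never interpreted, only carried along, so each one is written out in the same bytes it was read in.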

(If it's hard to find a good answer in Python 3, it's very easy to decide to use Python 2 which "just works", or even other tools like awk which also take Python 2's naive approach - and dismiss Python 3's Unicode model as "too hard").

My mental model here is text editors, which let you open any file, do their best to display as much as they can and allow you to manipulate it without damaging the bits you don't change. I don't see any reason why people shouldn't be able to write Python 3 code that way if they need to.
msg153206 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-12 14:53
> My mental model here is text editors, which let you open any file, do
> their best to display as much as they can and allow you to manipulate
> it without damaging the bits you don't change. I don't see any reason
> why people shouldn't be able to write Python 3 code that way if they
> need to.

Some text editors try to guess the encoding, which is different from
"display invalid characters anyway".
Other text editors like gedit pop up an error when there are invalid
bytes according to the configured encoding.

That said, people *are* able to write Python 3 code the way you said.
They simply have to use the "surrogateescape" error handler.
msg153360 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-02-14 18:57
See bugs.python.org/issue14015 for one reason that surrogateescape isn't better known.
msg153606 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-02-17 22:25
I agree with no new builtin and appreciate that being taken off the table.

I think the place is the Unicode How-to. I think that document should be renamed Encodings and Unicode How-to. The reasons are 1) one has to first understand the concept of encoding characters and text as numbers, and 2) this issue (and the python-ideas discussion) is not about Unicode, but about using pre- (and non-)Unicode encodings with Python3's bytes and string types, and how that differs in Python3 versus using Python2's unicode and string types. If only Unicode encodings were used, with utf-8 dominant on the Internet (and it is now most common for web pages), the problems of concern here would not exist.

Learning about Unicode would mean learning about code units versus codepoints, normal versus surrogate chars, BMP versus extended chars (all of which are non-issues in wide builds and Py 3.3), 256-char planes, BOMs, surrogates, normalization forms, and character properties. While sometimes useful, these subjects are not the issue here.
msg153612 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-02-18 00:19
FWIW I recently gave a talk at PyCon Finland called "Understanding Encodings" that goes through the things you mentioned in the last message.

I could turn that into a patch for the Unicode Howto.
msg153645 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-18 14:58
The other thing that came out of the rambling Unicode thread on python-ideas is that we should clearly articulate the options for processing files in a task-based fashion and describe the trade-offs for the different alternatives.

I started writing up my notes on that as a tracker comment, but they became a little... long: http://readthedocs.org/docs/ncoghlan_devs-python-notes/en/latest/py3k_text_file_processing.html
msg153653 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-02-18 19:42
Yes, the 'how to' alternatives, with + and -, should be included in the doc addition. I thought it the best thing to come out of the python-ideas thread.
msg157210 - (view) Author: Chris Rebert (cvrebert) * Date: 2012-03-31 17:16
Links to the "rambling Unicode thread"s for posterity and convenience:

Gets into several issues, among them, Unicode:
http://mail.python.org/pipermail/python-ideas/2012-February/013665.html

Unicode-specific offshoot of the above:
http://mail.python.org/pipermail/python-ideas/2012-February/013825.html
msg180728 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-01-27 01:15
What's the status of this?

Issue #4153 might also be related.
msg180743 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-01-27 05:08
Current status:

#14015 is still valid (i.e. surrogateescape is not well documented)
#4153: the Unicode HOWTO still covers more than the bare minimum people need to know
Ned Batchelder's "Pragmatic Unicode" is one of the best intros to the topic I have seen: http://nedbatchelder.com/text/unipain.html

My full notes on the topic, which I'm still happy with as a "bare minimum Python 3 users should know about Unicode" are available at http://python-notes.boredomandlaziness.org/en/latest/python3/text_file_processing.html
msg180775 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-01-27 16:48
Maybe the Unicode HOWTO could be reorganized so that it first introduces the bare minimum and then expands the concepts for whoever wants to know more?
Or should we have a "basic" and an "advanced" Unicode HOWTO?
msg180793 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-01-27 20:49
I basically agree with Ezio. The doc currently starts with

Introduction to Unicode
History of Character Codes
...

It ends with

Tips for Writing Unicode-aware Programs.
  ...
  The most important tip is:
    Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.

I think the how-to should *start* with that general principle and continue with the specific task-based how-tos from the thread. This will tell people who at least vaguely know the following material how to get going in a practical manner.
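
A minimal sketch of that principle, assuming a hypothetical `normalise_names` helper that receives bytes in one known encoding and emits bytes in another:

```python
def normalise_names(raw: bytes, in_encoding: str, out_encoding: str) -> bytes:
    # 1. Decode at the boundary, as soon as the data arrives.
    text = raw.decode(in_encoding)
    # 2. Work purely on str: strip, title-case, de-duplicate, sort.
    names = sorted({line.strip().title() for line in text.splitlines() if line.strip()})
    result = "\n".join(names) + "\n"
    # 3. Encode only on the way out.
    return result.encode(out_encoding)

raw_in = "émile\nzoë\némile\n".encode("latin-1")
raw_out = normalise_names(raw_in, "latin-1", "utf-8")
print(raw_out.decode("utf-8"))
```

All the text operations in step 2 see characters rather than bytes, so they behave the same whichever encodings are used at the two edges.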
msg180795 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-01-27 21:00
If we agree on this, I can propose a patch in #4153 and this issue can be closed.
msg180819 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-01-28 01:50
Include a couple of "See Also" links out to my essay and Ned's article and that sounds good to me.

(Assuming I've adjusted the DNS settings correctly, this alternate URL for my essay should start working soon: http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html)
msg180821 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-01-28 02:20
OK, I'm going to close this then.

I'll take a look at the links and see if what they say can be included in the HOWTO.  As I mentioned in an earlier post I gave a few talks about Unicode and encodings, so I will take some material from there too.  Depending on the final result we can then decide if and what additional links are necessary.
History
Date User Action Args
2013-01-28 02:20:15  ezio.melotti  set  status: open -> closed; superseder: Unicode HOWTO up to date?; messages: + msg180821; resolution: duplicate; stage: needs patch -> resolved
2013-01-28 01:50:02  ncoghlan  set  messages: + msg180819
2013-01-27 21:00:29  ezio.melotti  set  messages: + msg180795
2013-01-27 20:49:40  terry.reedy  set  messages: + msg180793; versions: + Python 3.4
2013-01-27 16:48:13  ezio.melotti  set  messages: + msg180775
2013-01-27 05:08:02  ncoghlan  set  messages: + msg180743
2013-01-27 01:15:58  ezio.melotti  set  messages: + msg180728
2012-07-15 03:58:19  eli.bendersky  set  nosy: - eli.bendersky
2012-03-31 17:16:12  cvrebert  set  messages: + msg157210
2012-02-18 19:42:42  terry.reedy  set  messages: + msg153653
2012-02-18 14:58:56  ncoghlan  set  messages: + msg153645
2012-02-18 00:19:12  ezio.melotti  set  messages: + msg153612
2012-02-17 22:25:24  terry.reedy  set  nosy: + terry.reedy; messages: + msg153606
2012-02-14 18:57:28  Jim.Jewett  set  nosy: + Jim.Jewett; messages: + msg153360
2012-02-13 09:45:36  tshepang  set  nosy: + tshepang
2012-02-13 09:13:27  giampaolo.rodola  set  nosy: + giampaolo.rodola
2012-02-12 14:53:11  pitrou  set  messages: + msg153206
2012-02-12 13:00:19  flox  set  nosy: + flox
2012-02-12 12:16:32  paul.moore  set  nosy: + paul.moore; messages: + msg153202
2012-02-12 11:18:48  ncoghlan  set  messages: + msg153198
2012-02-12 10:37:04  nadeem.vawda  set  nosy: + nadeem.vawda
2012-02-12 09:55:48  haypo  set  messages: + msg153191
2012-02-12 09:19:38  ncoghlan  set  messages: + msg153190
2012-02-12 08:58:46  haypo  set  nosy: + haypo; messages: + msg153189
2012-02-12 08:17:48  ezio.melotti  set  nosy: + ezio.melotti; type: enhancement; components: + Unicode; stage: needs patch
2012-02-12 06:00:46  eli.bendersky  set  messages: + msg153183
2012-02-12 05:56:22  ncoghlan  set  nosy: + docs@python; title: Add open_ascii() builtin -> Clearly explain the bare minimum Python 3 users should know about Unicode; assignee: docs@python; versions: + Python 3.2, Python 3.3; components: + Documentation
2012-02-12 05:55:42  ncoghlan  set  messages: + msg153182
2012-02-12 05:32:07  merwok  set  nosy: + merwok; messages: + msg153179
2012-02-12 04:51:58  ncoghlan  set  messages: + msg153171
2012-02-12 04:51:43  cvrebert  set  messages: + msg153170
2012-02-12 04:48:21  pitrou  set  nosy: + pitrou; messages: + msg153169
2012-02-12 04:38:49  eli.bendersky  set  nosy: + eli.bendersky; messages: + msg153167
2012-02-12 04:37:04  cvrebert  set  nosy: + cvrebert
2012-02-12 04:33:06  ncoghlan  create