classification
Title: TextIOWrapper: Unicode Fallback Encoding on Python 3.3
Type:
Stage:
Components: Unicode
Versions: Python 3.3
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To: loewis
Nosy List: aronacher, carljm, loewis, vstinner
Priority: normal
Keywords:

Created on 2011-03-16 17:26 by aronacher, last changed 2017-12-19 01:01 by vstinner. This issue is now closed.

Messages (8)
msg131144 - (view) Author: Armin Ronacher (aronacher) * (Python committer) Date: 2011-03-16 17:26
Right now Python happily falls back to ASCII if it cannot parse your LC_CTYPE or something similar happens.  Instead of falling back to ASCII, it would be better if it fell back to UTF-8.

Alternatively it should at least give a warning that it's falling back to ASCII.

This issue was discussed at PyCon and the consensus so far was that falling back to UTF-8 in 3.3 might be a good idea and should not break much code as UTF-8 is a superset of ASCII.
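
For illustration, the "at least give a warning" alternative could be approximated in user code. This is only a sketch: the helper name warn_if_ascii_locale is hypothetical and not part of Python, and it assumes the relevant encoding is the one reported by locale.getpreferredencoding(False).

```python
import codecs
import locale
import warnings


def warn_if_ascii_locale(encoding=None):
    """Hypothetical helper: warn when the locale encoding resolves to ASCII.

    Returns True if a warning was issued, False otherwise.
    """
    if encoding is None:
        # Assumption: this is the encoding text files get by default.
        encoding = locale.getpreferredencoding(False)
    # codecs.lookup() normalizes aliases such as "ANSI_X3.4-1968" or "US-ASCII".
    if codecs.lookup(encoding).name == "ascii":
        warnings.warn(
            "locale encoding is ASCII; non-ASCII text will fail to encode",
            RuntimeWarning,
            stacklevel=2,
        )
        return True
    return False
```

Calling warn_if_ascii_locale() once at program startup would surface the situation described above instead of letting it pass silently.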
msg131290 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-17 22:02
In my experience (PYTHONFSENCODING, sys.setfilesystemencoding()): Python should just use the same encoding as the locale encoding, because *all* other programs on the system use the locale encoding. If none of the LANG, LC_ALL or LC_CTYPE environment variables is set, Python uses ASCII simply because nl_langinfo() answers ASCII.

Said differently: get_codeset() doesn't fail if there is no environment variable. If get_codeset() does fail, Python stops immediately with a fatal error; it doesn't fall back to ASCII or anything like that.

Python < 3.2 used ASCII at startup until the locale encoding codec was loaded (to avoid a bootstrap issue). But I fixed the bootstrap issue in Python 3.2: Python now *always* uses the locale encoding, even at startup. Before the codec is completely loaded, Python uses _Py_char2wchar() to decode filenames (and other data).

For more information, see also a previous attempt: issue #8725.
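
To see which encodings a given interpreter actually derived from the environment, something like the following can be run under different LANG / LC_ALL / LC_CTYPE settings. It is a minimal sketch; the function name describe_encodings is made up for this example, and the values it reports depend on the platform and Python version.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python derived from the current environment."""
    return {
        # Encoding used for file contents when none is given explicitly.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used to decode and encode file names.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```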
msg131327 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-18 12:59
After reading the related mail thread on python-dev, I realized that you are talking about TextIOWrapper's choice of encoding (file content, not file name). My previous message is about file names.
msg131329 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-18 13:04
TextIOWrapper is mostly based on locale.getpreferredencoding(), so msg131290 is still valid: if no env var is set, nl_langinfo() gives 'ASCII' (or something like that). But it is not easy to detect that env vars are not set.

I would prefer a completely different approach: always use UTF-8 by default if the encoding is not set.

Something like:
def open(filename, ..., encoding='UTF-8', ...)
TextIOWrapper.__init__(..., encoding='UTF-8', ...)

So Python would not rely on locales anymore.
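
The proposal above can be approximated today without changing Python itself. The following is a sketch only: open_utf8 is a hypothetical wrapper name, not an existing API, and it simply supplies UTF-8 when a text-mode caller gives no encoding.

```python
def open_utf8(path, mode="r", encoding=None, **kwargs):
    """Hypothetical wrapper: like open(), but text mode defaults to UTF-8.

    Binary modes are passed through untouched, since an encoding
    argument is not allowed there.
    """
    if "b" not in mode and encoding is None:
        encoding = "utf-8"
    return open(path, mode, encoding=encoding, **kwargs)
```

With such a wrapper, open_utf8("notes.txt", "w") writes UTF-8 regardless of the locale, while a caller who wants the locale behaviour can still pass an encoding explicitly.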
msg149473 - (view) Author: Carl Meyer (carljm) * Date: 2011-12-14 20:45
Here's an example real-world case where the only solution I could find was to simply avoid non-ASCII characters entirely (which is obviously not a real solution): https://github.com/pypa/virtualenv/issues/201#issuecomment-3145690

distutils/distribute require long_description to be a string, not bytes (so it can rfc822-escape it, and use string methods to do so), but do not explicitly set an output encoding when writing egg-info. This means that a developer has to choose between a) breaking installation of their package on any system with an ASCII default locale, or b) not using any non-ASCII characters in long_description.

One might say, "ok, this is a bug in distutils/distribute, it should explicitly specify UTF-8 encoding when writing egg-info." But if this is a sensible thing for distutils/distribute to do, regardless of user locale, why would it not be equally sensible for Python itself to have the default output encoding always be UTF-8 (with the ability for a developer who wants to support arbitrary user locale to explicitly do so)?
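
The failure mode can be reproduced in isolation. In this sketch an in-memory buffer stands in for the egg-info file, the encoding argument stands in for whatever the locale would have selected, and write_description is a hypothetical helper, not distutils code.

```python
import io


def write_description(text, encoding):
    """Encode text through a TextIOWrapper and return the resulting bytes."""
    buffer = io.BytesIO()
    wrapper = io.TextIOWrapper(buffer, encoding=encoding)
    wrapper.write(text)
    wrapper.flush()
    data = buffer.getvalue()
    wrapper.detach()  # keep the wrapper from closing the buffer later
    return data


if __name__ == "__main__":
    description = "A caf\u00e9 locator"
    # Explicit UTF-8 works on every system.
    print(write_description(description, "utf-8"))
    # An ASCII default cannot represent the same text.
    try:
        write_description(description, "ascii")
    except UnicodeEncodeError as exc:
        print("ASCII failed:", exc.reason)
```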
msg149476 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-12-14 21:07
> One might say, "ok, this is a bug in distutils/distribute, it should
> explicitly specify UTF-8 encoding when writing egg-info." But if this
> is a sensible thing for distutils/distribute to do, regardless of
> user locale, why would it not be equally sensible for Python itself
> to have the default output encoding always be UTF-8 (with the ability
> for a developer who wants to support arbitrary user locale to
> explicitly do so)?

The file encoding is part of the file format. Just as Python can't know
what the file format is (else it could allow writing, say, dictionaries
to a file), it can't know what the file encoding is, either - there is
a need to guess. distutils *does* know the format, so it's clearly a
bug in distutils and not in Python.

The Zen says "In the face of ambiguity, refuse the temptation to guess."
From that point of view, Python should just refuse to open files in text
mode with no encoding specified. However, it also says "Although
practicality beats purity.", which brings us back to guessing.

Guessing the "best" file encoding is really tricky, and Python has
chosen to use the locale's encoding. That can't be changed anymore
(except perhaps by PEP) since it would be an incompatible change.
msg159340 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-04-25 22:50
I don't think that using a fallback is a good idea. So I'm closing the issue. You can reopen the discussion on the python-dev mailing list if you don't agree with me or Martin.
msg308602 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-19 01:01
Follow-up: PEP 538 (bpo-28180) and PEP 540 (bpo-29240) have been accepted and implemented in Python 3.7!
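
As a rough illustration of what PEP 540 provides, the UTF-8 Mode can be switched on for a child interpreter with the -X utf8 option (Python 3.7 or later). The helper name below is made up for this sketch; the exact spelling of the reported encoding ('UTF-8' or 'utf-8') varies between Python versions.

```python
import subprocess
import sys


def preferred_encoding_in_utf8_mode():
    """Start a child interpreter in UTF-8 Mode and report its preferred encoding."""
    code = "import locale; print(locale.getpreferredencoding(False))"
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # sys.flags.utf8_mode tells whether the *current* interpreter uses UTF-8 Mode.
    print("current interpreter utf8_mode:", sys.flags.utf8_mode)
    print("child with -X utf8:", preferred_encoding_in_utf8_mode())
```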
History
Date                 User       Action  Args
2017-12-19 01:01:55  vstinner   set     messages: + msg308602
2012-04-25 22:50:48  vstinner   set     status: open -> closed; resolution: wont fix; messages: + msg159340
2011-12-14 21:07:11  loewis     set     messages: + msg149476
2011-12-14 20:45:50  carljm     set     nosy: + carljm; messages: + msg149473
2011-09-29 20:10:05  vstinner   set     title: Unicode Fallback Encoding on Python 3.3 -> TextIOWrapper: Unicode Fallback Encoding on Python 3.3
2011-03-18 13:04:55  vstinner   set     nosy: loewis, vstinner, aronacher; messages: + msg131329
2011-03-18 12:59:55  vstinner   set     nosy: loewis, vstinner, aronacher; messages: + msg131327
2011-03-17 22:02:37  vstinner   set     nosy: loewis, vstinner, aronacher; messages: + msg131290
2011-03-17 00:37:38  ned.deily  set     nosy: + vstinner
2011-03-16 17:27:04  aronacher  set     assignee: loewis; nosy: + loewis
2011-03-16 17:26:33  aronacher  create