classification
Title: python é.py fails with UnicodeEncodeError if PYTHONFSENCODING is used
Type:
Stage:
Components: Interpreter Core, Unicode
Versions: Python 3.2
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: eric.araujo, r.david.murray, vstinner
Priority: normal
Keywords: patch

Created on 2010-10-07 01:25 by vstinner, last changed 2010-10-18 17:03 by eric.araujo. This issue is now closed.

Files
File name                 Uploaded                      Description
redecode_filename.patch   vstinner, 2010-10-07 01:25
Messages (7)
msg118089 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-07 01:25
If the name and/or full path of a program contains a non-ASCII character and PYTHONFSENCODING is set to an encoding different from the locale encoding, Python fails to open the program.

Example in the utf-8 locale:

$ PYTHONFSENCODING=ascii ./python é.py
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
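
The same error message can be reproduced at the Python level by encoding the filename with a codec that cannot represent it. This is only a minimal sketch of the symptom, not the interpreter's actual startup code path; the helper name `try_encode` is hypothetical.

```python
def try_encode(name, encoding):
    """Return the error message if `name` cannot be encoded, else None."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


# "\xe9" is the character é from the example above.
ascii_error = try_encode("\xe9.py", "ascii")
utf8_error = try_encode("\xe9.py", "utf-8")

print(ascii_error)  # the UnicodeEncodeError text, as a string
print(utf8_error)   # None: UTF-8 can represent é
```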

This issue is similar to #9992 and #10014.

Solutions: remove the PYTHONFSENCODING environment variable, or redecode the filename from the locale encoding to the filesystem encoding.

Attached patch implements the latter.
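
The idea of redecoding can be sketched in Python as follows. The attached patch itself is not reproduced here, so this is only an analogue under stated assumptions: the function name `redecode_filename` is hypothetical, and the sketch assumes both decoding steps use the `surrogateescape` error handler so that undecodable bytes survive the round trip.

```python
def redecode_filename(name, locale_encoding, fs_encoding):
    """Re-interpret a filename that was decoded with the locale encoding
    as if it had been decoded with the filesystem encoding."""
    # Recover the original bytes, then decode them again with the other codec.
    raw = name.encode(locale_encoding, "surrogateescape")
    return raw.decode(fs_encoding, "surrogateescape")


# Bytes as they would arrive on the command line in a UTF-8 locale.
argv_bytes = b"\xc3\xa9.py"
decoded_with_locale = argv_bytes.decode("utf-8", "surrogateescape")

# Redecode for an ASCII filesystem encoding: the non-ASCII bytes become
# lone surrogates instead of raising UnicodeEncodeError.
redecoded = redecode_filename(decoded_with_locale, "utf-8", "ascii")

# Encoding with the filesystem encoding now yields the original bytes.
roundtrip = redecoded.encode("ascii", "surrogateescape")
```

With this approach the filename passed to the operating system is byte-for-byte the one the user typed, whichever of the two encodings is in effect.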

--

We may also redecode Py_GetProgramName().
msg118436 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-10-12 17:06
I don’t understand why reading a filename would not respect the envvar stating the filesystem encoding.
msg118444 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-12 17:29
Éric, if you are saying, "the user asked for it, it *should* fail", then that is indeed one of the arguments put forward in issue 9992 where this was discussed.  But I think the emerging consensus is that it is better to just avoid the problem by always using the locale on Unix, and solve the problem that PYTHONFSENCODING was supposed to solve in a different way (by always using utf-8 on OSX and unicode on Windows).
msg118445 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-10-12 17:39
> if you are saying, "the user asked for it, it *should* fail", then
> that is indeed one of the arguments put forward in issue 9992 where
> this was discussed.
You could put it that way, thanks for phrasing my thoughts :)

> But I think the emerging consensus is that it is better to just avoid
> the problem by always using the locale on Unix,
*displays his lack of knowledge* Is it always correct to decode a filename with the locale encoding on Unix?  Can’t each filesystem have its own encoding?

> and solve the problem that PYTHONFSENCODING was supposed to solve in a
> different way (by always using utf-8 on OSX and unicode on Windows).
If there is a better alternate way, let’s go for it, and maybe remove PYTHONFSENCODING altogether, since it’s new in 3.2.

Thanks for explaining!  I’ll repay your time by reviewing the doc patches.
msg118492 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-13 00:25
> Is it always correct to decode a filename with the locale encoding
> on Unix?

Do you know something better than the locale encoding? I don't.

> Can’t each filesystem have its own encoding?

Yes, but how do you get the encoding of each filesystem? I think that few or no applications support such a case without mojibake. Backup programs can use the "raw" (bytes) API of Python 3 to avoid all encoding issues.
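
As an illustration of the bytes API mentioned above: passing a `bytes` path to functions such as `os.listdir()` and `open()` makes them work with `bytes` filenames, so no decoding takes place. The following is a minimal sketch using a temporary directory; the helper name `list_raw_names` and the sample filename are hypothetical.

```python
import os
import tempfile


def list_raw_names(directory):
    """List directory entries as bytes; `directory` must itself be bytes."""
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp:
    raw_dir = os.fsencode(tmp)                # str path -> bytes path
    raw_name = "\xe9.py".encode("utf-8")      # filename as raw bytes
    # open() accepts bytes paths, so the name is never decoded.
    with open(os.path.join(raw_dir, raw_name), "wb") as f:
        f.write(b"print('hello')\n")
    names = list_raw_names(raw_dir)

print(names)
```

A backup tool written this way copies names byte for byte and never needs to know which encoding a given filesystem uses.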

--

As R. David Murray wrote, read issue #9992 if you would like to know more about this problem and the different proposed solutions. I voted for the removal of PYTHONFSENCODING, which fixes most issues.
msg118593 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-13 22:20
Fixed by r85430 (remove PYTHONFSENCODING), see #9992.
msg119039 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-10-18 17:03
> Do you know something better than the locale encoding? I don't.
Neither do I, sorry.

>> Can’t each filesystem have its own encoding?
> Yes, but how do you get the encoding of each filesystem?
If I really had to, on Linux I could parse the output of the mount command, but this could get messy quickly, and of course is not okay for official Python.
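
A rough sketch of what such parsing might look like, operating on sample text rather than on a live system. It assumes the usual Linux `mount` output format (`device on mountpoint type fstype (options)`) and assumes the charset is exposed through an `iocharset=` option, which only some filesystem drivers provide. The function name `mount_charsets` and the sample data are hypothetical, and this is not a reliable way to determine a filesystem's encoding.

```python
import re

# One line of Linux `mount` output: device on mountpoint type fstype (opts)
_MOUNT_LINE = re.compile(
    r"^(?P<device>.+?) on (?P<mountpoint>.+) type (?P<fstype>\S+) "
    r"\((?P<options>[^)]*)\)"
)


def mount_charsets(mount_output):
    """Map mount points to their iocharset option, where one is present."""
    charsets = {}
    for line in mount_output.splitlines():
        match = _MOUNT_LINE.match(line)
        if match is None:
            continue  # skip lines in an unexpected format
        for option in match.group("options").split(","):
            key, _, value = option.partition("=")
            if key == "iocharset" and value:
                charsets[match.group("mountpoint")] = value
    return charsets


# Hypothetical sample output, for illustration only.
SAMPLE = (
    "/dev/sda1 on / type ext4 (rw,relatime)\n"
    "/dev/sdb1 on /mnt/usb type vfat (rw,iocharset=iso8859-1,codepage=437)\n"
)
charsets = mount_charsets(SAMPLE)
```

Mount points with no such option, like `/` in the sample, simply do not appear in the result, which shows how little this approach can say about most filesystems.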

> Backup programs can use the "raw" (bytes) API of Python 3 to avoid
> all encoding issues.
Neat!

> As R. David Murray wrote, read issue #9992 if you would like to know
> more about this problem and the different proposed solutions.
I did so, thanks for the pointer and all the explanations.
History
Date                 User            Action  Args
2010-10-18 17:03:42  eric.araujo     set     messages: + msg119039
2010-10-13 22:20:20  vstinner        set     status: open -> closed; resolution: fixed; messages: + msg118593
2010-10-13 00:25:38  vstinner        set     messages: + msg118492
2010-10-12 17:39:45  eric.araujo     set     messages: + msg118445
2010-10-12 17:29:26  r.david.murray  set     nosy: + r.david.murray; messages: + msg118444
2010-10-12 17:06:01  eric.araujo     set     nosy: + eric.araujo; messages: + msg118436
2010-10-07 11:33:05  vstinner        link    issue10014 dependencies
2010-10-07 01:25:31  vstinner        create