
classification
Title: Decode command line arguments from ASCII on FreeBSD and Solaris if the locale is C
Type: Stage:
Components: Unicode Versions: Python 3.2, Python 3.3, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, jcea, python-dev, vstinner
Priority: normal Keywords: patch

Created on 2012-11-11 22:14 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
workaround_codeset.patch vstinner, 2012-11-11 23:34 review
force_ascii.patch vstinner, 2012-11-12 14:40 review
Messages (11)
msg175401 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-11 22:14
On FreeBSD and OpenIndiana, sys.getfilesystemencoding() is 'ascii' when the locale is not set, whereas the locale encoding is ISO-8859-1.

This inconsistency causes various issues. For example, os.fsencode(sys.argv[1]) fails if the argument is not ASCII, because sys.argv is decoded from the locale encoding (by _Py_char2wchar()).
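The mismatch can be simulated on any platform by applying the two encodings by hand. This is only a minimal sketch: the argument bytes are an assumed example, ISO-8859-1 stands in for the locale encoding, and ASCII with the surrogateescape error handler stands in for what os.fsencode() uses when the filesystem encoding is 'ascii'.

```python
# Simulate the mismatch: argv is decoded with the locale encoding
# (ISO-8859-1), then re-encoded with the filesystem encoding (ASCII).
raw_arg = b"caf\xe9"  # assumed example of bytes passed on the command line

# Stand-in for the decoding done at startup from the locale encoding.
decoded = raw_arg.decode("iso8859-1")

try:
    # Stand-in for os.fsencode() when the filesystem encoding is 'ascii'.
    decoded.encode("ascii", "surrogateescape")
    failed = False
except UnicodeEncodeError:
    # U+00E9 is neither ASCII nor a lone surrogate produced by
    # surrogateescape, so it cannot be encoded.
    failed = True

print(failed)
```

The encode step fails because the decoded character is a real U+00E9, not an escaped byte, so the original byte cannot be recovered through ASCII.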

sys.getfilesystemencoding() is 'ascii' because nl_langinfo(CODESET) is used to get the locale encoding, and nl_langinfo(CODESET) announces ASCII (or an alias of this encoding).

Python should detect this case and set sys.getfilesystemencoding() to 'iso8859-1' if the locale encoding is actually ISO-8859-1 while nl_langinfo(CODESET) announces ASCII. For example, we can decode b'\xe9' with mbstowcs() and check whether it fails or the result is U+00E9.
msg175408 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-11 23:34
The attached patch works around the CODESET issue on OpenIndiana and FreeBSD. If the LC_CTYPE locale is "C" and nl_langinfo(CODESET) returns ASCII (or an alias of this encoding), b"\xE9" is decoded from the locale encoding: if the result is U+00E9, the patched Python uses ISO-8859-1. (If decoding fails, the locale encoding really is ASCII and the workaround is not used.)

If the result is different (b'\xe9' is not decoded from the locale encoding to U+00E9), a ValueError is raised. I wrote this test to detect bugs. I hope that our buildbots will validate the code. We may choose a different behaviour (e.g. keep ASCII).
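The decision described above can be sketched in pure Python. This is not the code in workaround_codeset.patch: the function name choose_fs_encoding and the probe callable (a stand-in for mbstowcs() that raises UnicodeDecodeError on failure) are hypothetical, and codecs.lookup() is used here only as an assumed way to recognise ASCII aliases.

```python
import codecs


def is_ascii_alias(codeset):
    """Return True if codeset names ASCII or one of its aliases."""
    try:
        return codecs.lookup(codeset).name == "ascii"
    except LookupError:
        return False


def choose_fs_encoding(lc_ctype, codeset, probe):
    """Hypothetical sketch of the workaround decision.

    probe stands in for mbstowcs(): it takes bytes and returns str,
    or raises UnicodeDecodeError if the locale cannot decode them.
    """
    if lc_ctype != "C" or not is_ascii_alias(codeset):
        return codeset  # the workaround does not apply
    try:
        result = probe(b"\xe9")
    except UnicodeDecodeError:
        return "ascii"  # the locale encoding really is ASCII
    if result == "\u00e9":
        return "iso8859-1"  # the locale actually decodes as Latin-1
    raise ValueError("unexpected decoding of 0xE9: %r" % result)


# Two simulated platforms: one decodes like Latin-1, one is strict ASCII.
lenient = choose_fs_encoding("C", "US-ASCII", lambda b: b.decode("iso8859-1"))
strict = choose_fs_encoding("C", "US-ASCII", lambda b: b.decode("ascii"))
print(lenient, strict)
```

Passing the probe as a parameter keeps the sketch runnable everywhere; a real implementation would call the C library directly.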

Example on FreeBSD 8.2, original Python 3.4:

$ ./python
>>> import sys, locale
>>> sys.getfilesystemencoding()
'ascii'
>>> locale.getpreferredencoding()
'US-ASCII'

Example on FreeBSD 8.2, patched Python 3.4:

$ ./python 
>>> import sys, locale
>>> sys.getfilesystemencoding()
'iso8859-1'
>>> locale.getpreferredencoding()
'iso8859-1'
msg175410 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-11 23:54
Some tests are failing with the patch:

======================================================================
FAIL: test_undecodable_env (test.test_subprocess.POSIXProcessTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/haypo/prog/python/default/Lib/test/test_subprocess.py", line 1606, in test_undecodable_env
    self.assertEqual(stdout.decode('ascii'), ascii(value))
AssertionError: "'abc\\xff'" != "'abc\\udcff'"
- 'abc\xff'
?      ^
+ 'abc\udcff'
?      ^^^

======================================================================
FAIL: test_strcoll_with_diacritic (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/haypo/prog/python/default/Lib/test/test_locale.py", line 364, in test_strcoll_with_diacritic
    self.assertLess(locale.strcoll('\xe0', 'b'), 0)
AssertionError: 126 not less than 0

======================================================================
FAIL: test_strxfrm_with_diacritic (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/haypo/prog/python/default/Lib/test/test_locale.py", line 367, in test_strxfrm_with_diacritic
    self.assertLess(locale.strxfrm('\xe0'), locale.strxfrm('b'))
AssertionError: '\xe0' not less than 'b'
msg175446 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-12 14:40
Hijacking locale.getpreferredencoding() may be dangerous. I attached a
new patch, force_ascii.patch, which uses a different approach: be
stricter than mbstowcs() and force the ASCII encoding when:
 - the LC_CTYPE locale is C
 - nl_langinfo(CODESET) is ASCII or an alias of ASCII
 - mbstowcs() is able to decode non-ASCII characters
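The three conditions above can be sketched as a single predicate. This is a hedged illustration, not the content of force_ascii.patch: should_force_ascii, decode_arg and the probe callable (a stand-in for mbstowcs()) are hypothetical names, and probing every byte from 0x80 to 0xFF is an assumption about how "able to decode non-ASCII characters" might be checked.

```python
import codecs


def should_force_ascii(lc_ctype, codeset, probe):
    """Hypothetical sketch: True when all three conditions hold."""
    if lc_ctype != "C":
        return False
    try:
        if codecs.lookup(codeset).name != "ascii":
            return False
    except LookupError:
        return False
    # Assumption: try each non-ASCII byte; one success is enough.
    for byte in range(0x80, 0x100):
        try:
            probe(bytes([byte]))
        except UnicodeDecodeError:
            continue
        return True
    return False


def decode_arg(raw, force_ascii, locale_encoding):
    """Decode a command line argument, keeping undecodable bytes escaped."""
    encoding = "ascii" if force_ascii else locale_encoding
    return raw.decode(encoding, "surrogateescape")


force = should_force_ascii("C", "US-ASCII", lambda b: b.decode("iso8859-1"))
arg = decode_arg(b"caf\xe9", force, "iso8859-1")
print(force, ascii(arg))
```

With ASCII forced, the byte 0xE9 becomes the lone surrogate U+DCE9 instead of U+00E9, so it can later be encoded back to the original byte.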

msg176434 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-11-26 17:38
Victor, any progress on this?
msg176436 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-26 17:56
> Victor, any progress on this?

We have two options, and I don't know which one is better (safer). Does
the terminal handle non-ASCII characters with a C locale on FreeBSD or
Solaris?
msg176869 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-12-04 02:23
New changeset c25635b137cc by Victor Stinner in branch 'default':
Issue #16455: On FreeBSD and Solaris, if the locale is C, the
http://hg.python.org/cpython/rev/c25635b137cc
msg176870 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-12-04 02:30
> We have two options, I don't know which one is the best (safer).

Forcing ASCII is safer. Python should announce that it does not "understand" non-ASCII bytes on the command line. I also chose this option because isalpha(0xe9) returns 0 (even if mbstowcs(0xe9) returns L"\xe9"): FreeBSD doesn't consider U+00E9 to be a letter in the C locale, so Python should also treat this byte as raw data.
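Treating a byte as "raw data" corresponds to the surrogateescape error handler. A minimal sketch, using the same b"abc\xff" value that appears in the test_undecodable_env failure earlier in this thread; the variable names are illustrative only:

```python
# A non-ASCII byte decoded from ASCII with surrogateescape becomes a lone
# surrogate (U+DC80..U+DCFF) instead of a real character.
raw = b"abc\xff"
text = raw.decode("ascii", "surrogateescape")

# Encoding with the same codec and error handler restores the exact bytes.
round_trip = text.encode("ascii", "surrogateescape")

print(ascii(text))
print(round_trip == raw)
```

Because the byte survives the round trip unchanged, Python never has to claim that it knows which character the byte represents.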
msg176871 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-12-04 02:32
> New changeset c25635b137cc by Victor Stinner in branch 'default':
> Issue #16455: On FreeBSD and Solaris, if the locale is C, the
> http://hg.python.org/cpython/rev/c25635b137cc

This changeset should fix #16218 on FreeBSD and Solaris (these OSes should now correctly decode undecodable command line arguments).
msg178864 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-01-03 00:24
New changeset c256764e2b3f by Victor Stinner in branch '3.2':
Issue #16455: On FreeBSD and Solaris, if the locale is C, the
http://hg.python.org/cpython/rev/c256764e2b3f

New changeset 5bb289e4fb35 by Victor Stinner in branch '3.3':
(Merge 3.2) Issue #16455: On FreeBSD and Solaris, if the locale is C, the
http://hg.python.org/cpython/rev/5bb289e4fb35
msg178866 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-01-03 00:41
I backported the fix to Python 3.2 and 3.3 because I consider it important enough.
History
Date User Action Args
2022-04-11 14:57:38 admin set github: 60659
2013-01-03 01:07:03 vstinner set status: open -> closed; resolution: fixed
2013-01-03 00:41:24 vstinner set messages: + msg178866; versions: + Python 3.2, Python 3.3
2013-01-03 00:24:06 python-dev set messages: + msg178864
2012-12-04 02:32:10 vstinner set messages: + msg176871
2012-12-04 02:30:31 vstinner set messages: + msg176870
2012-12-04 02:24:48 vstinner set title: sys.getfilesystemencoding() is not the locale encoding on FreeBSD and OpenSolaris when the locale is not set -> Decode command line arguments from ASCII on FreeBSD and Solaris if the locale is C
2012-12-04 02:23:05 python-dev set nosy: + python-dev; messages: + msg176869
2012-11-26 17:56:28 vstinner set messages: + msg176436
2012-11-26 17:38:26 jcea set messages: + msg176434
2012-11-12 14:49:40 jcea set nosy: + jcea
2012-11-12 14:40:44 vstinner set files: + force_ascii.patch; messages: + msg175446
2012-11-11 23:54:58 vstinner set messages: + msg175410
2012-11-11 23:34:24 vstinner set files: + workaround_codeset.patch; keywords: + patch; messages: + msg175408
2012-11-11 22:14:15 vstinner create