Decode command line arguments from ASCII on FreeBSD and Solaris if the locale is C #60659

vstinner · 2012-11-11T22:14:15Z

BPO	16455
Nosy	@jcea, @vstinner, @ezio-melotti
Files	workaround_codeset.patch force_ascii.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2013-01-03.01:07:03.240>
created_at = <Date 2012-11-11.22:14:15.060>
labels = ['expert-unicode']
title = 'Decode command line arguments from ASCII on FreeBSD and Solaris if the locale is C'
updated_at = <Date 2013-01-03.01:07:03.239>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2013-01-03.01:07:03.239>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2013-01-03.01:07:03.240>
closer = 'vstinner'
components = ['Unicode']
creation = <Date 2012-11-11.22:14:15.060>
creator = 'vstinner'
dependencies = []
files = ['27965', '27970']
hgrepos = []
issue_num = 16455
keywords = ['patch']
message_count = 11.0
messages = ['175401', '175408', '175410', '175446', '176434', '176436', '176869', '176870', '176871', '178864', '178866']
nosy_count = 4.0
nosy_names = ['jcea', 'vstinner', 'ezio.melotti', 'python-dev']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue16455'
versions = ['Python 3.2', 'Python 3.3', 'Python 3.4']

vstinner · 2012-11-11T22:14:15Z

On FreeBSD and OpenIndiana, sys.getfilesystemencoding() is 'ascii' when the locale is not set, whereas the locale encoding is ISO-8859-1.

This inconsistency causes various issues. For example, os.fsencode(sys.argv[1]) fails if the argument is not ASCII, because sys.argv is decoded from the locale encoding (by _Py_char2wchar()).
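A minimal sketch of this failure, simulated in pure Python so that it runs on any platform. It assumes that os.fsencode() on POSIX behaves like str.encode(filesystem_encoding, 'surrogateescape'); the variable names are illustrative only.

```python
# Bytes received from the OS for sys.argv[1] (illustrative value).
raw_arg = b"caf\xe9"

# Roughly what happens to argv when mbstowcs() behaves like ISO-8859-1:
# byte 0xE9 becomes U+00E9.
decoded = raw_arg.decode("iso8859-1")

# Roughly what os.fsencode() does when the filesystem encoding is 'ascii'.
# U+00E9 is neither ASCII nor an escaped surrogate, so encoding fails.
try:
    decoded.encode("ascii", "surrogateescape")
    failed = False
except UnicodeEncodeError:
    failed = True

print(failed)
```

The two steps use different encodings, so a non-ASCII argument cannot be turned back into the original bytes.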

sys.getfilesystemencoding() is 'ascii' because nl_langinfo(CODESET) is used to get the locale encoding and nl_langinfo(CODESET) announces ASCII (or an alias of this encoding).

Python should detect this case and set sys.getfilesystemencoding() to 'iso8859-1' if the locale encoding is 'iso8859-1' while nl_langinfo(CODESET) announces ASCII. For example, we can decode b'\xe9' with mbstowcs() and check whether it fails or whether the result is U+00E9.

vstinner · 2012-11-11T23:34:24Z

The attached patch works around the CODESET issue on OpenIndiana and FreeBSD. If the LC_CTYPE locale is "C" and nl_langinfo(CODESET) returns ASCII (or an alias of this encoding), b"\xE9" is decoded from the locale encoding: if the result is U+00E9, the patched Python uses ISO-8859-1. (If decoding fails, the locale encoding is really ASCII and the workaround is not used.)

If the result is different (b'\xe9' is not decoded from the locale encoding to U+00E9), a ValueError is raised. I wrote this test to detect bugs. I hope that our buildbots will validate the code. We may choose a different behaviour (e.g. keep ASCII).
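The three outcomes described above can be sketched as follows. This is not the patch itself (which is written in C); choose_encoding and decode_byte are hypothetical names, and decode_byte stands in for decoding a single byte with mbstowcs(). The sketch assumes the preconditions already hold (LC_CTYPE is "C" and nl_langinfo(CODESET) announces ASCII).

```python
def choose_encoding(decode_byte):
    # decode_byte is a hypothetical stand-in for mbstowcs(): it returns a
    # str, or raises UnicodeDecodeError if the byte cannot be decoded.
    try:
        result = decode_byte(b"\xe9")
    except UnicodeDecodeError:
        # The locale encoding really is ASCII: the workaround is not used.
        return "ascii"
    if result == "\u00e9":
        # 0xE9 decoded to U+00E9: the locale behaves like ISO-8859-1.
        return "iso8859-1"
    # Anything else is unexpected and reported loudly.
    raise ValueError("unexpected decoding of 0xE9: %r" % result)


print(choose_encoding(lambda b: b.decode("ascii")))
print(choose_encoding(lambda b: b.decode("iso8859-1")))
```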

Example on FreeBSD 8.2, original Python 3.4:

$ ./python
>>> import sys, locale
>>> sys.getfilesystemencoding()
'ascii'
>>> locale.getpreferredencoding()
'US-ASCII'

Example on FreeBSD 8.2, patched Python 3.4:

$ ./python 
>>> import sys, locale
>>> sys.getfilesystemencoding()
'iso8859-1'
>>> locale.getpreferredencoding()
'iso8859-1'

vstinner · 2012-11-11T23:54:59Z

Some tests are failing with the patch:

======================================================================
FAIL: test_undecodable_env (test.test_subprocess.POSIXProcessTestCase)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/usr/home/haypo/prog/python/default/Lib/test/test_subprocess.py", line 1606, in test_undecodable_env
    self.assertEqual(stdout.decode('ascii'), ascii(value))
AssertionError: "'abc\\xff'" != "'abc\\udcff'"
- 'abc\xff'
?      ^
+ 'abc\udcff'
?      ^^^

======================================================================
FAIL: test_strcoll_with_diacritic (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/usr/home/haypo/prog/python/default/Lib/test/test_locale.py", line 364, in test_strcoll_with_diacritic
    self.assertLess(locale.strcoll('\xe0', 'b'), 0)
AssertionError: 126 not less than 0

======================================================================
FAIL: test_strxfrm_with_diacritic (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/usr/home/haypo/prog/python/default/Lib/test/test_locale.py", line 367, in test_strxfrm_with_diacritic
    self.assertLess(locale.strxfrm('\xe0'), locale.strxfrm('b'))
AssertionError: '\xe0' not less than 'b'

vstinner · 2012-11-12T14:40:44Z

Hijacking locale.getpreferredencoding() may be dangerous. I attached a
new patch, force_ascii.patch, which uses a different approach: be
stricter than mbstowcs() and force the ASCII encoding when:

- the LC_CTYPE locale is C
- nl_langinfo(CODESET) is ASCII or an alias of ASCII
- mbstowcs() is able to decode non-ASCII characters
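A rough Python sketch of this check, reading the three conditions as a conjunction. The real patch is C code and is not shown in this thread; should_force_ascii and is_ascii_alias are hypothetical names, and Python's own codec alias table is used here only as a stand-in for whatever alias list the patch compares against.

```python
import codecs


def is_ascii_alias(codeset):
    # Use Python's codec registry to normalize names such as "US-ASCII"
    # or "646" (assumption: the patch uses its own list of aliases).
    try:
        return codecs.lookup(codeset).name == "ascii"
    except LookupError:
        return False


def should_force_ascii(lc_ctype, codeset, decodes_non_ascii):
    # lc_ctype: name of the LC_CTYPE locale, e.g. "C"
    # codeset: value announced by nl_langinfo(CODESET)
    # decodes_non_ascii: True if mbstowcs() accepts a non-ASCII byte
    return lc_ctype == "C" and is_ascii_alias(codeset) and decodes_non_ascii


print(should_force_ascii("C", "US-ASCII", True))
print(should_force_ascii("C", "UTF-8", True))
```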


jcea · 2012-11-26T17:38:27Z

Victor, any progress on this?

vstinner · 2012-11-26T17:56:28Z

> Victor, any progress on this?

We have two options, I don't know which one is the best (safer). Does
the terminal handle non-ASCII characters with a C locale on FreeBSD or
Solaris?

python-dev · 2012-12-04T02:23:05Z

New changeset c25635b137cc by Victor Stinner in branch 'default':
Issue bpo-16455: On FreeBSD and Solaris, if the locale is C, the
http://hg.python.org/cpython/rev/c25635b137cc

vstinner · 2012-12-04T02:30:31Z

> We have two options, I don't know which one is the best (safer).

Forcing ASCII is safer. Python should announce that it does not "understand" non-ASCII bytes on the command line. I also chose this option because isalpha(0xe9) returns 0 (even if mbstowcs(0xe9) returns L"\xe9"): FreeBSD doesn't consider U+00E9 as a letter in the C locale, so Python should also consider this byte as raw data.
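A small illustration of what "raw data" means on the Python side, assuming the arguments are decoded from ASCII with the surrogateescape error handler (the handler Python uses for undecodable bytes in command line arguments). The byte value is illustrative.

```python
raw_arg = b"caf\xe9"

# With ASCII + surrogateescape, byte 0xE9 is not interpreted as a letter:
# it is escaped as the lone surrogate U+DCE9.
decoded = raw_arg.decode("ascii", "surrogateescape")
print(ascii(decoded))

# The escaped character is not alphabetic, matching isalpha(0xe9) == 0
# in the C locale.
print(decoded[-1].isalpha())

# Encoding with the same encoding and handler restores the original bytes.
roundtrip = decoded.encode("ascii", "surrogateescape")
print(roundtrip == raw_arg)
```

Because decoding and encoding now use the same encoding, the original bytes survive the round trip unchanged.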

vstinner · 2012-12-04T02:32:10Z

> New changeset c25635b137cc by Victor Stinner in branch 'default':
> Issue bpo-16455: On FreeBSD and Solaris, if the locale is C, the
> http://hg.python.org/cpython/rev/c25635b137cc

This changeset should fix bpo-16218 on FreeBSD and Solaris (these operating systems should now correctly decode undecodable command line arguments).

python-dev · 2013-01-03T00:24:06Z

New changeset c256764e2b3f by Victor Stinner in branch '3.2':
Issue bpo-16455: On FreeBSD and Solaris, if the locale is C, the
http://hg.python.org/cpython/rev/c256764e2b3f

New changeset 5bb289e4fb35 by Victor Stinner in branch '3.3':
(Merge 3.2) Issue bpo-16455: On FreeBSD and Solaris, if the locale is C, the
http://hg.python.org/cpython/rev/5bb289e4fb35

vstinner · 2013-01-03T00:41:24Z

I backported the fix to Python 3.2 and 3.3 because I consider it important enough.

vstinner added the topic-unicode label Nov 11, 2012

vstinner changed the title ~~sys.getfilesystemencoding() is not the locale encoding on FreeBSD and OpenSolaris when the locale is not set~~ Decode command line arguments from ASCII on FreeBSD and Solaris if the locale is C Dec 4, 2012

vstinner closed this as completed Jan 3, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022
