Decode command line arguments from ASCII on FreeBSD and Solaris if the locale is C #60659
Comments
On FreeBSD and OpenIndiana, sys.getfilesystemencoding() is 'ascii' when the locale is not set, whereas the locale encoding is ISO-8859-1. This inconsistency causes various issues. For example, os.fsencode(sys.argv[1]) fails if the argument is not ASCII, because sys.argv is decoded from the locale encoding (by _Py_char2wchar()).
sys.getfilesystemencoding() is 'ascii' because nl_langinfo(CODESET) is used to get the locale encoding, and nl_langinfo(CODESET) announces ASCII (or an alias of this encoding).
Python should detect this case and set sys.getfilesystemencoding() to 'iso8859-1' if the locale encoding is 'iso8859-1' whereas nl_langinfo(CODESET) announces ASCII. For example, we can decode b'\xe9' with mbstowcs() and check whether it fails or whether the result is U+00E9. |
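The mismatch described above can be reproduced on any platform with plain codec calls. This is only a sketch: `fsencode_ascii` is a hypothetical stand-in for os.fsencode() on a system where sys.getfilesystemencoding() is 'ascii', and the ISO-8859-1 decode stands in for what _Py_char2wchar() does under the C locale on the affected systems.

```python
# Raw bytes of a non-ASCII command line argument (0xE9 is "é" in ISO-8859-1).
raw_arg = b"caf\xe9"

# Stand-in for how sys.argv is decoded on the affected platforms:
# the C library really decodes bytes as ISO-8859-1.
decoded = raw_arg.decode("iso8859-1")


def fsencode_ascii(text):
    # Hypothetical stand-in for os.fsencode() when the filesystem
    # encoding is 'ascii' (with the surrogateescape error handler).
    return text.encode("ascii", "surrogateescape")


try:
    fsencode_ascii(decoded)
    failed = False
except UnicodeEncodeError:
    # U+00E9 is neither ASCII nor a surrogate escape, so encoding fails.
    failed = True

print(failed)
```

The failure comes from using two different encodings for the two directions: the argument was decoded with one codec and is encoded back with another.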
Attached patch works around the CODESET issue on OpenIndiana and FreeBSD. If the LC_CTYPE locale is "C" and nl_langinfo(CODESET) returns ASCII (or an alias of this encoding), b"\xE9" is decoded from the locale encoding: if the result is U+00E9, the patched Python uses ISO-8859-1. (If decoding fails, the locale encoding really is ASCII and the workaround is not used.) If the result is different (b'\xe9' is not decoded from the locale encoding to U+00E9), a ValueError is raised. I wrote this test to detect bugs. I hope that our buildbots will validate the code. We may choose a different behaviour (e.g. keep ASCII).
Example on FreeBSD 8.2, original Python 3.4:
$ ./python
>>> import sys, locale
>>> sys.getfilesystemencoding()
'ascii'
>>> locale.getpreferredencoding()
'US-ASCII'
Example on FreeBSD 8.2, patched Python 3.4:
$ ./python
>>> import sys, locale
>>> sys.getfilesystemencoding()
'iso8859-1'
>>> locale.getpreferredencoding()
'iso8859-1' |
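The decision the patch makes can be sketched in Python as follows. This is not the patch itself (which works at the C level): `choose_encoding`, its parameters and the `ASCII_ALIASES` set are hypothetical names for illustration, and the alias list is an assumption rather than the list the patch checks. The `probe_result` argument represents the outcome of decoding b"\xE9" from the locale encoding, with None meaning that decoding failed.

```python
# Assumed aliases under which nl_langinfo(CODESET) may announce ASCII.
ASCII_ALIASES = {"ascii", "us-ascii", "646", "ansi_x3.4-1968"}


def choose_encoding(ctype_locale, codeset, probe_result):
    """Return the encoding to use, following the workaround's rules.

    ctype_locale: name of the LC_CTYPE locale, e.g. "C".
    codeset: value announced by nl_langinfo(CODESET).
    probe_result: string obtained by decoding the byte 0xE9 from the
        locale encoding, or None if that decoding failed.
    """
    # The workaround only applies to the C locale announcing ASCII.
    if ctype_locale != "C" or codeset.lower() not in ASCII_ALIASES:
        return codeset
    if probe_result is None:
        # Decoding failed: the locale encoding really is ASCII.
        return "ascii"
    if probe_result == "\xe9":
        # The C library silently decodes bytes as ISO-8859-1.
        return "iso8859-1"
    # Anything else is unexpected and reported as a bug.
    raise ValueError("unexpected result decoding 0xE9: %a" % probe_result)


print(choose_encoding("C", "US-ASCII", "\xe9"))
print(choose_encoding("C", "US-ASCII", None))
```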
Some tests are failing with the patch:
======================================================================
Traceback (most recent call last):
  File "/usr/home/haypo/prog/python/default/Lib/test/test_subprocess.py", line 1606, in test_undecodable_env
    self.assertEqual(stdout.decode('ascii'), ascii(value))
AssertionError: "'abc\\xff'" != "'abc\\udcff'"
- 'abc\xff'
?      ^
+ 'abc\udcff'
?      ^^^
======================================================================
Traceback (most recent call last):
  File "/usr/home/haypo/prog/python/default/Lib/test/test_locale.py", line 364, in test_strcoll_with_diacritic
    self.assertLess(locale.strcoll('\xe0', 'b'), 0)
AssertionError: 126 not less than 0
======================================================================
Traceback (most recent call last):
  File "/usr/home/haypo/prog/python/default/Lib/test/test_locale.py", line 367, in test_strxfrm_with_diacritic
    self.assertLess(locale.strxfrm('\xe0'), locale.strxfrm('b'))
AssertionError: '\xe0' not less than 'b' |
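The two strings in the first assertion are what the same raw bytes look like under the two candidate encodings. The sketch below assumes that the failure comes from the patched interpreter decoding the undecodable environment value as ISO-8859-1 instead of ASCII with surrogateescape; it does not attempt to explain the two test_locale failures.

```python
# An environment value containing a byte that is not valid ASCII.
raw = b"abc\xff"

# Patched behaviour: every byte maps to the code point of the same value.
as_latin1 = raw.decode("iso8859-1")

# Behaviour the test expects: undecodable bytes become lone surrogates.
as_ascii = raw.decode("ascii", "surrogateescape")

print(ascii(as_latin1))
print(ascii(as_ascii))
```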
Hijacking locale.getpreferredencoding() is maybe dangerous. I attached a
2012/11/12 STINNER Victor <report@bugs.python.org>
|
Victor, any progress on this? |
We have two options, I don't know which one is the best (safer). Does |
New changeset c25635b137cc by Victor Stinner in branch 'default': |
Forcing ASCII is safer. Python should announce that it does not "understand" non-ASCII bytes on the command line. I also chose this option because isalpha(0xe9) returns 0 (even if mbstowcs(0xe9) returns L"\xe9"): FreeBSD doesn't consider U+00E9 as a letter in the C locale, so Python should also consider this byte as raw data. |
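A short sketch of what treating the byte as raw data means on the Python side, using only the standard 'ascii' codec with the surrogateescape error handler (the variable names are illustrative):

```python
raw_arg = b"caf\xe9"

# With ASCII forced, the non-ASCII byte is kept as a lone surrogate.
text = raw_arg.decode("ascii", "surrogateescape")

# Encoding with the same codec and error handler restores the bytes.
round_trip = text.encode("ascii", "surrogateescape")

# The escaped byte is not a letter, unlike U+00E9 from ISO-8859-1.
escaped_is_letter = text[-1].isalpha()
latin1_is_letter = raw_arg.decode("iso8859-1")[-1].isalpha()

print(ascii(text), round_trip == raw_arg)
print(escaped_is_letter, latin1_is_letter)
```

Because decoding and encoding now use the same codec, the original bytes survive the round trip, and Python's str methods agree with the C library that the byte is not alphabetic.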
This changeset should fix bpo-16218 on FreeBSD and Solaris (these OSes should now correctly decode undecodable command line arguments). |
New changeset c256764e2b3f by Victor Stinner in branch '3.2':
New changeset 5bb289e4fb35 by Victor Stinner in branch '3.3': |
I backported the fix to Python 3.2 and 3.3 because I consider it important enough. |