classification
Title: argv double encoding on OSX
Type: behavior Stage: resolved
Components: Interpreter Core, macOS, Unicode Versions: Python 3.1, Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ronaldoussoren Nosy List: ezio.melotti, piro, r.david.murray, ronaldoussoren, vstinner
Priority: normal Keywords: patch

Created on 2010-07-05 16:07 by piro, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
test-argv.patch piro, 2010-07-06 09:43
Messages (15)
msg109333 - (view) Author: Daniele Varrazzo (piro) * Date: 2010-07-05 16:07
It looks like the wchar_t* array returned by Py_GetArgcArgv() on OSX suffers from double encoding. This can affect sys.argv, sys.executable and, of course, any C code relying on that function.

On Linux:

$ python3
Python 3.0rc1+ (py3k, Oct 28 2008, 09:22:29) 
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> snowman = '\u2603'
>>> os.system(sys.executable + " -c 'import sys; [print(a.encode(\"utf8\")) for a in sys.argv]' foo bar " + snowman)
b'-c'
b'foo'
b'bar'
b'\xe2\x98\x83'
0

On OSX (uname -a is Darwin comicbookguy.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386)

$ python3
Python 3.1.2 (r312:79147, Jul  5 2010, 11:57:14) 
[GCC 4.2.1 (Apple Inc. build 5659)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> snowman = '\u2603'
>>> os.system(sys.executable + " -c 'import sys; [print(a.encode(\"utf8\")) for a in sys.argv]' foo bar " + snowman)
b'-c'
b'foo'
b'bar'
b'\xc3\xa2\xc2\x98\xc2\x83'
0

Is this a known limitation of the platform? I don't know much about OSX; I just found this while testing for regressions in setproctitle <http://code.google.com/p/py-setproctitle/>

It is reported to work correctly on Windows.
msg109367 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2010-07-06 07:24
I cannot reproduce this with either 3.1.2 or 3.2a (py3k:80693); in both cases I get the same output as you do on Linux.  This is on OSX 10.6 though, I haven't tested on 10.4 yet.

What is the output of the locale command on your OSX system? Mine says:


$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

And what is the value of "__CF_USER_TEXT_ENCODING"? Mine is:

$ echo ${__CF_USER_TEXT_ENCODING}
0x1F6:0:0
msg109368 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2010-07-06 07:25
BTW, my 3.1 build is release31-maint:80235M, which is slightly newer than the 3.1.2 release.
msg109377 - (view) Author: Daniele Varrazzo (piro) * Date: 2010-07-06 09:43
Attached patch with test cases to check sys.argv and sys.executable.

The tests fail against the daily snapshot, so adding python 3.2 to the affected versions.

Variable __CF_USER_TEXT_ENCODING is undefined. Locale of the system is C:

$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
msg109386 - (view) Author: Daniele Varrazzo (piro) * Date: 2010-07-06 12:16
I've run some more tests with LANG=C on other platforms. It seems to result in a clean error on Linux:

$ LANG=C ./here/bin/python3
Python 3.2a0 (py3k, Jul  6 2010, 12:40:29) 
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, os
>>> snowman = '\u2603'
>>> os.system((sys.executable + " -c 'import sys; print(sys.argv[-1].encode(\"utf8\"))' " + snowman).encode(sys.getdefaultencoding()))
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 0: surrogates not allowed
256

Notice that I had to use an explicit encoding or os.system would have tried to encode using ascii and barf, probably because of bug #8775.
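
The '\udce2' in the traceback above is consistent with Python's surrogateescape error handler (PEP 383), which maps each undecodable byte 0xNN to the lone surrogate U+DCNN. The following is a minimal sketch of that behaviour, assuming the argument bytes were decoded as ASCII with surrogateescape; it does not reproduce the interpreter's actual argv decoding code.

```python
# Sketch: undecodable argv bytes under an ASCII (LANG=C) locale.
snowman_bytes = '\u2603'.encode('utf-8')   # b'\xe2\x98\x83'

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
escaped = snowman_bytes.decode('ascii', errors='surrogateescape')
print(ascii(escaped))                      # '\udce2\udc98\udc83'

# A strict UTF-8 encode refuses lone surrogates, as in the traceback.
try:
    escaped.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# Encoding with the same error handler restores the original bytes.
restored = escaped.encode('ascii', errors='surrogateescape')
```

Because the escape is reversible, the original bytes are not lost; only a strict encode of the escaped string fails.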

I've also been told about issue #4388: I've checked, and test_run_code() fails as described there. So I think this bug can be considered a duplicate of #4388.
msg111327 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2010-07-23 14:17
Daniele: which version of OSX do you use?  And if you use OSX 10.5 or 10.6: which is your system language according to System Preferences (the topmost entry in the list in the "Language and Text" preference pane, whose icon looks a little like a UN flag)?

I can only reproduce this by explicitly setting LANG=C before running the test on OSX 10.6 (with English as the main language)

This may be very hard to fix. What happens is that subprocess.Popen converts the argument array into the filesystem encoding (which on OSX is always UTF-8). The argv decoder then decodes the arguments using the encoding specified in LANG, which on your system is different from UTF-8. This results in a string where each byte in the UTF-8 encoding of snowman is represented as a single character. Those characters are then encoded as UTF-8 by the test, and that results in the error you're seeing.

That is, the output looks like the output of this code:

>>> snowman = '\u2603'
>>> snowman.encode('utf-8').decode('latin1').encode('utf-8')
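
As a rough illustration of the chain above, the sketch below reproduces the double-encoded bytes from the original report and shows that the mistake can be reversed. The helper names double_encode and undo_double_encode are hypothetical, and latin1 stands in for whatever single-byte encoding the C locale maps to on the affected system.

```python
def double_encode(text, wrong_encoding='latin1'):
    """Encode as UTF-8, misread the bytes in another encoding, encode again."""
    utf8_bytes = text.encode('utf-8')
    misread = utf8_bytes.decode(wrong_encoding)   # one character per byte
    return misread.encode('utf-8')

def undo_double_encode(data, wrong_encoding='latin1'):
    """Reverse double_encode(), assuming the same wrong encoding was used."""
    return data.decode('utf-8').encode(wrong_encoding).decode('utf-8')

garbled = double_encode('\u2603')
print(garbled)                      # b'\xc3\xa2\xc2\x98\xc2\x83'
recovered = undo_double_encode(garbled)
```

The garbled bytes match the last line of the OSX output in the first message, which supports the diagnosis that the arguments were decoded with a non-UTF-8 encoding.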
msg111342 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2010-07-23 15:01
Daniele: never mind, you already said you are on OSX 10.4.

The current behavior is only a problem when the system default encoding as implied by LANG is different from the filesystem encoding.

How to fix this is an entirely different question. Most (all?) Unix tools just work with byte strings and pass them through unmodified. This means that with something like:

   subprocess.Popen(['ls', snowman])

The snowman character should be encoded using the filesystem encoding, as that is the bytestring that the C APIs that ls calls expect.

Note that encoding using the preferred encoding would result in an exception, as the snowman character cannot be encoded in ASCII or even latin1.
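
To check whether a given system is in the mismatched state described above, one can compare the two encodings directly. This is only a sketch: encoding_report and can_encode are hypothetical helper names, and the values reported depend entirely on the platform, the locale settings and the Python version in use.

```python
import codecs
import locale
import sys

def encoding_report():
    """Compare the filesystem encoding with the locale's preferred encoding."""
    # codecs.lookup() normalizes aliases such as 'UTF-8' and 'utf8'.
    fs = codecs.lookup(sys.getfilesystemencoding()).name
    preferred = codecs.lookup(locale.getpreferredencoding(False)).name
    return {'filesystem': fs, 'preferred': preferred, 'mismatch': fs != preferred}

def can_encode(text, encoding):
    """Return True if text can be strictly encoded with the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

report = encoding_report()
print(report)

snowman = '\u2603'
# The snowman fits in UTF-8 but not in ASCII or latin1.
encodable = {name: can_encode(snowman, name) for name in ('ascii', 'latin1', 'utf-8')}
print(encodable)
```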

A possible workaround is to use CFStringGetSystemEncoding from CoreFoundation to get the system encoding when LANG=C (probably guarded so that it is only active on OSX releases before 10.5).

Another workaround: upgrade from OSX 10.4 to at least OSX 10.5 ;-)
msg111402 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-07-24 00:01
> This may be very hard to fix

I wrote a patch to fix this problem: see #8775.
msg111470 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2010-07-24 12:47
Using the CF API to fetch the system encoding won't work:

Using PyObjC:
>>> CFStringConvertEncodingToIANACharSetName(CFStringGetSystemEncoding())
u'macintosh'

There doesn't seem to be another way to extract the preferred encoding from the system.

I see two possible resolutions for this issue:

* Close as won't fix
  This is technically a platform issue that has been fixed in OSX 10.5

* Add a workaround that explicitly sets os.environ['LANG'] to
  'en_US.UTF-8' before converting argument and environment values
  to Unicode (only on OSX < 10.5, when LANG=C, and of course resetting
  the previous value after conversion)

I have a 10.4 system I could develop this on, but that's currently in a different country than me.
msg111565 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-07-25 22:23
Issue #8622 proposes the creation of an environment variable PYTHONFSENCODING. It will be used to set sys.getfilesystemencoding(). Would it help this issue?
msg111602 - (view) Author: Daniele Varrazzo (piro) * Date: 2010-07-26 11:38
Ronald,

Thank you for your interest. For me, trying to deal with such a tricky issue on a system whose Best Before date has already passed would be a waste of time.

I was only interested in separating the bugs in my extension module from the ones that are not my responsibility, and I had the bad luck to find a 10.4 system to test on. I don't have a direct interest in getting this bug fixed.

Thank you very much again for your time.
msg119254 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-21 00:54
I just closed #4388 with r85765 (Python 3.2): always use UTF-8 to decode the command line arguments on Mac OS X, not the locale encoding.

I suppose that it does fix this issue. Can someone check that?
msg119262 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2010-10-21 05:51
Thank you. I'll check, but probably only sometime next week.
msg119358 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-22 01:03
rdmurray@buddy:~/python/py3k>uname -a
Darwin buddy.home.bitdance.com 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386


rdmurray@buddy:~/python/release31-maint>LC_ALL="C" ./python.exe
Python 3.1.2 (release31-maint:85783, Oct 21 2010, 20:31:06)
[GCC 4.2.1 (Apple Inc. build 5659)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> snowman = '\u2603'
>>> os.system(sys.executable + " -c 'import sys; [print(a.encode(\"utf8\")) for a in sys.argv]' foo bar " + snowman)
b'-c'
b'foo'
b'bar'
b'\xc3\xa2\xc2\x98\xc2\x83'
0


rdmurray@buddy:~/python/py3k>LC_ALL="C" ./python.exe 
Python 3.2a3+ (py3k:85768, Oct 21 2010, 12:31:12) 
[GCC 4.2.1 (Apple Inc. build 5659)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> snowman = '\u2603'
>>> os.system(sys.executable + " -c 'import sys; [print(a.encode(\"utf8\")) for a in sys.argv]' foo bar " + snowman)
b'-c'
b'foo'
b'bar'
b'\xe2\x98\x83'
0
msg119370 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-22 08:58
FYI, you should use ascii() instead of a.encode(\"utf8\") to dump arguments. It's easier to check '\u2603' than b'\xe2\x98\x83' for me :-)
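
A small sketch of that suggestion, comparing how ascii() renders a correctly decoded argument and a double-decoded one (the latin1 step is an assumption that mirrors the diagnosis earlier in this thread):

```python
good = '\u2603'
# Simulate the bug: UTF-8 bytes misread as latin1, one character per byte.
bad = good.encode('utf-8').decode('latin1')

print(ascii(good))   # '\u2603'
print(ascii(bad))    # '\xe2\x98\x83'
```

With ascii() the correct result is a single \u2603 escape, so a garbled argument stands out as three separate \x escapes.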

So the bug is fixed in Python 3.2, great! I was thinking that we need a test for that, but then I remembered that I already wrote such a test :-) My test checks 3 Unicode characters: \xe9, \u20ac, \U0010ffff; but also invalid byte sequences:

text = (
  b'\xff'         # invalid byte
  b'\xc3\xa9'     # valid utf-8 character
  b'\xc3\xff'     # invalid byte sequence
  b'\xed\xa0\x80' # lone surrogate character (invalid)
)

And it should be enough :-) See test_osx_utf8() of test_cmd_line to see the whole test.
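
For readers who want to experiment with those byte sequences, here is a minimal sketch using Python's generic UTF-8 codec with the surrogateescape handler. It is not the code of test_osx_utf8(), and it assumes that the interpreter's own argument decoder on Mac OS X behaves like this codec, which may not hold in every detail.

```python
text = (
    b'\xff'          # invalid byte
    b'\xc3\xa9'      # valid UTF-8 character (U+00E9)
    b'\xc3\xff'      # invalid byte sequence
    b'\xed\xa0\x80'  # lone surrogate character (invalid)
)

# Valid sequences decode normally; each invalid byte becomes a lone surrogate.
decoded = text.decode('utf-8', errors='surrogateescape')
print(ascii(decoded))

# Encoding with the same handler gives back the original bytes unchanged.
roundtrip = decoded.encode('utf-8', errors='surrogateescape')
```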
History
Date User Action Args
2022-04-11 14:57:03 admin set github: 53413
2010-10-22 08:58:23 vstinner set messages: + msg119370
2010-10-22 01:03:08 r.david.murray set status: open -> closed; nosy: + r.david.murray; messages: + msg119358; resolution: fixed; stage: test needed -> resolved
2010-10-21 05:51:14 ronaldoussoren set messages: + msg119262
2010-10-21 00:54:28 vstinner set messages: + msg119254
2010-07-26 11:38:12 piro set messages: + msg111602
2010-07-25 22:23:49 vstinner set messages: + msg111565
2010-07-24 12:47:01 ronaldoussoren set messages: + msg111470
2010-07-24 00:01:38 vstinner set messages: + msg111402
2010-07-23 15:01:24 ronaldoussoren set messages: + msg111342
2010-07-23 14:17:22 ronaldoussoren set messages: + msg111327
2010-07-06 12:16:51 piro set messages: + msg109386
2010-07-06 09:43:12 piro set files: + test-argv.patch; keywords: + patch; messages: + msg109377; versions: + Python 3.2
2010-07-06 07:25:42 ronaldoussoren set messages: + msg109368
2010-07-06 07:24:50 ronaldoussoren set messages: + msg109367
2010-07-05 16:47:40 ezio.melotti set nosy: + ezio.melotti, ronaldoussoren; assignee: ronaldoussoren; components: + macOS, Unicode; stage: test needed
2010-07-05 16:32:48 r.david.murray set nosy: + vstinner
2010-07-05 16:07:52 piro create