Issue 9167: argv double encoding on OSX



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53413

classification

Title:	argv double encoding on OSX
Type:	behavior	Stage:	resolved
Components:	Interpreter Core, macOS, Unicode	Versions:	Python 3.1, Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	ronaldoussoren	Nosy List:	ezio.melotti, piro, r.david.murray, ronaldoussoren, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-07-05 16:07 by piro, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
test-argv.patch	piro, 2010-07-06 09:43

Messages (15)
msg109333 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2010-07-05 16:07
Looks like the wchar_t* array returned by Py_GetArgcArgv() on OSX suffers from double encoding. This can affect sys.argv, sys.executable and, of course, C code relying on the above function.

On Linux:

$ python3
Python 3.0rc1+ (py3k, Oct 28 2008, 09:22:29)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> snowman = '\u2603'
>>> os.system(sys.executable + " -c 'import sys; [print(a.encode(\"utf8\")) for a in sys.argv]' foo bar " + snowman)
b'-c'
b'foo'
b'bar'
b'\xe2\x98\x83'
0

On OSX (uname -a is Darwin comicbookguy.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386):

$ python3
Python 3.1.2 (r312:79147, Jul 5 2010, 11:57:14)
[GCC 4.2.1 (Apple Inc. build 5659)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> snowman = '\u2603'
>>> os.system(sys.executable + " -c 'import sys; [print(a.encode(\"utf8\")) for a in sys.argv]' foo bar " + snowman)
b'-c'
b'foo'
b'bar'
b'\xc3\xa2\xc2\x98\xc2\x83'
0

Is this a known limitation of the platform? I don't know much about OSX; I just found it while testing for regressions in setproctitle <http://code.google.com/p/py-setproctitle/>.

Reported as working correctly on Windows.
msg109367 - (view)	Author: Ronald Oussoren (ronaldoussoren) *	Date: 2010-07-06 07:24
I cannot reproduce this with either 3.1.2 or 3.2a (py3k:80693); in both cases I get the same output as you do on Linux. This is on OSX 10.6 though, I haven't tested on 10.4 yet.

What is the output of the locale command on your OSX system? Mine says:

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

And what is the value of "__CF_USER_TEXT_ENCODING"? Mine is:

$ echo ${__CF_USER_TEXT_ENCODING}
0x1F6:0:0
msg109368 - (view)	Author: Ronald Oussoren (ronaldoussoren) *	Date: 2010-07-06 07:25
BTW, my 3.1 build is release31-maint:80235M, which is slightly newer than the 3.1.2 release.
msg109377 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2010-07-06 09:43
Attached patch with test cases to check sys.argv and sys.executable. The tests fail against the daily snapshot, so adding python 3.2 to the affected versions.

Variable __CF_USER_TEXT_ENCODING is undefined. Locale of the system is C:

$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
msg109386 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2010-07-06 12:16
I've made some other tests with LANG=C on other platforms. It seems to result in a clean error on Linux:

$ LANG=C ./here/bin/python3
Python 3.2a0 (py3k, Jul 6 2010, 12:40:29)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, os
>>> snowman = '\u2603'
>>> os.system((sys.executable + " -c 'import sys; print(sys.argv[-1].encode(\"utf8\"))' " + snowman).encode(sys.getdefaultencoding()))
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 0: surrogates not allowed
256

Notice that I had to use an explicit encoding, or os.system would have tried to encode using ascii and barfed, probably because of bug #8775.

I've also been told about issue #4388: I've checked and test_run_code() fails as described. So I think this bug can be considered a duplicate of #4388.
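The traceback above is consistent with the child interpreter decoding its arguments as ASCII with the 'surrogateescape' error handler, then failing when the result is encoded with the strict UTF-8 codec. The following is only a sketch of that behaviour on plain bytes, not the interpreter's actual argv code path; the variable names are made up for illustration.

```python
# Bytes the child process would receive for the snowman argument (UTF-8).
raw = b'\xe2\x98\x83'

# Assumption: under the C locale the bytes are decoded as ASCII with the
# 'surrogateescape' handler, so each undecodable byte becomes a lone
# surrogate in the range U+DC80..U+DCFF instead of raising an error.
decoded = raw.decode('ascii', 'surrogateescape')
print(ascii(decoded))

# Encoding those surrogates with the strict UTF-8 codec fails, which is
# the kind of error shown in the traceback.
error = None
try:
    decoded.encode('utf-8')
except UnicodeEncodeError as exc:
    error = exc
    print('strict UTF-8 failed:', exc.reason)

# Re-encoding with the same error handler restores the original bytes.
roundtrip = decoded.encode('ascii', 'surrogateescape')
print(roundtrip == raw)
```

The round trip at the end shows that no information is lost by the decoding step itself; the failure only appears when a strict codec is applied afterwards.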
msg111327 - (view)	Author: Ronald Oussoren (ronaldoussoren) *	Date: 2010-07-23 14:17
Daniele: which version of OSX do you use? And if you use OSX 10.5 or 10.6: which is your system language according to system preferences (the topmost entry in the list of the "Language and Text" preference pane, whose icon looks a little like a UN flag)?

I can only reproduce this by explicitly setting LANG=C before running the test on OSX 10.6 (with English as the main language).

This may be very hard to fix. What happens is that subprocess.Popen converts the argument array into the filesystem encoding (which on OSX is always UTF-8). The argv decoder then decodes the arguments using the encoding specified in LANG, which on your system is different from UTF-8. This results in a string where each byte in the UTF-8 encoding of snowman is represented as a single character. Those characters are then encoded as UTF-8 by the test, and that results in the error you're seeing.

That is, the output looks like the output of this code:

>>> snowman = '\u2603'
>>> snowman.encode('utf-8').decode('latin1').encode('utf-8')
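For reference, the chain of conversions described above can be run step by step. This is a minimal sketch that uses latin1 as a stand-in for whatever the C-locale decoder did on the affected system (that choice follows the snippet in the message and is an assumption); the variable names are illustrative only.

```python
snowman = '\u2603'

# Step 1: the parent encodes the argument with the filesystem encoding.
utf8_bytes = snowman.encode('utf-8')          # three bytes: e2 98 83

# Step 2: the child decodes each byte as one character (latin1 stand-in).
mis_decoded = utf8_bytes.decode('latin1')     # three characters

# Step 3: the test encodes those characters as UTF-8 again.
double_encoded = mis_decoded.encode('utf-8')
print(double_encoded)

# The damage is reversible as long as the wrong codec maps bytes 1:1.
recovered = double_encoded.decode('utf-8').encode('latin1').decode('utf-8')
print(recovered == snowman)
```

The value of double_encoded is the same six-byte sequence reported in the original OSX session.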
msg111342 - (view)	Author: Ronald Oussoren (ronaldoussoren) *	Date: 2010-07-23 15:01
Daniele: never mind, you already said you are on OSX 10.4.

The current behavior is only a problem when the system default encoding as implied by LANG is different from the filesystem encoding.

How to fix this is an entirely different question: most (all?) unix tools just work with byte strings and pass those through unmodified. This means that with something like:

subprocess.Popen(['ls', snowman])

the snowman character should be encoded using the filesystem encoding, as that is the byte string that the C APIs that ls calls expect. Note that encoding using the preferred encoding would result in an exception, as the snowman character cannot be encoded in ASCII or even latin1.

A possible workaround is to use CFStringGetSystemEncoding from CoreFoundation to get the system encoding when LANG=C (and probably guarded so that it is active only on OSX releases before 10.5).

Another workaround: upgrade from OSX 10.4 to at least OSX 10.5 ;-)
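The point about the preferred encoding can be checked directly. The sketch below simply tries to encode the snowman with the three codecs mentioned in this thread; the results dictionary is a hypothetical helper for illustration.

```python
snowman = '\u2603'

# Try each codec; record None when the character cannot be represented.
results = {}
for encoding in ('ascii', 'latin-1', 'utf-8'):
    try:
        results[encoding] = snowman.encode(encoding)
    except UnicodeEncodeError:
        results[encoding] = None
    print(encoding, results[encoding])
```

Only UTF-8 can represent the character, which is why encoding the argument with an ASCII or latin1 preferred encoding cannot work for this input.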
msg111402 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-07-24 00:01
> This may be very hard to fix

I wrote a patch to fix this problem: see #8775.
msg111470 - (view)	Author: Ronald Oussoren (ronaldoussoren) *	Date: 2010-07-24 12:47
Using the CF API to fetch the system encoding won't work. Using PyObjC:

>>> CFStringConvertEncodingToIANACharSetName(CFStringGetSystemEncoding())
u'macintosh'

There doesn't seem to be another way to extract the preferred encoding from the system. I see two possible resolutions for this issue:

* Close as won't fix
  This is technically a platform issue that has been fixed in OSX 10.5

* Add a workaround that explicitly sets os.environ['LANG'] to 'en_US.UTF-8' before converting argument and environment values to Unicode (only on OSX < 10.5, when LANG=C, and of course resetting the previous value after conversion)

I have a 10.4 system I could develop this on, but that's currently in a different country than me.
msg111565 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-07-25 22:23
Issue #8622 proposes the creation of an environment variable PYTHONFSENCODING. It will be used to set sys.getfilesystemencoding(). Would it help this issue?
msg111602 - (view)	Author: Daniele Varrazzo (piro) *	Date: 2010-07-26 11:38
Ronald,

Thank you for the interest. For me, trying to deal with such a tricky issue on a system whose Best Before date has already passed would be a waste of time. I was only interested in separating the bugs in my extension module from the ones not under my responsibility, and I had the bad luck to find a 10.4 to test on.

I don't have a direct interest in this bug being fixed. Thank you very much again for your time.
msg119254 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-21 00:54
I just closed #4388 with r85765 (Python 3.2): always use UTF-8 to decode the command line arguments on Mac OS X, not the locale encoding. I suppose that it does fix this issue. Can someone check that?
msg119262 - (view)	Author: Ronald Oussoren (ronaldoussoren) *	Date: 2010-10-21 05:51
Thank you. I'll check, but probably only sometime next week.
msg119358 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-10-22 01:03
rdmurray@buddy:~/python/py3k>uname -a
Darwin buddy.home.bitdance.com 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386

rdmurray@buddy:~/python/release31-maint>LC_ALL="C" ./python.exe
Python 3.1.2 (release31-maint:85783, Oct 21 2010, 20:31:06)
[GCC 4.2.1 (Apple Inc. build 5659)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> snowman = '\u2603'
>>> os.system(sys.executable + " -c 'import sys; [print(a.encode(\"utf8\")) for a in sys.argv]' foo bar " + snowman)
b'-c'
b'foo'
b'bar'
b'\xc3\xa2\xc2\x98\xc2\x83'
0

rdmurray@buddy:~/python/py3k>LC_ALL="C" ./python.exe
Python 3.2a3+ (py3k:85768, Oct 21 2010, 12:31:12)
[GCC 4.2.1 (Apple Inc. build 5659)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> snowman = '\u2603'
>>> os.system(sys.executable + " -c 'import sys; [print(a.encode(\"utf8\")) for a in sys.argv]' foo bar " + snowman)
b'-c'
b'foo'
b'bar'
b'\xe2\x98\x83'
0
msg119370 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-10-22 08:58
FYI, you should use ascii() instead of a.encode(\"utf8\") to dump arguments. It's easier to check '\u2603' than b'\xe2\x98\x83' for me :-)

So the bug is fixed in Python 3.2, great! I was thinking that we need a test for that, but then I remembered that I already wrote such a test :-) My test checks 3 unicode characters: \xe9, \u20ac, \U0010ffff; but also invalid byte sequences:

text = (
    b'\xff'          # invalid byte
    b'\xc3\xa9'      # valid utf-8 character
    b'\xc3\xff'      # invalid byte sequence
    b'\xed\xa0\x80'  # lone surrogate character (invalid)
)

And it should be enough :-) See test_osx_utf8() of test_cmd_line to see the whole test.
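To see why those byte sequences are interesting, the sketch below decodes them as UTF-8 with the 'surrogateescape' error handler and dumps the result with ascii(), as suggested above. This is not the test from test_cmd_line; pairing UTF-8 with 'surrogateescape' here is an assumption about the behaviour the fix aims for, and the variable names are illustrative.

```python
# Same bytes as in the message: a mix of valid and invalid UTF-8.
text = (
    b'\xff'          # invalid byte
    b'\xc3\xa9'      # valid utf-8 character (U+00E9)
    b'\xc3\xff'      # invalid byte sequence
    b'\xed\xa0\x80'  # lone surrogate character (invalid)
)

# Assumption: invalid bytes are escaped to lone surrogates rather than
# raising, so any byte string can be turned into a str.
decoded = text.decode('utf-8', 'surrogateescape')

# ascii() gives a dump that is easy to compare by eye.
print(ascii(decoded))
print(ascii('\u2603'))

# Encoding with the same handler gives back the original bytes.
restored = decoded.encode('utf-8', 'surrogateescape')
print(restored == text)
```

Because the round trip is lossless, arguments that are not valid UTF-8 can still be passed on to other programs unchanged.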

History
Date	User	Action	Args
2022-04-11 14:57:03	admin	set	github: 53413
2010-10-22 08:58:23	vstinner	set	messages: + msg119370
2010-10-22 01:03:08	r.david.murray	set	status: open -> closed nosy: + r.david.murray messages: + msg119358 resolution: fixed stage: test needed -> resolved
2010-10-21 05:51:14	ronaldoussoren	set	messages: + msg119262
2010-10-21 00:54:28	vstinner	set	messages: + msg119254
2010-07-26 11:38:12	piro	set	messages: + msg111602
2010-07-25 22:23:49	vstinner	set	messages: + msg111565
2010-07-24 12:47:01	ronaldoussoren	set	messages: + msg111470
2010-07-24 00:01:38	vstinner	set	messages: + msg111402
2010-07-23 15:01:24	ronaldoussoren	set	messages: + msg111342
2010-07-23 14:17:22	ronaldoussoren	set	messages: + msg111327
2010-07-06 12:16:51	piro	set	messages: + msg109386
2010-07-06 09:43:12	piro	set	files: + test-argv.patch keywords: + patch messages: + msg109377 versions: + Python 3.2
2010-07-06 07:25:42	ronaldoussoren	set	messages: + msg109368
2010-07-06 07:24:50	ronaldoussoren	set	messages: + msg109367
2010-07-05 16:47:40	ezio.melotti	set	nosy: + ezio.melotti, ronaldoussoren assignee: ronaldoussoren components: + macOS, Unicode stage: test needed
2010-07-05 16:32:48	r.david.murray	set	nosy: + vstinner
2010-07-05 16:07:52	piro	create