Message 106474 - Python tracker


Author	vstinner
Recipients	Arfrever, asvetlov, brett.cannon, pitrou, vstinner
Date	2010-05-25.20:55:20
Message-id	<1274820922.71.0.0886987160071.issue8611@psf.upfronthosting.co.za>
In-reply-to

Content
asvetlov> I'm skeptical about surrogates particularly for that 
asvetlov> problem. From my perspective the solution is only to use 
asvetlov> native Unicode support for Windows file operation functions.

The two are not mutually exclusive. We can use surrogates on POSIX, converting to bytes only at the system calls, and use the Unicode version of the Windows API. In both cases, filenames are Unicode strings.
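A minimal sketch of the POSIX half, assuming "surrogates" here means the `surrogateescape` error handler from PEP 383 and assuming a UTF-8 filesystem encoding (the sample filename is made up for illustration). An undecodable byte becomes a lone surrogate in the `str` and comes back unchanged when encoding again:

```python
# Hypothetical filename: "caf" + 0xE9 (Latin-1 "é") + ".py".
# The byte 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9.py"

# Strict decoding fails on the invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the bad byte 0xE9 to the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", "surrogateescape")
```

The filename stays a `str` the whole time, so the same code path can serve Windows, where the wide-character API takes Unicode directly.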

asvetlov> Conversions utf-8 -> mbcs -> utf-8 will lose encoding
asvetlov> information thanks to the tricky Microsoft mbcs encoding scheme.
asvetlov> If I'm wrong please correct me.

On Windows, Python3 *does* convert Unicode to bytes with the mbcs encoding in the import machinery. I tested it: Python3 has the same problem on Windows with undecodable filenames as Python3 on Unix. E.g. add the "\u0809" character (an arbitrary character that the code page cannot encode) to the Python directory name: Python3 doesn't start if the code page cannot encode/decode it.

To fix this on all operating systems (Windows and POSIX), the Python3 import machinery should not convert filenames to bytes. It should manipulate Unicode strings and convert filenames to bytes only on POSIX, at the last moment (at the system calls).
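Roughly, the idea could look like the sketch below. This is not the actual import machinery: `FS_ENCODING`, `decode_filename` and `encode_for_syscall` are hypothetical names, the directory is invented, and UTF-8 with `surrogateescape` is an assumption about the POSIX side. All path manipulation happens on `str`, and bytes only appear at the boundary:

```python
import posixpath

# Assumption: the POSIX filesystem encoding is UTF-8.
FS_ENCODING = "utf-8"


def decode_filename(raw: bytes) -> str:
    """Bytes from the OS -> str, keeping undecodable bytes as surrogates."""
    return raw.decode(FS_ENCODING, "surrogateescape")


def encode_for_syscall(name: str) -> bytes:
    """str -> bytes, done only right before a POSIX system call."""
    return name.encode(FS_ENCODING, "surrogateescape")


# Hypothetical install directory containing an undecodable byte (0xFF).
directory = decode_filename(b"/opt/py\xff")

# All manipulation (joining, adding a suffix) is done on Unicode strings.
module_path = posixpath.join(directory, "spam" + ".py")

# Conversion to bytes happens once, at the last moment.
syscall_arg = encode_for_syscall(module_path)
```

On Windows the last step would be skipped and `module_path` passed to the Unicode version of the API as is.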

--

The mbcs codec ignores the error handler: it replaces unencodable characters with "?" by default, see #850997.
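The mbcs codec only exists on Windows, so the sketch below uses cp1252 with an explicit "replace" handler as a stand-in for an ANSI code page; that substitution is my assumption, chosen so the example runs anywhere. It shows why a silent "?" replacement is a problem for filenames: the original name cannot be recovered.

```python
# Stand-in for an ANSI code page; the real mbcs codec is Windows-only.
CODE_PAGE = "cp1252"

# Directory name containing a character the code page cannot encode.
name = "py\u0809"

# With strict error handling the failure is at least visible.
try:
    name.encode(CODE_PAGE)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With "?" replacement the encoding "succeeds" but the name is changed.
lossy = name.encode(CODE_PAGE, "replace")
restored = lossy.decode(CODE_PAGE)
```

After the round trip `restored` differs from `name`, so a lookup on disk would target the wrong path.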

History
Date	User	Action	Args
2010-05-25 20:55:23	vstinner	set	recipients: + vstinner, brett.cannon, pitrou, Arfrever, asvetlov
2010-05-25 20:55:22	vstinner	set	messageid: <1274820922.71.0.0886987160071.issue8611@psf.upfronthosting.co.za>
2010-05-25 20:55:20	vstinner	link	issue8611 messages
2010-05-25 20:55:20	vstinner	create