Author vstinner
Recipients Arfrever, asvetlov, brett.cannon, pitrou, vstinner
Date 2010-05-25.20:55:20
Message-id <1274820922.71.0.0886987160071.issue8611@psf.upfronthosting.co.za>
In-reply-to
Content
asvetlov> I'm skeptical about surrogates particularly for that 
asvetlov> problem. From my perspective the solution is only to use 
asvetlov> native unicode support for windows file operation functions.

The two approaches are not mutually exclusive. We can use surrogates on POSIX, converting to bytes only at the system calls, and use the Unicode version of the Windows API. In both cases, filenames are Unicode strings.
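
As a minimal sketch of the surrogate approach on POSIX (assuming the `surrogateescape` error handler from PEP 383 and a UTF-8 filesystem encoding; the byte string below is made up for illustration), an undecodable filename survives the round trip from bytes to str and back:

```python
# Hypothetical filename as returned by a POSIX system call: the 0xff byte
# is not valid UTF-8.
raw = b"caf\xff.py"

# Decode to str: the undecodable byte becomes the lone surrogate U+DCFF.
name = raw.decode("utf-8", "surrogateescape")

# Encode back to bytes just before the system call: the surrogate is
# converted back to the original byte, so no information is lost.
restored = name.encode("utf-8", "surrogateescape")
```

The str value can be stored and manipulated like any other filename; only the final encode step needs to know about the error handler.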

asvetlov> Conversions utf-8 -> mbcs -> utf8 will lose encoding
asvetlov> information thanks to tricky Microsoft mbcs encoding schema.
asvetlov> If I'm wrong please correct me.

On Windows, Python 3 *does* convert Unicode to bytes with the mbcs encoding in the import machinery. I tested this: Python 3 has the same problem with undecodable filenames on Windows as it does on Unix. For example, add the "\u0809" character (a random non-encodable character) to the Python directory name: Python 3 doesn't start if the code page cannot encode/decode it.

To fix this on all operating systems (Windows and POSIX), the Python 3 import machinery should not convert filenames to bytes. It should manipulate Unicode strings and convert filenames to bytes only on POSIX, at the last moment (at the system calls).
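
A rough sketch of that design, not the actual import machinery: the helper names `module_path` and `syscall_arg` are hypothetical, and `os.fsencode` is only available in Python 3.2 and later. Paths stay str throughout, and the conversion to bytes happens in a single place, on POSIX only:

```python
import os
import sys


def module_path(directory: str, modname: str) -> str:
    """Build a candidate source path using str operations only."""
    return os.path.join(directory, modname + ".py")


def syscall_arg(path: str):
    """Convert a path at the system-call boundary (hypothetical helper)."""
    if sys.platform == "win32":
        # The wide-character Windows API accepts Unicode directly.
        return path
    # POSIX: encode at the last moment. os.fsencode uses the filesystem
    # encoding with the surrogateescape error handler.
    return os.fsencode(path)


candidate = module_path("pkg", "mod")
arg = syscall_arg(candidate)
```

Keeping the conversion in one helper means the rest of the code never has to guess which encoding a filename was in.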

--

The mbcs codec ignores the error handler: it replaces unencodable characters with "?" by default; see #850997.
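
To illustrate why the "?" replacement is a problem, here is a small sketch. It uses cp1252 with an explicit `replace` error handler as a portable stand-in for an ANSI code page, which is an assumption made for this example; the real mbcs codec exists only on Windows and its exact behavior depends on the Python version and the active code page.

```python
ch = "\u0809"  # the non-encodable character from the example above

# cp1252 is only a stand-in for an ANSI code page (assumption).
try:
    ch.encode("cp1252")  # strict error handler: fails loudly
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With replacement, the character is silently turned into "?".
lossy = ch.encode("cp1252", "replace")

# Decoding the result cannot recover the original character.
back = lossy.decode("cp1252")
```

Once the character has been replaced, the resulting filename no longer names the original file, so the lookup fails even though no encoding error was raised.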
History
Date                 User      Action  Args
2010-05-25 20:55:23  vstinner  set     recipients: + vstinner, brett.cannon, pitrou, Arfrever, asvetlov
2010-05-25 20:55:22  vstinner  set     messageid: <1274820922.71.0.0886987160071.issue8611@psf.upfronthosting.co.za>
2010-05-25 20:55:20  vstinner  link    issue8611 messages
2010-05-25 20:55:20  vstinner  create