Message106474
asvetlov> I'm skeptical about surrogates particularly for that
asvetlov> problem. From my perspective the solution is only to use
asvetlov> native unicode support for windows file operation functions.
The two approaches are not mutually exclusive. We can use surrogates on POSIX and convert to bytes only at the system calls, and use the unicode version of the Windows API. In both cases, filenames are unicode strings.
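A minimal sketch of the POSIX half of this idea, assuming a UTF-8 filesystem encoding and the `surrogateescape` error handler from PEP 383. The file name is made up for illustration:

```python
# Bytes as they might come back from a POSIX directory listing.
# 0xE9 is "é" in Latin-1 but is not valid UTF-8 in this position.
raw = b"caf\xe9.txt"

# Decoding with surrogateescape maps the undecodable byte 0xE9
# to the lone surrogate U+DCE9 instead of raising an error.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler, right before the system call,
# restores the original bytes exactly.
back = name.encode("utf-8", "surrogateescape")
```

The filename stays a unicode string everywhere inside Python, and the original bytes survive the round trip.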
asvetlov> Conversions utf-8 -> mbcs -> utf8 will lose encoding
asvetlov> information thanks to tricky Microsoft mbcs encoding schema.
asvetlov> If I'm wrong please correct me.
On Windows, Python3 *does* convert unicode to bytes with the mbcs encoding in the import machinery. I tested, and Python3 has the same problem on Windows with undecodable filenames as Python3 has on Unix. E.g. add the "\u0809" character (a random unencodable character) to the Python directory name: Python3 doesn't start if the code page cannot encode/decode it.
To fix this on all OSes (Windows and POSIX), the Python3 import machinery should not convert filenames to bytes but manipulate unicode strings, and only convert filenames to bytes on POSIX at the last moment (at the system calls).
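A rough sketch of what "convert at the last moment" could look like on POSIX. `FS_ENCODING`, `to_syscall_bytes` and the paths are hypothetical names chosen for illustration, not part of the import machinery:

```python
import posixpath

FS_ENCODING = "utf-8"  # assumption: the POSIX filesystem encoding


def to_syscall_bytes(path: str) -> bytes:
    """Encode a str path to bytes only at the system call boundary."""
    return path.encode(FS_ENCODING, "surrogateescape")


# Hypothetical install directory whose name is not valid UTF-8.
raw_dir = b"/opt/py\xff"
directory = raw_dir.decode(FS_ENCODING, "surrogateescape")

# All path manipulation is done on unicode strings.
module_path = posixpath.join(directory, "lib", "os.py")

# The single conversion to bytes happens here, and it restores
# the undecodable byte from the directory name.
syscall_arg = to_syscall_bytes(module_path)
```

On Windows the equivalent step would pass the unicode string straight to the wide-character API, with no conversion to bytes at all.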
--
The mbcs codec ignores the error handler: by default it replaces characters it cannot encode with "?"; see #850997.
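The effect of that replacement can be imitated on any platform. The sketch below uses cp1252 with the "replace" handler as a stand-in, since the mbcs codec only exists on Windows; it illustrates the loss of information, not the exact behavior of mbcs:

```python
# Directory name containing the character used in the test above.
name = "dir\u0809"

# A strict encode to the code page fails outright.
try:
    name.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Replacing unencodable characters "succeeds" but silently loses
# the original character, so the name no longer matches on disk.
lossy = name.encode("cp1252", "replace")
roundtrip = lossy.decode("cp1252")
```

After the round trip the name is "dir?", which no longer identifies the original directory.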
Date | User | Action | Args
2010-05-25 20:55:23 | vstinner | set | recipients: + vstinner, brett.cannon, pitrou, Arfrever, asvetlov
2010-05-25 20:55:22 | vstinner | set | messageid: <1274820922.71.0.0886987160071.issue8611@psf.upfronthosting.co.za>
2010-05-25 20:55:20 | vstinner | link | issue8611 messages
2010-05-25 20:55:20 | vstinner | create |