Message84512
From my understanding (after tracing and debugging), the problem lies in import.c.
find_module tries to convert the path from unicode to a bytestring using Py_FileSystemDefaultEncoding (line 1397). On Windows this is 'mbcs'.
The conversion is done by decode_mbcs (unicodeobject.c:4244), which uses MultiByteToWideChar with code page CP_ACP. The problem is that when converting
composite characters ('\u00e4' is 'a' plus two dots over the letter; I don't know
the proper name for this sign), this function returns only 'a'.
>>> repr('h\u00e4kkinen'.encode('mbcs'))
"b'hakkinen'"
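The result above only appears on Windows, where the 'mbcs' codec exists. As a rough, cross-platform sketch of the same effect, the snippet below emulates a lossy best-fit mapping by dropping diacritics, and checks whether a name survives an encode/decode round trip. Both helper names are hypothetical. The diacritic stripping is a stand-in of mine, not what Windows does internally, and 'cp1251' is only assumed here as an example of an ANSI code page that has no '\u00e4'.

```python
import unicodedata


def best_fit_ascii(text):
    """Hypothetical stand-in for a lossy best-fit mapping: drop diacritics.

    This only imitates the visible effect ('\u00e4' -> 'a'); it is not the
    algorithm Windows uses.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have a non-zero canonical combining class.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def round_trips(text, encoding):
    """Return True if text survives encode + decode with the given codec."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # Python's own codecs are strict by default and refuse to guess.
        return False


name = "h\u00e4kkinen"
print(best_fit_ascii(name))
print(round_trips(name, "cp1252"), round_trips(name, "cp1251"))
```

A check like round_trips could be used to detect, before touching the filesystem, that a module path cannot be represented in the byte encoding at all.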
MSDN says (http://msdn.microsoft.com/en-us/library/dd374130(VS.85).aspx):
For strings that require validation, such as file, resource, and user
names, the application should always use the WC_NO_BEST_FIT_CHARS flag
with WideCharToMultiByte. This flag prevents the function from mapping
characters to characters that appear similar but have very different
semantics. In some cases, the semantic change can be extreme. For
example, the symbol for "∞" (infinity) maps to 8 (eight) in some code
pages.
Writing an encoding function as the counterpart of PyUnicode_DecodeFSDefault
that sets this flag does not help either: the problematic character is just
replaced with the default character ('?' if none is specified).
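To illustrate why replacement with a default character does not help, here is a small sketch using Python's own 'replace' error handler, which also substitutes '?'. It does not call WideCharToMultiByte; 'cp1251' is again only an assumed example of a code page lacking these letters, and the helper name is hypothetical.

```python
def encode_with_default(name, encoding="cp1251"):
    """Encode name, turning every unencodable character into b'?'.

    Hypothetical helper: it mimics the default-character behaviour
    described above by using Python's errors='replace' handler.
    """
    return name.encode(encoding, errors="replace")


# Two different names (assumed examples) that become indistinguishable.
first = encode_with_default("h\u00e4kkinen")
second = encode_with_default("h\u00f6kkinen")
print(first, second, first == second)
```

The replaced name no longer matches the real file on disk, and distinct names can collide, so the import would still fail.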
Special-casing the 'latin-1' encoding sounds ugly.
Changing all filenames in import.c to unicode (possibly using fileio instead
of direct calls to open/fdopen) looks good to me, but it would take
a long time and require many changes.
Date | User | Action | Args
2009-03-30 05:43:33 | asvetlov | set | recipients: + asvetlov, amaury.forgeotdarc, vstinner, benjamin.peterson, Jukka Aho
2009-03-30 05:43:32 | asvetlov | set | messageid: <1238391812.11.0.954566528358.issue4352@psf.upfronthosting.co.za>
2009-03-30 05:43:30 | asvetlov | link | issue4352 messages
2009-03-30 05:43:28 | asvetlov | create |