Message 84512 - Python tracker


Author	asvetlov
Recipients	Jukka Aho, amaury.forgeotdarc, asvetlov, benjamin.peterson, vstinner
Date	2009-03-30.05:43:27
Message-id	<1238391812.11.0.954566528358.issue4352@psf.upfronthosting.co.za>
In-reply-to

Content
From my understanding (after tracing/debugging), the problem lies in
import.c: find_module tries to convert the path from unicode to a
bytestring using Py_FileSystemDefaultEncoding (line 1397). On Windows
that is 'mbcs'.

The conversion is done by decode_mbcs (unicodeobject.c:4244), which uses
MultiByteToWideChar with code page CP_ACP. The problem is that when
converting composite characters ('\u00e4' is 'a' with a diaeresis, i.e.
two dots over the letter) this function returns only 'a'.

>>> repr('h\u00e4kkinen'.encode('mbcs'))
"b'hakkinen'"

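The lossy result above can be imitated on any platform. The sketch below is not what the Windows conversion does internally (Windows relies on per-code-page best-fit tables); it only approximates the effect for accented Latin letters by decomposing the string and dropping the combining marks. The helper name strip_marks is hypothetical and exists only for this illustration.

```python
import unicodedata

def strip_marks(text):
    """Imitate best-fit mapping for accented letters (approximation only)."""
    # NFD splits '\u00e4' into 'a' followed by U+0308 COMBINING DIAERESIS.
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping every combining mark leaves just the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

name = "h\u00e4kkinen"
print(unicodedata.name("\u00e4"))   # official Unicode name of the character
print(strip_marks(name))            # the accent is silently lost
print(strip_marks(name) == name)    # False: this now names a different file
```

The last line is the point: once the accent is dropped, the bytestring no longer identifies the directory that actually exists on disk.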
MSDN says
(http://msdn.microsoft.com/en-us/library/dd374130(VS.85).aspx):
For strings that require validation, such as file, resource, and user 
names, the application should always use the WC_NO_BEST_FIT_CHARS flag 
with WideCharToMultiByte. This flag prevents the function from mapping 
characters to characters that appear similar but have very different 
semantics. In some cases, the semantic change can be extreme. For 
example, the symbol for "∞" (infinity) maps to 8 (eight) in some code 
pages.

Writing an encoding function as the counterpart of
PyUnicode_DecodeFSDefault that sets this flag would not help either: the
problematic character is simply replaced with the default character
('?' if none is specified).
Special-casing the 'latin-1' encoding sounds ugly.
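To illustrate why suppressing best-fit mapping does not rescue the file name: Python's own code page codecs never apply best-fit mapping, so they behave roughly like an encoder with WC_NO_BEST_FIT_CHARS set. The sketch assumes cp1251 as the ANSI code page purely for illustration (the message does not say which code page was in use), because the 'mbcs' codec itself is only available on Windows.

```python
name = "h\u00e4kkinen"

# Strict encoding refuses the character outright.
try:
    name.encode("cp1251")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

# "replace" substitutes the default character, so the name still changes.
replaced = name.encode("cp1251", errors="replace")
print(replaced)                             # b'h?kkinen'
print(replaced.decode("cp1251") == name)    # False: no round trip
```

Either way the bytestring cannot be used to open the original path, and '?' is not even a legal character in a Windows file name.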

Changing all filenames in import.c to unicode (possibly using fileio
instead of direct calls to open/fdopen) looks good to me, but it would
take a long time and require many changes.

History
Date	User	Action	Args
2009-03-30 05:43:33	asvetlov	set	recipients: + asvetlov, amaury.forgeotdarc, vstinner, benjamin.peterson, Jukka Aho
2009-03-30 05:43:32	asvetlov	set	messageid: <1238391812.11.0.954566528358.issue4352@psf.upfronthosting.co.za>
2009-03-30 05:43:30	asvetlov	link	issue4352 messages
2009-03-30 05:43:28	asvetlov	create