bytes.decode('mbcs', 'ignore') does replace undecodable bytes on Windows Vista or later #56490
Comments
Starting with Python 3.2, the MBCS codec uses MultiByteToWideChar() to decode bytes, with flags=MB_ERR_INVALID_CHARS by default (strict error handler) and flags=0 for the ignore error handler; it raises a ValueError for other error handlers. The problem is that the meaning of flags=0 changes with the Windows version:
We should accept the "replace" error handler with flags=0, at least on Windows Vista and later. I don't know if we should only accept "ignore" on Windows <= XP and only "error" on Windows >= Vista, or if the difference should be documented. |
MBCS codec was changed by bpo-850997. Martin von Loewis proposed solutions to implement other error handlers in msg19180. |
mbcs.patch fixes PyUnicode_DecodeMBCS():
My patch always tries to decode in strict mode. On a decode error, it decodes byte by byte and calls unicode_decode_call_errorhandler() on each error. TODO:
Is it necessary to write a NUL character at the end? ("*out = 0;") It would be nice to support any code page, and maybe support more options (e.g. MB_COMPOSITE, MB_PRECOMPOSED, MB_USEGLYPHCHARS to decode). It is possible to test different code pages by changing the hardcoded code_page value in PyUnicode_DecodeMBCS. Change your region in the control panel if you would like to change the Windows ANSI code page. You can also play with SetThreadLocale() and CP_THREAD_ACP to test the ANSI code page of the current thread. |
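The strategy described above (strict decode first, then a byte-by-byte pass that calls the error handler) can be sketched in pure Python. This is only an illustration, not the C code from the patch: decode_with_fallback, _decode_one and max_len are hypothetical names, and a builtin Python codec stands in for MultiByteToWideChar() so the sketch runs on any platform.

```python
import codecs


def _decode_one(data, pos, encoding, max_len=2):
    """Return (text, new_pos) for the shortest decodable sequence at pos, or None.

    max_len=2 is an assumption that fits double-byte code pages such as cp932.
    """
    for size in range(1, max_len + 1):
        chunk = data[pos:pos + size]
        if len(chunk) < size:
            break  # not enough bytes left for a longer sequence
        try:
            return chunk.decode(encoding, "strict"), pos + size
        except UnicodeDecodeError:
            continue
    return None


def decode_with_fallback(data, encoding, errors="strict"):
    """Decode strictly; on failure, redo the work step by step with an error handler."""
    try:
        # Fast path: the whole input is valid.
        return data.decode(encoding, "strict")
    except UnicodeDecodeError:
        if errors == "strict":
            raise
    handler = codecs.lookup_error(errors)
    out = []
    pos = 0
    while pos < len(data):
        step = _decode_one(data, pos, encoding)
        if step is not None:
            text, pos = step
        else:
            # Report a single undecodable byte to the error handler.
            exc = UnicodeDecodeError(encoding, data, pos, pos + 1, "invalid byte")
            text, pos = handler(exc)
        out.append(text)
    return "".join(out)
```

The slow path only runs when the strict attempt fails, so valid input pays no extra cost. Invalid input is decoded one short sequence at a time.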
Example with ANSI=cp932 (on Windows Seven):
|
Oh, and b'\xff'.decode('mbcs', 'surrogateescape') gives '\udcff' as expected. At least for surrogateescape, it would be nice if mbcs supported any error handler on encoding. |
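For readers without a Windows machine, the surrogateescape behaviour mentioned above can be reproduced with a portable codec. The sketch below uses "ascii" as a stand-in for "mbcs" (an assumption made only so the example runs everywhere); it shows why supporting the handler on both decode and encode matters, since the two directions together give a lossless round trip.

```python
raw = b"caf\xff"

# The undecodable byte 0xFF becomes the lone surrogate U+DCFF.
text = raw.decode("ascii", "surrogateescape")
assert text == "caf\udcff"

# Encoding with the same handler restores the original byte.
assert text.encode("ascii", "surrogateescape") == raw
```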
Version 2 of my patch (mbcs2.patch):
The encoder raises a RuntimeError("recursive call") (ugly message!) if the result of the error handler is a Unicode string that cannot be encoded to the code page. More TODO:
|
Example on Windows Vista with ANSI=cp932:
>>> import codecs
>>> codecs.code_page_encode(1252, '\xe9')
(b'\xe9', 1)
>>> codecs.mbcs_encode('\xe9')
...
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
>>> codecs.code_page_encode(932, '\xe9')
...
UnicodeEncodeError: 'cp932' codec can't encode characters in position 0--1: invalid character
>>> codecs.code_page_encode(932, '\xe9', 'replace')
(b'e', 1)
>>> codecs.code_page_encode(932, '\xe9', 'ignore')
(b'', 8)
>>> codecs.code_page_encode(932, '\xe9', 'backslashreplace')
(b'\\xe9', 8)
You can use a code page different than the ANSI code page. The encoding name is generated from the code page number: "cp%u" % code_page, or "mbcs" if code_page == CP_ACP. (Oops, I forgot a printf() in mbcs2.patch) |
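The naming rule quoted above is small enough to restate in Python. This is a sketch of the rule as described in the comment, not code taken from the patch; code_page_name is a hypothetical helper, and CP_ACP = 0 is the value of the Windows constant.

```python
CP_ACP = 0  # Windows constant: the system ANSI code page


def code_page_name(code_page):
    """Return the encoding name reported in errors for a code page number."""
    if code_page == CP_ACP:
        return "mbcs"
    return "cp%u" % code_page
```

With this rule, an error raised while encoding to code page 932 is reported against 'cp932', while the ANSI code page keeps the historical 'mbcs' name.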
Decode examples, ANSI=cp932:
>>> codecs.code_page_decode(1252, b'\x80')
('\u20ac', 1)
>>> codecs.code_page_decode(932, b'\x82')
...
UnicodeDecodeError: 'mbcs' codec can't decode bytes in position 0--1: No mapping for the Unicode character exists in the target code page.
>>> codecs.code_page_decode(932, b'\x82', 'replace')
('・', 1)
>>> codecs.code_page_decode(932, b'\x82', 'ignore')
('', 0)
Oh, the encoding name is wrong in the decoding errors. |
Patch version 3:
TODO:
|
Using my patch, it is possible to create a codec for any code page on demand: register a function checking if the encoding name starts with "cp" and ends with a valid code page number. Even if it is a bad idea to set the OEM code page to 65001, implementing a codec for this code page would solve issue bpo-6058 (and help issues bpo-7441 and bpo-10920). See also issue bpo-1602 (Unicode support of the Windows console). I don't know if the Windows codec should be used, if available, instead of the Python builtin codecs for Windows code pages (e.g. the "cp1252" encoding). |
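A minimal sketch of such an on-demand codec, assuming the codecs.code_page_encode() and codecs.code_page_decode() functions discussed in this issue are available (they only exist on Windows builds). parse_code_page and search_code_page_codec are hypothetical names. The sketch only checks that the suffix is numeric and leaves it to Windows to reject an invalid code page number.

```python
import codecs
import re


def parse_code_page(name):
    """Return the code page number for names like 'cp932', else None."""
    match = re.fullmatch(r"cp([0-9]+)", name)
    return int(match.group(1)) if match else None


def search_code_page_codec(name):
    """Codec search function: build a codec for 'cpNNN' names on demand."""
    code_page = parse_code_page(name)
    if code_page is None or not hasattr(codecs, "code_page_decode"):
        return None  # not a code page name, or not running on Windows

    def encode(text, errors="strict"):
        return codecs.code_page_encode(code_page, text, errors)

    def decode(data, errors="strict"):
        # final=True: treat the input as complete, so a truncated
        # multibyte sequence is reported as an error.
        return codecs.code_page_decode(code_page, data, errors, True)

    return codecs.CodecInfo(encode, decode, name=name)


codecs.register(search_code_page_codec)
```

Search functions are consulted in registration order, so names already served by Python builtin codecs (for example "cp1252") would still resolve to the builtin implementation; only otherwise unknown "cpNNN" names reach this function.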
Patch version 4 (mbcs4.patch):
|
Patch version 5 fixes the encode/decode flags on Windows XP. The codecs give different results on XP and Seven in some cases:
Seven:
XP:
These differences come from Windows codecs. |
What is the use of these code_page_encode() functions? |
TODO: add more tests
CP_UTF8:
if self.vista_or_later:
    tests.append(('\udc80', 'strict', None))
    tests.append(('\udc80', 'ignore', b''))
    tests.append(('\udc80', 'replace', b'\xef\xbf\xbd'))
else:
    tests.append(('\udc80', 'strict', b'\xed\xb2\x80'))
cp1252:
|
I wrote them to be able to write tests. We can maybe use them to implement the Python code page codecs using a
My main concern is to fix error handling of the Python mbcs codec.
--
I am also trying to factorize the code in posixmodule.c: I would like to
We may patch os.fsdecode() to handle undecodable bytes like Windows
Example:
def fsdecode(filename):
    if isinstance(filename, bytes):
        return codecs.code_page_decode(codecs.CP_ACP, filename, flags=0)
    elif isinstance(filename, str):
        return filename
    else:
        raise TypeError() |
I still don't see the advantage of codecs.code_page_encode(). |
Yes, we can use an error handler specific to the mbcs codec, but I would prefer to not introduce special error handlers. For os.fsencode(), we can keep it unchanged, or add an optional "flags" argument to codecs.mbcs_decode().
codecs.code_page_encode() and codecs.code_page_decode() are required for unit tests. If you don't want to add new public (C and Python) functions, we may add them to _testcapi. |
mbcs6.patch: update patch to tip. |
Version 7 of my patch. This patch is ready for a review: I implemented all TODO. Summary of the patch (of this issue):
With the patch, Python 3.3 will give different results than Python 3.2 with the replace and ignore error handlers (which was required to fix bugs). I consider the new behaviour more correct than the previous one. It doesn't use the Windows "replace" mode, which is different than the Python "replace" mode. codecs.code_page_encode() and codecs.code_page_decode() are currently used for unit tests, but they can be used to implement the cp65001 encoding in Python (or any other Windows code page). This encoding is regularly requested: see issues bpo-6058, bpo-7441 and bpo-10920. Changes since patch version 6:
I only tried my patch on Windows Seven. -- The codec works byte by byte / character by character if the string cannot be decoded/encoded in strict mode, so handling errors (with an error handler other than strict) can be slow. I didn't implement the optimizations suggested by Martin. Since the patch has a long test suite, it may be possible to implement them later. -- The patch doesn't expose custom options (MB_COMPOSITE, MB_PRECOMPOSED, MB_USEGLYPHCHARS, ...). I consider that the Python builtin error handlers (strict, ignore, replace, backslashreplace, ...) are enough. -- I deferred my idea of decoding bytes filenames from the ANSI code page in posixmodule.c (and patching the Python os.fsencode/fsdecode functions). It should be discussed in another issue. |
New changeset af0800b986b7 by Victor Stinner in branch 'default': |
New changeset 5841920d1ef6 by Victor Stinner in branch 'default': |
New changeset 413b89242766 by Victor Stinner in branch 'default': |
test_codecs passes on the Windows XP and Windows Seven buildbots. |
I added a cp65001 codec to Python 3.3: see issue bpo-13216. |