classification
Title: bytes.decode('mbcs', 'ignore') does replace undecodable bytes on Windows Vista or later
Type: Stage:
Components: Unicode Versions: Python 3.2, Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, ezio.melotti, haypo, loewis, ocean-city, python-dev
Priority: normal Keywords: patch

Created on 2011-06-07 21:48 by haypo, last changed 2011-10-26 23:48 by haypo. This issue is now closed.

Files
File name Uploaded Description
mbcs7.patch haypo, 2011-10-18 00:15
Messages (24)
msg137885 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-07 21:48
Starting with Python 3.2, the MBCS codec uses MultiByteToWideChar() to decode bytes, with flags=MB_ERR_INVALID_CHARS by default (strict error handler) and flags=0 for the ignore error handler. It raises a ValueError for any other error handler.

The problem is that the meaning of flags=0 changes with the Windows version:

 - undecodable bytes are ignored up to and including Windows XP
 - undecodable bytes are *replaced* on Windows Vista and later

We should accept the "replace" error handler with flags=0, at least on Windows Vista and later.

I don't know if we should only accept "ignore" on Windows <= XP and only "replace" on Windows >= Vista, or if the difference should be documented.
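For reference, this is how Python's own "ignore" and "replace" handlers behave with a builtin, platform-independent codec. The sketch uses the ascii codec instead of mbcs (an assumption, so that it runs on any platform), so it only illustrates what callers expect from the handler names, not what MultiByteToWideChar() returns.

```python
# Same input as in the mbcs examples of this issue, decoded with a
# builtin codec (ascii is an assumption made for portability).
data = b'abc\xffdef'

# "ignore" drops the undecodable byte.
ignored = data.decode('ascii', 'ignore')

# "replace" substitutes U+FFFD (REPLACEMENT CHARACTER) for it.
replaced = data.decode('ascii', 'replace')

print(ascii(ignored))
print(ascii(replaced))
```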
msg137887 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-07 22:01
MBCS codec was changed by #850997. Martin von Loewis proposed solutions to implement other error handlers in msg19180.
msg137902 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-08 12:47
mbcs.patch fixes PyUnicode_DecodeMBCS():
 - only use flags=0 if errors="replace" on Windows >= Vista or if errors="ignore" on Windows < Vista
 - support any error handler
 - support any code page (but the code page is hardcoded to CP_ACP)

My patch always tries to decode in strict mode first. On a decode error, it decodes byte by byte and calls unicode_decode_call_errorhandler() on each error.

TODO:

 - don't use insize=1 (decode byte by byte): it doesn't work with multibyte encodings (like UTF-8)
 - use final in decode_mbcs_errors(): a multibyte character may be split between two chunks of INT_MAX bytes
 - fix all FIXME
 - also patch PyUnicode_EncodeMBCS()
 - implement the optimizations suggested by Martin?
 - MB_ERR_INVALID_CHARS is not supported by some code pages (e.g. UTF-7 code page)

Is it necessary to write a NUL character at the end? ("*out = 0;")

It would be nice to support any code page, and maybe support more options (e.g. MB_COMPOSITE, MB_PRECOMPOSED, MB_USEGLYPHCHARS to decode).

It is possible to test different code pages by changing the hardcoded code_page value in PyUnicode_DecodeMBCS. Change your region in the control panel if you would like to change the Windows ANSI code page. You can also play with SetThreadLocale() and CP_THREAD_ACP to test the ANSI code page of the current thread.
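The strategy described in this message (decode in strict mode first, and only on failure hand each undecodable span to the error handler) can be modelled in pure Python. This is a hedged sketch, not the C code of the patch: `decode_with_handler` is a hypothetical name, and it relies on Python's builtin codecs rather than MultiByteToWideChar(). It resumes at the position reported by the strict decoder instead of using insize=1, which avoids the multibyte problem listed in the TODO.

```python
import codecs


def decode_with_handler(data, encoding, errors="strict"):
    """Decode strictly; on failure, delegate each bad span to `errors`."""
    handler = codecs.lookup_error(errors)
    chunks = []
    pos = 0
    while pos < len(data):
        try:
            # Fast path: the rest of the input decodes in strict mode.
            chunks.append(data[pos:].decode(encoding, "strict"))
            break
        except UnicodeDecodeError as exc:
            # Positions in exc are relative to the slice; make them absolute.
            start, end = pos + exc.start, pos + exc.end
            chunks.append(data[pos:start].decode(encoding, "strict"))
            error = UnicodeDecodeError(encoding, data, start, end, exc.reason)
            # The handler returns (replacement, resume position) or raises.
            replacement, pos = handler(error)
            if pos < 0:
                pos += len(data)  # negative positions count from the end
            chunks.append(replacement)
    return "".join(chunks)


print(ascii(decode_with_handler(b"\xc3\xa9\xff", "utf-8", "replace")))
```

The sketch assumes a well-behaved handler that always moves the resume position forward, as the builtin handlers do.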
msg137903 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-08 12:57
Example with ANSI=cp932 (on Windows Seven):
 - b'abc\xffdef'.decode('mbcs', 'replace') gives 'abc\uf8f3def'
 - b'abc\xffdef'.decode('mbcs', 'ignore') gives 'abcdef'
msg137904 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-08 12:59
> Example with ANSI=cp932 (on Windows Seven):
>  - b'abc\xffdef'.decode('mbcs', 'replace') gives 'abc\uf8f3def'
>  - b'abc\xffdef'.decode('mbcs', 'ignore') gives 'abcdef'

Oh, and b'\xff'.decode('mbcs', 'surrogateescape') gives '\udcff' as expected. It would be nice if mbcs also supported any error handler on encoding, at least surrogateescape.
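The reason surrogateescape matters on the encoding side is that it is designed to round-trip: bytes that cannot be decoded become lone surrogates, and encoding with the same handler restores the original bytes. A minimal sketch, using the builtin utf-8 codec as a stand-in for mbcs (an assumption, so that it runs on any platform):

```python
raw = b'abc\xffdef'

# Undecodable byte 0xFF is mapped to the lone surrogate U+DCFF.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler turns U+DCFF back into byte 0xFF.
roundtrip = text.encode('utf-8', 'surrogateescape')

print(ascii(text), roundtrip == raw)
```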
msg138078 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-10 13:37
Version 2 of my patch (mbcs2.patch):
 - also patch the encoder: fix ignore/replace depending on the Windows version, and support any error handler by encoding character by character if encoding in strict mode fails
 - Add PyUnicode_DecodeCodePageStateful() and PyUnicode_EncodeCodePage() functions
 - Expose these functions as codecs.code_page_decode() and codecs.code_page_encode()

The encoder raises a RuntimeError("recursive call") (ugly message!) if the result of the error handler is a Unicode string that cannot be encoded to the code page.

More TODO:

 - write tests using codecs.code_page_decode() and codecs.code_page_encode()
 - Fix FIXME (e.g. support surrogates in the encoder)
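The encoder side of version 2 follows the same idea as the decoder: try strict mode, and fall back to the error handler only for the characters that fail. The following is a pure-Python model under stated assumptions: `encode_with_handler` is a hypothetical name, and it uses Python's builtin codecs instead of WideCharToMultiByte(). Where the patch raises RuntimeError("recursive call"), this sketch simply lets the UnicodeEncodeError from the replacement propagate.

```python
import codecs


def encode_with_handler(text, encoding, errors="strict"):
    """Encode strictly; on failure, delegate each bad span to `errors`."""
    handler = codecs.lookup_error(errors)
    chunks = []
    pos = 0
    while pos < len(text):
        try:
            # Fast path: the rest of the string encodes in strict mode.
            chunks.append(text[pos:].encode(encoding, "strict"))
            break
        except UnicodeEncodeError as exc:
            # Positions in exc are relative to the slice; make them absolute.
            start, end = pos + exc.start, pos + exc.end
            chunks.append(text[pos:start].encode(encoding, "strict"))
            error = UnicodeEncodeError(encoding, text, start, end, exc.reason)
            replacement, pos = handler(error)
            if pos < 0:
                pos += len(text)  # negative positions count from the end
            if isinstance(replacement, str):
                # Raises if the replacement itself cannot be encoded.
                replacement = replacement.encode(encoding, "strict")
            chunks.append(replacement)
    return b"".join(chunks)


print(encode_with_handler("caf\xe9", "ascii", "backslashreplace"))
```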
msg138079 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-10 13:44
Example on Windows Vista with ANSI=cp932:

>>> import codecs
>>> codecs.code_page_encode(1252, '\xe9')
(b'\xe9', 1)
>>> codecs.mbcs_encode('\xe9')
...
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
>>> codecs.code_page_encode(932, '\xe9')
...
UnicodeEncodeError: 'cp932' codec can't encode characters in position 0--1: invalid character
>>> codecs.code_page_encode(932, '\xe9', 'replace')
(b'e', 1)
>>> codecs.code_page_encode(932, '\xe9', 'ignore')
(b'', 8)
>>> codecs.code_page_encode(932, '\xe9', 'backslashreplace')
(b'\\xe9', 8)

You can use a code page other than the ANSI code page.

The encoding name is generated from the code page number: "cp%u" % code_page, or "mbcs" if code_page == CP_ACP.

(Oops, I forgot a printf() in mbcs2.patch)
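The naming rule stated above can be written out as a small helper. This is only an illustration: `code_page_name` is a hypothetical function, not part of the patch, and CP_ACP is defined here with its Windows value of 0.

```python
CP_ACP = 0  # Windows constant for the system ANSI code page


def code_page_name(code_page):
    """Return the encoding name reported for a Windows code page."""
    if code_page == CP_ACP:
        return "mbcs"
    return "cp%u" % code_page


print(code_page_name(932))
```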
msg138080 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-10 13:48
Decode examples, ANSI=cp932:

>>> codecs.code_page_decode(1252, b'\x80')
('\u20ac', 1)
>>> codecs.code_page_decode(932, b'\x82')
...
UnicodeDecodeError: 'mbcs' codec can't decode bytes in position 0--1: No mapping for the Unicode character exists in the target code page.
>>> codecs.code_page_decode(932, b'\x82', 'replace')
('・', 1)
>>> codecs.code_page_decode(932, b'\x82', 'ignore')
('', 0)

Oh, the encoding name is wrong in the decoding errors.
msg138244 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-13 13:52
Patch version 3:
 - add unit tests for code pages 932, 1252, CP_UTF7 and CP_UTF8
 - fix encode/decode flags for CP_UTF7/CP_UTF8
 - fix the encoding name in UnicodeDecodeError, and also support the "CP_UTF7" and "CP_UTF8" code page names

TODO:

 - The decoder (with errors) doesn't support multibyte characters, e.g. b"\xC3\xA9\xFF" is not correctly decoded using "replace" (insize is fixed to 1)
 - The encoder doesn't support surrogate pairs, but the result with UTF-8 looks correct
 - UTF-7 decoder is not strict, e.g. b'[+/]' is decoded to '[]' in strict mode
 - UTF-8 encoder is not strict, e.g. replace surrogates by U+FFFD
 - Use final in decode_mbcs_errors(): a multibyte character may be split between two chunks of INT_MAX bytes
 - Implement the optimizations suggested by Martin?
msg138246 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-13 13:59
Using my patch, it is possible to create a codec for any code page on demand: register a function that checks whether the encoding name starts with "cp" and ends with a valid code page number.

Even if it is a bad idea to set the OEM code page to 65001, implementing a codec for this code page would solve issue #6058 (and help with issues #7441 and #10920). See also issue #1602 (Unicode support of the Windows console).

I don't know whether the Windows codec should be used, if available, instead of the Python builtin codecs for Windows code pages (e.g. the "cp1252" encoding).
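A codec search function of the kind described in this message could look like the sketch below. `parse_code_page` and `search_code_page` are hypothetical names. The sketch assumes that codecs.code_page_encode() and codecs.code_page_decode() are available, which is only the case on Windows with a Python that includes this change; elsewhere the search function declines every name. It does not check that the number is a valid code page, so an unknown page would only fail at encode or decode time.

```python
import codecs
import re


def parse_code_page(name):
    """Return the code page number for names like 'cp65001', else None."""
    match = re.fullmatch(r"cp([0-9]+)", name)
    return int(match.group(1)) if match else None


def search_code_page(name):
    code_page = parse_code_page(name)
    if code_page is None or not hasattr(codecs, "code_page_decode"):
        return None  # not a code page name, or not on Windows

    def encode(text, errors="strict"):
        return codecs.code_page_encode(code_page, text, errors)

    def decode(data, errors="strict"):
        # final=True: treat the input as complete (stateless decoding).
        return codecs.code_page_decode(code_page, bytes(data), errors, True)

    return codecs.CodecInfo(encode, decode, name=name)


# Search functions are tried in registration order, so the builtin
# codecs (e.g. "cp1252") are still found first.
codecs.register(search_code_page)
```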
msg138407 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-16 00:32
Patch version 4 (mbcs4.patch):
 - fix encode and decode flags depending on the code page and Windows version, e.g. use WC_ERR_INVALID_CHARS instead of WC_NO_BEST_FIT_CHARS for CP_UTF8 on Windows Vista and later
 - fix usage of the default character on encoding, depending on the code page (incompatible with CP_UTF7 and CP_UTF8)
 - add some more unit tests
 - read the Windows version only once, at startup
 - decode_code_page_chunk() now adjusts the input size depending on the final flag (it was done by decode_code_page_strict)
msg138478 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-16 23:42
Patch version 5 fixes the encode/decode flags on Windows XP. The codecs give different results on XP and Seven in some cases:

Seven:

- b'\x81\x00abc'.decode('cp932', 'replace') returns '\u30fb\x00abc'
- '\udc80'.encode(CP_UTF8, 'strict') raises UnicodeEncodeError
- b'[\xed\xb2\x80]'.decode(CP_UTF8, 'strict') raises UnicodeDecodeError
- b'[\xed\xb2\x80]'.decode(CP_UTF8, 'ignore') returns '[]'
- b'[\xed\xb2\x80]'.decode(CP_UTF8, 'replace') returns '[\ufffd\ufffd\ufffd]'

XP:

- b'\x81\x00abc'.decode('cp932', 'replace') returns '\x00\x00abc'
- '\udc80'.encode(CP_UTF8, 'strict') returns b'\xed\xb2\x80'
- b'[\xed\xb2\x80]'.decode(CP_UTF8, 'strict') returns '[\udc80]'

These differences come from Windows codecs.
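For comparison, Python's builtin utf-8 codec behaves the same way on every platform. The sketch below shows that its strict, ignore and replace results for a lone surrogate match the Seven column above, and that the builtin surrogatepass handler gives the results listed for XP. It uses only the builtin codec, not CP_UTF8.

```python
lone = '\udc80'
raw = b'[\xed\xb2\x80]'

# Strict encoding of a lone surrogate is rejected.
try:
    lone.encode('utf-8')
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# Each of the three bytes is reported as a separate error.
replaced = raw.decode('utf-8', 'replace')
ignored = raw.decode('utf-8', 'ignore')

# surrogatepass lets the lone surrogate through in both directions.
passed = raw.decode('utf-8', 'surrogatepass')
passed_bytes = lone.encode('utf-8', 'surrogatepass')

print(strict_encode_ok, ascii(replaced), ascii(ignored), ascii(passed))
```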
msg138480 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2011-06-16 23:51
What is the use of these code_page_encode() functions?
msg138481 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-16 23:52
TODO: add more tests

CP_UTF8:

        if self.vista_or_later:
            tests.append(('\udc80', 'strict', None))
            tests.append(('\udc80', 'ignore', b''))
            tests.append(('\udc80', 'replace', b'\xef\xbf\xbd'))
        else:
            tests.append(('\udc80', 'strict', b'\xed\xb2\x80'))

cp1252:

            ('\u0141', 'strict', None),
            ('\u0141', 'ignore', b''),
            ('\u0141', 'replace', b'L'),
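The same table-driven style can be tried against Python's builtin cp1252 codec, which runs on any platform. In this hedged sketch, `check` is a hypothetical helper and None means that a UnicodeEncodeError is expected. Note one difference from the Windows table above: the builtin codec replaces with b'?' rather than the best-fit b'L'.

```python
tests = [
    ('\u0141', 'strict', None),   # not in cp1252: expect an error
    ('\u0141', 'ignore', b''),
    ('\u0141', 'replace', b'?'),  # builtin codec uses '?', not best fit
    ('\xe9', 'strict', b'\xe9'),
]


def check(encoding, tests):
    """Return the list of (text, errors, result) entries that mismatch."""
    failures = []
    for text, errors, expected in tests:
        try:
            result = text.encode(encoding, errors)
        except UnicodeEncodeError:
            result = None
        if result != expected:
            failures.append((text, errors, result))
    return failures


print(check('cp1252', tests))
```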
msg138484 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-17 00:35
> What is the use of these code_page_encode() functions?

I wrote them to be able to write tests.

We can maybe use them to implement the Python code page codecs using a
custom codec register function: see msg138246. Windows codecs seem to be
less reliable/portable than Python builtin codecs, they behave
differently depending on the Windows version. Windows codecs are maybe
faster, I should (write and) run a benchmark.

My main concern is to fix error handling of the Python mbcs codec.

--

I am also trying to refactor the code in posixmodule.c: I would like to
remove the bytes implementation of each function when a function has two
implementations (bytes and Unicode) only for Windows. The idea is to
decode filenames exactly as Windows does and reuse the Unicode
implementation. I don't know yet how Windows decodes bytes filenames
(especially how it handles undecodable bytes), I suppose that it uses
MultiByteToWideChar using cp=CP_ACP and flags=0.

We may patch os.fsdecode() to handle undecodable bytes like Windows
does. codecs.code_page_decode() would help with this specific idea,
except that my current patch doesn't allow specifying the flags
directly. The "replace" and "ignore" error handlers don't behave like
flags=0, or at least not in some cases. codecs.code_page_decode() should
allow specifying an error handler *or* the flags (mutually exclusive
options).

Example:

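import codecs

def fsdecode(filename):
    if isinstance(filename, bytes):
        # "flags" and codecs.CP_ACP are proposed here, not in the current patch;
        # code_page_decode() returns (text, consumed), keep only the text
        return codecs.code_page_decode(codecs.CP_ACP, filename, flags=0)[0]
    elif isinstance(filename, str):
        return filename
    else:
        raise TypeError()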
msg138490 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2011-06-17 05:49
> I don't know yet how Windows do decode bytes filenames
> (especially how it handles undecodable bytes), 
> I suppose that it uses MultiByteToWideChar using cp=CP_ACP and flags=0.
It's likely, yes.  But you don't need a new codec function for this.
What about something like .decode('mbcs', errors='windows')?

I still don't see the advantage of codecs.code_page_encode().
msg138516 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-17 15:41
> What about something like .decode('mbcs', errors='windows')?

Yes, we can use an error handler specific to the mbcs codec, but I would prefer to not introduce special error handlers.

For os.fsencode(), we can keep it unchanged, or add an optional "flags" argument to codecs.mbcs_decode().

> I still don't see the advantage of codecs.code_page_encode().

codecs.code_page_encode() and codecs.code_page_decode() are required for unit tests. If you don't want to add new public (C and Python) functions, we may add them to _testcapi.
msg145755 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-10-17 19:35
mbcs6.patch: update patch to tip.
msg145767 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-10-18 00:15
Version 7 of my patch. This patch is ready for review: I implemented all the TODO items.

Summary of the patch (of this issue):

 - fix the mbcs encoding to correctly handle the ignore & replace error handlers on all Windows versions
 - the mbcs encoding now supports any error handler (not only ignore and/or replace to encode/decode)
 - Add codecs.code_page_encode() and codecs.code_page_decode()

With the patch, Python 3.3 gives different results than Python 3.2 with the replace and ignore error handlers (this was required to fix bugs). I consider the new behaviour more correct than the previous one. It doesn't use the Windows "replace" mode, which differs from the Python "replace" mode.

codecs.code_page_encode() and codecs.code_page_decode() are currently used for unit tests, but they can be used to implement the cp65001 encoding in Python (or any other Windows code page). This encoding is regularly requested: see issues #6058, #7441 and #10920.

Changes since patch version 6:

 - handle multibyte encodings (cp932 and CP_UTF8)
 - the "replace" error handler doesn't use the Windows replace and ignore modes: it uses the Windows strict mode and replaces undecodable bytes by '?'. This change removes some differences between Windows versions (in some corner cases).
 - add more checks for integer overflow
 - add more tests

I only tried my patch on Windows Seven.

--

The codec works byte by byte / character by character if the string cannot be decoded/encoded in strict mode, so handling errors (with an error handler other than strict) can be slow. I didn't implement the optimizations suggested by Martin. Since the patch has a long test suite, it may be possible to implement them later.

--

The patch doesn't expose custom options (MB_COMPOSITE, MB_PRECOMPOSED, MB_USEGLYPHCHARS, ...). I consider that Python builtin error handlers (strict, ignore, replace, backslashreplace, ...) are enough.

--

I deferred my idea of decoding bytes filenames from the ANSI code page in posixmodule.c (and patching the Python os.fsencode/fsdecode functions). It should be discussed in another issue.
msg145858 - (view) Author: Roundup Robot (python-dev) Date: 2011-10-18 19:20
New changeset af0800b986b7 by Victor Stinner in branch 'default':
Issue #12281: Rewrite the MBCS codec to handle correctly replace and ignore
http://hg.python.org/cpython/rev/af0800b986b7
msg145863 - (view) Author: Roundup Robot (python-dev) Date: 2011-10-18 19:45
New changeset 5841920d1ef6 by Victor Stinner in branch 'default':
Issue #12281: Skip code page tests on non-Windows platforms
http://hg.python.org/cpython/rev/5841920d1ef6
msg145865 - (view) Author: Roundup Robot (python-dev) Date: 2011-10-18 19:54
New changeset 413b89242766 by Victor Stinner in branch 'default':
Issue #12281: Fix test_codecs.test_cp932() on Windows XP
http://hg.python.org/cpython/rev/413b89242766
msg145869 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-10-18 21:35
test_codecs passes on the Windows XP and Windows Seven buildbots.
msg146468 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-10-26 23:48
I added a cp65001 codec to Python 3.3: see issue #13216.
History
Date User Action Args
2011-10-26 23:48:25 haypo set messages: + msg146468
2011-10-18 21:35:59 haypo set status: open -> closed; resolution: fixed; messages: + msg145869
2011-10-18 19:54:28 python-dev set messages: + msg145865
2011-10-18 19:45:56 python-dev set messages: + msg145863
2011-10-18 19:20:07 python-dev set nosy: + python-dev; messages: + msg145858
2011-10-18 01:03:15 ezio.melotti set nosy: + ezio.melotti
2011-10-18 00:21:50 haypo set files: - mbcs6.patch
2011-10-18 00:15:42 haypo set files: + mbcs7.patch; messages: + msg145767
2011-10-17 19:36:03 haypo set files: - mbcs5.patch
2011-10-17 19:36:01 haypo set files: - mbcs4.patch
2011-10-17 19:35:45 haypo set files: + mbcs6.patch; messages: + msg145755
2011-06-17 15:41:06 haypo set messages: + msg138516
2011-06-17 05:49:00 amaury.forgeotdarc set messages: + msg138490
2011-06-17 00:35:34 haypo set messages: + msg138484
2011-06-16 23:52:09 haypo set messages: + msg138481
2011-06-16 23:51:36 amaury.forgeotdarc set messages: + msg138480
2011-06-16 23:42:34 haypo set files: + mbcs5.patch; messages: + msg138478
2011-06-16 00:32:37 haypo set files: - mbcs3.patch
2011-06-16 00:32:33 haypo set files: - mbcs2.patch
2011-06-16 00:32:30 haypo set files: - mbcs.patch
2011-06-16 00:32:22 haypo set files: + mbcs4.patch; messages: + msg138407
2011-06-13 13:59:21 haypo set messages: + msg138246
2011-06-13 13:52:55 haypo set files: + mbcs3.patch; messages: + msg138244
2011-06-10 13:48:19 haypo set messages: + msg138080
2011-06-10 13:44:49 haypo set messages: + msg138079
2011-06-10 13:37:51 haypo set files: + mbcs2.patch; messages: + msg138078
2011-06-08 12:59:37 haypo set messages: + msg137904
2011-06-08 12:57:24 haypo set nosy: + ocean-city; messages: + msg137903
2011-06-08 12:47:44 haypo set files: + mbcs.patch; keywords: + patch; messages: + msg137902
2011-06-07 22:02:38 haypo set nosy: + loewis
2011-06-07 22:01:22 haypo set messages: + msg137887
2011-06-07 21:48:04 haypo create