Author vstinner
Recipients amaury.forgeotdarc, loewis, ocean-city, vstinner
Date 2011-10-18.00:15:31
Message-id <1318896943.24.0.1432955931.issue12281@psf.upfronthosting.co.za>
In-reply-to
Content
Version 7 of my patch. This patch is ready for review: I implemented all the TODO items.

Summary of the patch (of this issue):

 - fix the mbcs encoding to correctly handle the ignore and replace error handlers on all Windows versions
 - the mbcs encoding now supports any error handler (not only ignore and/or replace), for both encoding and decoding
 - add codecs.code_page_encode() and codecs.code_page_decode()

With the patch, Python 3.3 will give different results than Python 3.2 with the replace and ignore error handlers; this was required to fix the bugs. I consider the new behaviour more correct than the previous one. The patch doesn't use the Windows "replace" mode, which differs from Python's "replace" mode.
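
To illustrate what Python's own "replace" and "ignore" modes mean, here is a small sketch. It uses the portable ascii codec as a stand-in (an assumption for illustration only, since mbcs exists only on Windows); it does not show the output of the patched mbcs codec.

```python
# Python's built-in error handlers, shown with the ascii codec
# (a stand-in chosen for portability; not the mbcs codec itself).
text = "caf\u00e9"

# Encoding: "replace" substitutes '?' for each unencodable character,
# "ignore" drops it.
replaced = text.encode("ascii", "replace")
ignored = text.encode("ascii", "ignore")

# Decoding: "replace" substitutes U+FFFD for each undecodable byte.
decoded = b"caf\xe9".decode("ascii", "replace")

print(replaced, ignored, ascii(decoded))
```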

codecs.code_page_encode() and codecs.code_page_decode() are currently used only for unit tests, but they can be used to implement the cp65001 encoding in Python (or any other Windows code page). This encoding is regularly asked for: see issues #6058, #7441 and #10920.
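
A minimal sketch of how cp65001 could be built on these two functions. The names cp65001_encode, cp65001_decode and HAVE_CODE_PAGE are hypothetical, and the fallback to the regular UTF-8 codec off Windows is an assumption added only so the sketch runs everywhere; it is not part of the patch.

```python
import codecs

CP_UTF8 = 65001  # Windows code page number for UTF-8

# code_page_encode/code_page_decode only exist on Windows builds of Python.
HAVE_CODE_PAGE = hasattr(codecs, "code_page_encode")

def cp65001_encode(text, errors="strict"):
    """Return (bytes, number of characters consumed)."""
    if HAVE_CODE_PAGE:
        return codecs.code_page_encode(CP_UTF8, text, errors)
    # Assumption: off Windows, stand in with the regular utf-8 codec.
    return codecs.utf_8_encode(text, errors)

def cp65001_decode(data, errors="strict"):
    """Return (str, number of bytes consumed)."""
    if HAVE_CODE_PAGE:
        # final=True: a truncated multibyte sequence is treated as an error.
        return codecs.code_page_decode(CP_UTF8, data, errors, True)
    return codecs.utf_8_decode(data, errors, True)
```

Both functions return the usual codec tuple, so the result of interest is the first item, for example cp65001_encode("abc")[0].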

Changes since version 6 of the patch:

 - handle multibyte encodings (cp932 and CP_UTF8)
 - the "replace" error handler no longer uses the Windows replace and ignore modes. It uses the Windows strict mode and replaces undecodable bytes with '?'. This change removes some differences between Windows versions (in some corner cases).
 - add more checks for integer overflow
 - add more tests

I have only tested my patch on Windows 7.

--

The codec works byte by byte / character by character if the string cannot be decoded/encoded in strict mode, so handling errors (with an error handler other than strict) can be slow. I didn't implement the optimizations suggested by Martin. Since the patch has a long test suite, it should be possible to implement them later.
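
The strategy described above can be sketched in pure Python as follows. This is only an illustration of the idea, not the C code of the patch: encode_with_fallback is a hypothetical helper, and it assumes a stateless encoding (ascii is used in the usage lines for portability).

```python
import codecs

def encode_with_fallback(text, encoding, errors):
    """Try strict mode on the whole string, then fall back to
    character-by-character encoding with the given error handler."""
    try:
        return text.encode(encoding)  # fast path: strict mode
    except UnicodeEncodeError:
        pass
    handler = codecs.lookup_error(errors)
    chunks = []
    pos = 0
    while pos < len(text):
        try:
            chunks.append(text[pos].encode(encoding))
            pos += 1
        except UnicodeEncodeError:
            # Rebuild the exception with positions relative to the full string.
            exc = UnicodeEncodeError(encoding, text, pos, pos + 1,
                                     "unencodable character")
            replacement, pos = handler(exc)
            if isinstance(replacement, str):
                replacement = replacement.encode(encoding)
            chunks.append(replacement)
    return b"".join(chunks)

print(encode_with_fallback("a\u00e9b", "ascii", "replace"))
print(encode_with_fallback("a\u00e9b", "ascii", "backslashreplace"))
```

The slow path costs one encode call per character, which is why error handling can be slow on long strings containing a single bad character.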

--

The patch doesn't expose custom options (MB_COMPOSITE, MB_PRECOMPOSED, MB_USEGLYPHCHARS, ...). I consider that Python's built-in error handlers (strict, ignore, replace, backslashreplace, ...) are enough.

--

I deferred my idea of decoding bytes filenames from the ANSI code page in posixmodule.c (and patching Python's os.fsencode/os.fsdecode functions). It should be discussed in another issue.
History
Date                 User      Action  Args
2011-10-18 00:15:44  vstinner  set     recipients: + vstinner, loewis, amaury.forgeotdarc, ocean-city
2011-10-18 00:15:43  vstinner  set     messageid: <1318896943.24.0.1432955931.issue12281@psf.upfronthosting.co.za>
2011-10-18 00:15:42  vstinner  link    issue12281 messages
2011-10-18 00:15:40  vstinner  create