classification
Title: mbcs encoding ignores errors
Type: enhancement Stage: test needed
Components: Unicode, Windows Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: ajaksu2, ezio.melotti, loewis, mhammond, tim.golden, vstinner
Priority: normal Keywords: patch

Created on 2003-11-29 01:24 by mhammond, last changed 2010-06-18 23:24 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
mbcs_errors.py mhammond, 2003-11-29 01:24 Trivial demo of the bug
mbcs_errors.patch mhammond, 2003-11-29 01:38 Working patch, but with a few issues
translate_reason_unicode.patch vstinner, 2010-06-11 00:54
mbcs_errors-py3k-3.patch vstinner, 2010-06-12 00:06
mbcs_errors-py3k-4.patch vstinner, 2010-06-16 22:21
Messages (17)
msg19177 - (view) Author: Mark Hammond (mhammond) * (Python committer) Date: 2003-11-29 01:24
The following snippet:

>>> u'@test-\u5171'.encode("mbcs", "strict")
'@test-?'

Should raise a UnicodeError.  The errors param is
completely ignored, and the function always works as
though errors='replace'.

Attaching a test case, and the start of a patch.  The
patch has a number of issues:
* I'm not sure what errors are considered 'mandatory'.
 I have handled 'strict', 'ignore' and 'replace' -
however, 'ignore' and 'replace' currently are exactly
the same (ie, replace)
* The Windows functions don't tell us exactly what
character failed in the conversion.  Thus, the
exception I raise implies the first character is the
one that failed.  For the same reason, I have made no
attempt to support error callbacks.

Comments/guidance appreciated.
msg19178 - (view) Author: Mark Hammond (mhammond) * (Python committer) Date: 2003-11-29 01:31
Logged In: YES 
user_id=14198

Attaching a patch.  This patch also attempts to handle
Encode, but I haven't worked out how to exercise this
code-path - ie, what mbcs encoded string can I pass that can
not be converted to unicode?

As I mentioned, patch has a few issues
msg19179 - (view) Author: Thomas Heller (theller) * (Python committer) Date: 2003-11-29 15:18
Logged In: YES 
user_id=11105

No idea why this was assigned to me - unicode is certainly
not one of my strengths.
msg19180 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-12-01 21:25
Logged In: YES 
user_id=21627

The conventional semantics of "ignore" would be "remove the
failing characters from the output". This would be difficult
to implement if the Microsoft API provides no detailed error
indication.

You could try to get more detailed error indication by
re-encoding the resulting string with a NULL buffer,
counting the number of characters that have successfully
been encoded, atleast in the .decode case. 

In the .encode case, you could try using \0 as the default
char. To my knowledge, no ACP ever uses \0 in a multi-byte
string.

What is the meaning of the WC_DEFAULTCHAR flag, in
WideCharToMultiByte, and why are you not using it?

I'm somewhat concerned with backwards compatibility, since
the mbcs codec has never returned errors. So this should be
applied to 2.4 only, and listed in whatsnew.tex.
msg82015 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2009-02-14 11:35
Is this behavior still present? If so, is it still interesting to change it?
msg82133 - (view) Author: Mark Hammond (mhammond) * (Python committer) Date: 2009-02-14 22:40
It is still present, but I'm not sure what problems can be seen due to
this so can't comment on its desirability.  It would also introduce a
backwards compatability concern but I've not enough experience to know
how much of a problem that would be in practice either.
msg106277 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-22 01:36
I patched py3k with mbcs_errors.patch (only encode_mbcs, not the decoder function) and most test pass: I opened #8784 for test_tarfile failure.

I don't think that it's a problem that mbcs only supports few error handlers, eg. 'strict', 'replace' and 'errors' (but not 'ignore' nor 'surrogateescape'). mbcs should be avoided anyway :-) It is kept for backward compatibility (with Python2). Python3 tries to avoid it by using the Unicode functions of Windows API.

I don't know exactly where mbcs is still used in Python3. If mbcs becomes more strict and raise new errors, I would like to say that the problem comes from the program, not in the encodig, and the program should be fixed (especilly if the "program" is the Python standard library).

About the backward compatibility with Python < 3.2: I don't know exactly if this change would be a problem or not. I bet that few people use (directly or indirectly) mbcs with Python 3.1 (on Windows), and few peple (or nobody) would notice this change. And as I wrote, if someone notices a problem: the problem should be fixed in the function using mbcs, not in the codec.
msg106278 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-22 01:38
Since this change breaks backward compatibility, it's a very bad idea to change mbcs codec in Python 2.7: remove this version from this issue.
msg106407 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-24 23:11
Updated version of the patch for py3k:
 - don't accept "ignore" error handler anymore
 - there is a FIXME near "mbcs_decode_error:"

The whole test suite pass with these patch.
msg107513 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-11 00:54
I worked again on the patch. I opened new issues to prepare the new mbcs codec:
 - #8966: ctypes: remove implicit conversion between unicode and bytes
 - #8967: Create PyErr_GetWindowsMessage() function
 - #8969: Windows: use (mbcs in) strict mode to encode/decode filenames, and enable os.fsencode()

#8967 can be used to get the translated message of a mbcs encode error. PyErr_GetWindowsMessage() returns a PyUnicodeObject, whereas make_translate_exception() and PyUnicodeTranslateError_SetReason() expect a "char*". Another patch is requied: translate_reason_unicode.patch (attached to this issue, not tested). But I don't think that the message is very important for now :-)

#8784 (tarfile/Windows: Don't use mbcs as the default encoding) is still open.
msg107517 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-11 01:28
New version of the patch:
 - decode_mbcs() calls raise_translate_exception() to set the error (in the previous patch, I'm not sure that the error was set)
 - include #8784 patch (tarfile uses utf-8 as the default encoding)
 - ctypes: use mbcs is strict mode instead of ignore mode. This is just a workaround, the real fix is to remove the implicit conversion between bytes and characters: see #8966

The patch requires #8969 patch (use mbcs in strict mode to encode/decode filenames).
msg107611 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-12 00:01
Tim: are you interested in testing this patch?
msg107612 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-12 00:06
Update the patch (I commited the patch on tarfile module): version 3.
msg107957 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-16 22:21
Patch version 4:
 - encode_mbcs() uses WC_NO_BEST_FIT_CHARS flag in strict mode. Examples: ğ and ł are not more replaced by g and l
 - encode_mbcs() doesn't set *repr to NULL on encode error: the caller does anyway destroy it
 - write more documentation about mbcs, especially about the error handlers and the changes in Python 3.2
msg107965 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-16 23:35
I commited the last patch to py3k: r82037. Let see how the buildbots react :-)
msg108018 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2010-06-17 14:21
I'm unlikely to get to it soon. If there's no urgency I can
look at it later. FWIW, it's not something I'm especially
familiar with.

On 12/06/2010 01:02, STINNER Victor wrote:
>
> STINNER Victor<victor.stinner@haypocalc.com>  added the comment:
>
> Tim: are you interested in testing this patch?
>
> ----------
> nosy: +tim.golden
>
> _______________________________________
> Python tracker<report@bugs.python.org>
> <http://bugs.python.org/issue850997>
> _______________________________________
msg108149 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-18 23:24
Close this issue: nothing special on the buildbots.
History
Date User Action Args
2010-06-18 23:24:39vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg108149
2010-06-17 14:21:56tim.goldensetmessages: + msg108018
2010-06-16 23:35:09vstinnersetmessages: + msg107965
2010-06-16 22:21:04vstinnersetfiles: + mbcs_errors-py3k-4.patch

messages: + msg107957
2010-06-12 00:07:08vstinnersetfiles: - mbcs_errors-py3k-2.patch
2010-06-12 00:07:02vstinnersetfiles: - mbcs_errors-py3k.patch
2010-06-12 00:06:53vstinnersetfiles: + mbcs_errors-py3k-3.patch

messages: + msg107612
2010-06-12 00:01:58vstinnersetnosy: + tim.golden
messages: + msg107611
2010-06-11 01:28:51vstinnersetfiles: + mbcs_errors-py3k-2.patch

messages: + msg107517
2010-06-11 00:54:17vstinnersetfiles: + translate_reason_unicode.patch

messages: + msg107513
2010-05-24 23:11:38vstinnersetfiles: + mbcs_errors-py3k.patch

messages: + msg106407
2010-05-22 01:38:50vstinnersetmessages: + msg106278
versions: - Python 2.7
2010-05-22 01:36:15vstinnersetmessages: + msg106277
2010-05-22 01:01:38vstinnersetnosy: + vstinner
2010-02-05 16:57:18ezio.melottisetnosy: + ezio.melotti

versions: + Python 2.7, Python 3.2
2009-02-14 22:40:35mhammondsetmessages: + msg82133
2009-02-14 12:14:14thellersetnosy: - theller
2009-02-14 11:35:45ajaksu2setnosy: + ajaksu2
messages: + msg82015
components: + Unicode
keywords: + patch
type: enhancement
stage: test needed
2003-11-29 01:24:21mhammondcreate