mbcs encoding ignores errors #39624

mhammond · 2003-11-29T01:24:21Z

BPO	850997
Nosy	@loewis, @mhammond, @vstinner, @devdanzin, @tjguk, @ezio-melotti
Files	mbcs_errors.py: Trivial demo of the bug mbcs_errors.patch: Working patch, but with a few issues translate_reason_unicode.patch mbcs_errors-py3k-3.patch mbcs_errors-py3k-4.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2010-06-18.23:24:39.064>
created_at = <Date 2003-11-29.01:24:21.000>
labels = ['type-feature', 'expert-unicode', 'OS-windows']
title = 'mbcs encoding ignores errors'
updated_at = <Date 2010-06-18.23:24:39.057>
user = 'https://github.com/mhammond'

bugs.python.org fields:

activity = <Date 2010-06-18.23:24:39.057>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2010-06-18.23:24:39.064>
closer = 'vstinner'
components = ['Unicode', 'Windows']
creation = <Date 2003-11-29.01:24:21.000>
creator = 'mhammond'
dependencies = []
files = ['1122', '1123', '17618', '17635', '17690']
hgrepos = []
issue_num = 850997
keywords = ['patch']
message_count = 17.0
messages = ['19177', '19178', '19179', '19180', '82015', '82133', '106277', '106278', '106407', '107513', '107517', '107611', '107612', '107957', '107965', '108018', '108149']
nosy_count = 6.0
nosy_names = ['loewis', 'mhammond', 'vstinner', 'ajaksu2', 'tim.golden', 'ezio.melotti']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'test needed'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue850997'
versions = ['Python 3.2']

mhammond · 2003-11-29T01:24:21Z

The following snippet:

>>> u'@test-\u5171'.encode("mbcs", "strict")
'@test-?'

Should raise a UnicodeError. The errors param is
completely ignored, and the function always works as
though errors='replace'.

Attaching a test case, and the start of a patch. The
patch has a number of issues:

I'm not sure what errors are considered 'mandatory'.
I have handled 'strict', 'ignore' and 'replace' -
however, 'ignore' and 'replace' currently are exactly
the same (ie, replace)
The Windows functions don't tell us exactly what
character failed in the conversion. Thus, the
exception I raise implies the first character is the
one that failed. For the same reason, I have made no
attempt to support error callbacks.

Comments/guidance appreciated.

mhammond · 2003-11-29T01:31:16Z

Logged In: YES
user_id=14198

Attaching a patch. This patch also attempts to handle
Encode, but I haven't worked out how to exercise this
code-path - ie, what mbcs encoded string can I pass that can
not be converted to unicode?

As I mentioned, patch has a few issues

theller · 2003-11-29T15:18:42Z

Logged In: YES
user_id=11105

No idea why this was assigned to me - unicode is certainly
not one of my strengths.

loewis · 2003-12-01T21:25:15Z

Logged In: YES
user_id=21627

The conventional semantics of "ignore" would be "remove the
failing characters from the output". This would be difficult
to implement if the Microsoft API provides no detailed error
indication.

You could try to get more detailed error indication by
re-encoding the resulting string with a NULL buffer,
counting the number of characters that have successfully
been encoded, atleast in the .decode case.

In the .encode case, you could try using \0 as the default
char. To my knowledge, no ACP ever uses \0 in a multi-byte
string.

What is the meaning of the WC_DEFAULTCHAR flag, in
WideCharToMultiByte, and why are you not using it?

I'm somewhat concerned with backwards compatibility, since
the mbcs codec has never returned errors. So this should be
applied to 2.4 only, and listed in whatsnew.tex.

devdanzin · 2009-02-14T11:35:46Z

Is this behavior still present? If so, is it still interesting to change it?

mhammond · 2009-02-14T22:40:36Z

It is still present, but I'm not sure what problems can be seen due to
this so can't comment on its desirability. It would also introduce a
backwards compatability concern but I've not enough experience to know
how much of a problem that would be in practice either.

vstinner · 2010-05-22T01:36:14Z

I patched py3k with mbcs_errors.patch (only encode_mbcs, not the decoder function) and most test pass: I opened bpo-8784 for test_tarfile failure.

I don't think that it's a problem that mbcs only supports few error handlers, eg. 'strict', 'replace' and 'errors' (but not 'ignore' nor 'surrogateescape'). mbcs should be avoided anyway :-) It is kept for backward compatibility (with Python2). Python3 tries to avoid it by using the Unicode functions of Windows API.

I don't know exactly where mbcs is still used in Python3. If mbcs becomes more strict and raise new errors, I would like to say that the problem comes from the program, not in the encodig, and the program should be fixed (especilly if the "program" is the Python standard library).

About the backward compatibility with Python < 3.2: I don't know exactly if this change would be a problem or not. I bet that few people use (directly or indirectly) mbcs with Python 3.1 (on Windows), and few peple (or nobody) would notice this change. And as I wrote, if someone notices a problem: the problem should be fixed in the function using mbcs, not in the codec.

vstinner · 2010-05-22T01:38:50Z

Since this change breaks backward compatibility, it's a very bad idea to change mbcs codec in Python 2.7: remove this version from this issue.

vstinner · 2010-05-24T23:11:36Z

Updated version of the patch for py3k:

don't accept "ignore" error handler anymore
there is a FIXME near "mbcs_decode_error:"

The whole test suite pass with these patch.

vstinner · 2010-06-11T00:54:16Z

I worked again on the patch. I opened new issues to prepare the new mbcs codec:

bpo-8966: ctypes: remove implicit conversion between unicode and bytes
bpo-8967: Create PyErr_GetWindowsMessage() function
bpo-8969: Windows: use (mbcs in) strict mode to encode/decode filenames, and enable os.fsencode()

bpo-8967 can be used to get the translated message of a mbcs encode error. PyErr_GetWindowsMessage() returns a PyUnicodeObject, whereas make_translate_exception() and PyUnicodeTranslateError_SetReason() expect a "char*". Another patch is requied: translate_reason_unicode.patch (attached to this issue, not tested). But I don't think that the message is very important for now :-)

bpo-8784 (tarfile/Windows: Don't use mbcs as the default encoding) is still open.

vstinner · 2010-06-11T01:28:48Z

New version of the patch:

decode_mbcs() calls raise_translate_exception() to set the error (in the previous patch, I'm not sure that the error was set)
include bpo-8784 patch (tarfile uses utf-8 as the default encoding)
ctypes: use mbcs is strict mode instead of ignore mode. This is just a workaround, the real fix is to remove the implicit conversion between bytes and characters: see bpo-8966

The patch requires bpo-8969 patch (use mbcs in strict mode to encode/decode filenames).

vstinner · 2010-06-12T00:01:58Z

Tim: are you interested in testing this patch?

vstinner · 2010-06-12T00:06:51Z

Update the patch (I commited the patch on tarfile module): version 3.

vstinner · 2010-06-16T22:21:02Z

Patch version 4:

encode_mbcs() uses WC_NO_BEST_FIT_CHARS flag in strict mode. Examples: ğ and ł are not more replaced by g and l
encode_mbcs() doesn't set *repr to NULL on encode error: the caller does anyway destroy it
write more documentation about mbcs, especially about the error handlers and the changes in Python 3.2

vstinner · 2010-06-16T23:35:09Z

I commited the last patch to py3k: r82037. Let see how the buildbots react :-)

tjguk · 2010-06-17T14:21:56Z

I'm unlikely to get to it soon. If there's no urgency I can
look at it later. FWIW, it's not something I'm especially
familiar with.

On 12/06/2010 01:02, STINNER Victor wrote:

STINNER Victor<victor.stinner@haypocalc.com> added the comment:

Tim: are you interested in testing this patch?

----------
nosy: +tim.golden

Python tracker<report@bugs.python.org>
<http://bugs.python.org/issue850997\>

vstinner · 2010-06-18T23:24:39Z

Close this issue: nothing special on the buildbots.

mhammond added OS-windows labels Nov 29, 2003

devdanzin mannequin added topic-unicode type-feature A feature request or enhancement labels Feb 14, 2009

vstinner closed this as completed Jun 18, 2010

ezio-melotti transferred this issue from another repository Apr 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mbcs encoding ignores errors #39624

mbcs encoding ignores errors #39624

mhammond commented Nov 29, 2003

mhammond commented Nov 29, 2003

mhammond commented Nov 29, 2003

theller commented Nov 29, 2003

loewis mannequin commented Dec 1, 2003

devdanzin mannequin commented Feb 14, 2009

mhammond commented Feb 14, 2009

vstinner commented May 22, 2010

vstinner commented May 22, 2010

vstinner commented May 24, 2010

vstinner commented Jun 11, 2010

vstinner commented Jun 11, 2010

vstinner commented Jun 12, 2010

vstinner commented Jun 12, 2010

vstinner commented Jun 16, 2010

vstinner commented Jun 16, 2010

tjguk commented Jun 17, 2010

vstinner commented Jun 18, 2010