Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mbcs encoding ignores errors #39624

Closed
mhammond opened this issue Nov 29, 2003 · 17 comments
Closed

mbcs encoding ignores errors #39624

mhammond opened this issue Nov 29, 2003 · 17 comments
Labels
OS-windows topic-unicode type-feature A feature request or enhancement

Comments

@mhammond
Copy link
Contributor

BPO 850997
Nosy @loewis, @mhammond, @vstinner, @devdanzin, @tjguk, @ezio-melotti
Files
  • mbcs_errors.py: Trivial demo of the bug
  • mbcs_errors.patch: Working patch, but with a few issues
  • translate_reason_unicode.patch
  • mbcs_errors-py3k-3.patch
  • mbcs_errors-py3k-4.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-06-18.23:24:39.064>
    created_at = <Date 2003-11-29.01:24:21.000>
    labels = ['type-feature', 'expert-unicode', 'OS-windows']
    title = 'mbcs encoding ignores errors'
    updated_at = <Date 2010-06-18.23:24:39.057>
    user = 'https://github.com/mhammond'

    bugs.python.org fields:

    activity = <Date 2010-06-18.23:24:39.057>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-06-18.23:24:39.064>
    closer = 'vstinner'
    components = ['Unicode', 'Windows']
    creation = <Date 2003-11-29.01:24:21.000>
    creator = 'mhammond'
    dependencies = []
    files = ['1122', '1123', '17618', '17635', '17690']
    hgrepos = []
    issue_num = 850997
    keywords = ['patch']
    message_count = 17.0
    messages = ['19177', '19178', '19179', '19180', '82015', '82133', '106277', '106278', '106407', '107513', '107517', '107611', '107612', '107957', '107965', '108018', '108149']
    nosy_count = 6.0
    nosy_names = ['loewis', 'mhammond', 'vstinner', 'ajaksu2', 'tim.golden', 'ezio.melotti']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'test needed'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue850997'
    versions = ['Python 3.2']

    @mhammond
    Copy link
    Contributor Author

    The following snippet:

    >>> u'@test-\u5171'.encode("mbcs", "strict")
    '@test-?'

    Should raise a UnicodeError. The errors param is
    completely ignored, and the function always works as
    though errors='replace'.

    Attaching a test case, and the start of a patch. The
    patch has a number of issues:

    • I'm not sure what errors are considered 'mandatory'.
      I have handled 'strict', 'ignore' and 'replace' -
      however, 'ignore' and 'replace' currently are exactly
      the same (ie, replace)
    • The Windows functions don't tell us exactly what
      character failed in the conversion. Thus, the
      exception I raise implies the first character is the
      one that failed. For the same reason, I have made no
      attempt to support error callbacks.

    Comments/guidance appreciated.

    @mhammond
    Copy link
    Contributor Author

    Logged In: YES
    user_id=14198

    Attaching a patch. This patch also attempts to handle
    Encode, but I haven't worked out how to exercise this
    code-path - ie, what mbcs encoded string can I pass that can
    not be converted to unicode?

    As I mentioned, patch has a few issues

    @theller
    Copy link

    theller commented Nov 29, 2003

    Logged In: YES
    user_id=11105

    No idea why this was assigned to me - unicode is certainly
    not one of my strengths.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Dec 1, 2003

    Logged In: YES
    user_id=21627

    The conventional semantics of "ignore" would be "remove the
    failing characters from the output". This would be difficult
    to implement if the Microsoft API provides no detailed error
    indication.

    You could try to get more detailed error indication by
    re-encoding the resulting string with a NULL buffer,
    counting the number of characters that have successfully
    been encoded, atleast in the .decode case.

    In the .encode case, you could try using \0 as the default
    char. To my knowledge, no ACP ever uses \0 in a multi-byte
    string.

    What is the meaning of the WC_DEFAULTCHAR flag, in
    WideCharToMultiByte, and why are you not using it?

    I'm somewhat concerned with backwards compatibility, since
    the mbcs codec has never returned errors. So this should be
    applied to 2.4 only, and listed in whatsnew.tex.

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Feb 14, 2009

    Is this behavior still present? If so, is it still interesting to change it?

    @devdanzin devdanzin mannequin added topic-unicode type-feature A feature request or enhancement labels Feb 14, 2009
    @mhammond
    Copy link
    Contributor Author

    It is still present, but I'm not sure what problems can be seen due to
    this so can't comment on its desirability. It would also introduce a
    backwards compatability concern but I've not enough experience to know
    how much of a problem that would be in practice either.

    @vstinner
    Copy link
    Member

    I patched py3k with mbcs_errors.patch (only encode_mbcs, not the decoder function) and most test pass: I opened bpo-8784 for test_tarfile failure.

    I don't think that it's a problem that mbcs only supports few error handlers, eg. 'strict', 'replace' and 'errors' (but not 'ignore' nor 'surrogateescape'). mbcs should be avoided anyway :-) It is kept for backward compatibility (with Python2). Python3 tries to avoid it by using the Unicode functions of Windows API.

    I don't know exactly where mbcs is still used in Python3. If mbcs becomes more strict and raise new errors, I would like to say that the problem comes from the program, not in the encodig, and the program should be fixed (especilly if the "program" is the Python standard library).

    About the backward compatibility with Python < 3.2: I don't know exactly if this change would be a problem or not. I bet that few people use (directly or indirectly) mbcs with Python 3.1 (on Windows), and few peple (or nobody) would notice this change. And as I wrote, if someone notices a problem: the problem should be fixed in the function using mbcs, not in the codec.

    @vstinner
    Copy link
    Member

    Since this change breaks backward compatibility, it's a very bad idea to change mbcs codec in Python 2.7: remove this version from this issue.

    @vstinner
    Copy link
    Member

    Updated version of the patch for py3k:

    • don't accept "ignore" error handler anymore
    • there is a FIXME near "mbcs_decode_error:"

    The whole test suite pass with these patch.

    @vstinner
    Copy link
    Member

    I worked again on the patch. I opened new issues to prepare the new mbcs codec:

    • bpo-8966: ctypes: remove implicit conversion between unicode and bytes
    • bpo-8967: Create PyErr_GetWindowsMessage() function
    • bpo-8969: Windows: use (mbcs in) strict mode to encode/decode filenames, and enable os.fsencode()

    bpo-8967 can be used to get the translated message of a mbcs encode error. PyErr_GetWindowsMessage() returns a PyUnicodeObject, whereas make_translate_exception() and PyUnicodeTranslateError_SetReason() expect a "char*". Another patch is requied: translate_reason_unicode.patch (attached to this issue, not tested). But I don't think that the message is very important for now :-)

    bpo-8784 (tarfile/Windows: Don't use mbcs as the default encoding) is still open.

    @vstinner
    Copy link
    Member

    New version of the patch:

    • decode_mbcs() calls raise_translate_exception() to set the error (in the previous patch, I'm not sure that the error was set)
    • include bpo-8784 patch (tarfile uses utf-8 as the default encoding)
    • ctypes: use mbcs is strict mode instead of ignore mode. This is just a workaround, the real fix is to remove the implicit conversion between bytes and characters: see bpo-8966

    The patch requires bpo-8969 patch (use mbcs in strict mode to encode/decode filenames).

    @vstinner
    Copy link
    Member

    Tim: are you interested in testing this patch?

    @vstinner
    Copy link
    Member

    Update the patch (I commited the patch on tarfile module): version 3.

    @vstinner
    Copy link
    Member

    Patch version 4:

    • encode_mbcs() uses WC_NO_BEST_FIT_CHARS flag in strict mode. Examples: ğ and ł are not more replaced by g and l
    • encode_mbcs() doesn't set *repr to NULL on encode error: the caller does anyway destroy it
    • write more documentation about mbcs, especially about the error handlers and the changes in Python 3.2

    @vstinner
    Copy link
    Member

    I commited the last patch to py3k: r82037. Let see how the buildbots react :-)

    @tjguk
    Copy link
    Member

    tjguk commented Jun 17, 2010

    I'm unlikely to get to it soon. If there's no urgency I can
    look at it later. FWIW, it's not something I'm especially
    familiar with.

    On 12/06/2010 01:02, STINNER Victor wrote:

    STINNER Victor<victor.stinner@haypocalc.com> added the comment:

    Tim: are you interested in testing this patch?

    ----------
    nosy: +tim.golden


    Python tracker<report@bugs.python.org>
    <http://bugs.python.org/issue850997\>


    @vstinner
    Copy link
    Member

    Close this issue: nothing special on the buildbots.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    OS-windows topic-unicode type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants