Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

patch for mbcs codecs #43070

Closed
ocean-city mannequin opened this issue Mar 22, 2006 · 29 comments
Closed

patch for mbcs codecs #43070

ocean-city mannequin opened this issue Mar 22, 2006 · 29 comments

Comments

@ocean-city
Copy link
Mannequin

ocean-city mannequin commented Mar 22, 2006

BPO 1455898
Nosy @malemburg, @loewis, @doerwalter
Files
  • a.zip: the files to reproduce the problem
  • b.txt: broken result from a.zip
  • test_mbcs.py: test (place this in lib/test)
  • mbcs.patch: patch (support 64bit ssize_tenvironment)
  • fix.zip: addtional patch + test
  • fix.patch: addtional patch (version2)
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2006-08-02.13:54:46.000>
    created_at = <Date 2006-03-22.07:31:46.000>
    labels = ['OS-windows']
    title = 'patch for mbcs codecs'
    updated_at = <Date 2006-08-02.13:54:46.000>
    user = 'https://bugs.python.org/ocean-city'

    bugs.python.org fields:

    activity = <Date 2006-08-02.13:54:46.000>
    actor = 'loewis'
    assignee = 'none'
    closed = True
    closed_date = None
    closer = None
    components = ['Windows']
    creation = <Date 2006-03-22.07:31:46.000>
    creator = 'ocean-city'
    dependencies = []
    files = ['7091', '7092', '7093', '7094', '7095', '7096']
    hgrepos = []
    issue_num = 1455898
    keywords = ['patch']
    message_count = 29.0
    messages = ['49812', '49813', '49814', '49815', '49816', '49817', '49818', '49819', '49820', '49821', '49822', '49823', '49824', '49825', '49826', '49827', '49828', '49829', '49830', '49831', '49832', '49833', '49834', '49835', '49836', '49837', '49838', '49839', '49840']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'loewis', 'doerwalter', 'ocean-city', 'jwnmulder']
    pr_nums = []
    priority = 'high'
    resolution = 'accepted'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue1455898'
    versions = ['Python 2.5']

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 22, 2006

    Hello.

    I have noticed mbcs codecs sometimes generates broken
    string. I'm using Windows(Japanese) so mbcs is mapped
    to cp932 (close to shift_jis)

    When I run the attached script "a.zip", the entry
    "Error 00007"'s message becomes broken like attached
    file "b.txt".

    I think this happens because the string passed to
    PyUnicode_DecodeMBCS() sometimes terminates with
    leading byte, and MultiByteToWideChar() counts it for
    size of result string.buffer size.

    I hope attached patch "mbcs.patch" may fix the problem.
    It would be nice if this bug will be fixed in 2.4.3...
    Thank you.

    @ocean-city ocean-city mannequin closed this as completed Mar 22, 2006
    @ocean-city ocean-city mannequin added the OS-windows label Mar 22, 2006
    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 22, 2006

    Logged In: YES
    user_id=1200846

    I forgot to mention this. "mbcs.patch" is for
    release24-maint branch.

    @malemburg
    Copy link
    Member

    Logged In: YES
    user_id=38388

    As I understand your comment, the mbcs codec will have a
    problem if the input string terminates with a lead byte.

    Could you add a comment regarding this to the patch ?!

    I can't test the patch, since I don't have a Japanese
    Windows to check on, but from looking at the patch, it seems OK.

    @malemburg
    Copy link
    Member

    Logged In: YES
    user_id=38388

    One more nit: the doc patch is missing. Please add a patch
    for the API docs.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 22, 2006

    Logged In: YES
    user_id=1200846

    Thank you for reply. How about this? (I'm a newbie, I hope
    this is right tex format but... can you confirm this? I
    created this patch by copy & paste from
    PyUnicode_DecodeUTF16Stateful and some modification)

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 22, 2006

    Logged In: YES
    user_id=1200846

    Sorry, I found problem when tried more long text file...
    Please wait. I'll investigate more intensibly.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 22, 2006

    Logged In: YES
    user_id=1200846

    Sorry, I was stupid.

    MSDN
    (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_0o2t.asp)
    saids,

    IsDBCSLeadByte can only indicate a potential lead byte value.

    IsDBCSLeadByte was returning 1 for some trail byte (ex: "歴"[1])

    The patch "mbcs3a.patch" worked for me, but _mbsbtype is
    probably compiler specific. Is that OK?

    The patch "mbcs3b.patch" also worked for me and it only uses
    Win32API, but I don't have enough faith on this
    implementation...

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 23, 2006

    Logged In: YES
    user_id=1200846

    Hello. This is my final patch. (mbcs4.patch)

    • mbcs3a.patch: _mbsbtype depends on locale not system ANSI
      code page. so probably it's not good to use it with
      MultiByteToWideChar.

    • mbcs3b.patch: CharNext may cause buffer overflow. and
      this patch always calls CharPrev but it's not needed if
      string is not terminated with "potensial" lead byte.

    I hope this is stable enough to commit on repositry. Thank you.

    @doerwalter
    Copy link
    Contributor

    Logged In: YES
    user_id=89016

    This isn't a bugfix in the strictest sense, so IMHO this
    patch shouldn't go into 2.4.

    If the patch goes into 2.5, it would need the appropriate
    changes to encodings/mbcs.py (i.e. the IncrementalDecoder
    would have to be changed (inheriting from
    BufferedIncrementalDecoder).

    I realize that this patch might be hard to test, as results
    are dependent on locale. Nevertheless at least some tests
    would be good (even if they are only run or do something
    useful on a certain locale and are skipped otherwise).

    ocean-city, can you update the patch for the trunk and add
    tests?

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 24, 2006

    Logged In: YES
    user_id=1200846

    OK, I'll try.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 26, 2006

    Logged In: YES
    user_id=21627

    I have reservations against this patch because of the
    quasi-anonymous nature of the submission. ocean-city, can
    you please state your real name? Would you also be willing
    to fill out a contributor form, as shown on

    http://www.python.org/psf/contrib-form.html

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 27, 2006

    Logged In: YES
    user_id=1200846

    My real name is Hirokazu Yamamoto. But sorry, I don't have
    FAX. (It's needed to send contributor form, isn't it?)

    I'll attach the patch updated for trunk. And I'll attach the
    tests.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 27, 2006

    Logged In: YES
    user_id=1200846

    I replaced tests. Probably this is better instead of
    comparing the two string generated by same decoder.

    @doerwalter
    Copy link
    Contributor

    Logged In: YES
    user_id=89016

    _buffer_decode() in the IncrementalDecoder ignores the final
    argument. IncrementalDecoder._buffer_decode() should pass on
    its final argument to _codecsmodules.c::mbcs_decode(), which
    should be extended to accept the final argument. Also
    PyUnicode_DecodeMBCSStateful() must handle consumed == NULL
    correctly (with your patch it drops trailing lead bytes even
    if consumed == NULL)

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 28, 2006

    Logged In: YES
    user_id=1200846

    You are right. I've updated the patch. (mbcs5.patch)

    >>> import codecs
    [20198 refs]
    >>> d = codecs.getincrementaldecoder("mbcs")()
    [20198 refs]
    >>> d.decode('\x82\xa0\x82')
    u'\u3042'
    [20198 refs]
    >>> d.decode('')
    u''
    [20198 refs]
    >>> d.decode('', final=True)
    u'\x00'
    [20198 refs]

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Apr 7, 2006

    Logged In: YES
    user_id=1200846

    I have sent contributor form via postal mail. Probably you
    can get it after 10 days.

    @doerwalter
    Copy link
    Contributor

    Logged In: YES
    user_id=89016

    I think the default value for final in mbcs_decode() should
    be true, so that the stateless decoder detects incomplete
    byte sequences too.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented May 2, 2006

    Logged In: YES
    user_id=1200846

    I updated the patch. (I copy and pasted "int final = 0" from
    above code (ex: utf_16_ex_decode), maybe they also should be
    changed for consistency?)

    And one more thing, I noticed "errors" is ignored now. We
    can detect invalid character if we set MB_ERR_INVALID_CHARS
    flag when calling MultiByteToWideChar, but we cannot tell
    where is the position of invalid character, and MSDN saids
    this flag is available Win2000SP4 or later (I don't know
    why)
    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_17si.asp
    So I didn't make the patch for it.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented May 24, 2006

    Logged In: YES
    user_id=1200846

    I updated the patch.

    • PyUnicode_DecodeMBCS now supports size >= INT_MAX. (I
      don't have machine to test such big string, but I have
      tested this routine replaced INT_MAX with 2 and 3)

    PyUnicode_DecodeMBCS does not support size >= INT_MAX yet,
    but probably I'll fix it too.

    This patch includes Patch#1494487.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented May 25, 2006

    Logged In: YES
    user_id=1200846

    PyUnicode_DecodeMBCS does not support size >= INT_MAX yet,
    but probably I'll fix it too.

    Done. Attached as "mbcs_win64_support.patch".

    Now, total summary...

    - MBCS decoder and encoder now supports 64bit Py_ssize_t
    

    environment. (I don't have such machine, but I checked
    routine by defining NEED_RETRY and redefining INT_MAX as 2,
    3, 4)

    - Fixed a bug of MBCS incremental decoder which was
    

    originaly reported by me.

    @doerwalter
    Copy link
    Contributor

    Logged In: YES
    user_id=89016

    The change to PyUnicode_Resize() should be reverted (or done
    in a way that doesn't lead to bugs).

    Unfortunately I don't have a Windows where I can test the
    patch, so I'm unassigning the bug.

    You should probably find someone on python-dev with a
    multibyte version of Windows to look at the patch.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented May 26, 2006

    Logged In: YES
    user_id=1200846

    The change to PyUnicode_Resize() should be reverted (or done
    in a way that doesn't lead to bugs).

    Sorry, how about this?

    Index: Objects/unicodeobject.c
    ===================================================================

    --- Objects/unicodeobject.c	(revision 46417)
    +++ Objects/unicodeobject.c	(working copy)
    @@ -326,7 +326,7 @@
     	return -1;
         }
         v = (PyUnicodeObject *)*unicode;
    -    if (v == NULL || !PyUnicode_Check(v) || v->ob_refcnt !=
    1 || length < 0) {
    +    if (v == NULL || !PyUnicode_Check(v) || length < 0) {
     	PyErr_BadInternalCall();
     	return -1;
         }
    @@ -335,7 +335,7 @@
            possible since these are being shared. We simply
    return a fresh
            copy with the same Unicode content. */
         if (v->length != length &&
    -	(v == unicode_empty || v->length == 1)) {
    +	(v == unicode_empty || v->length == 1 || v->ob_refcnt != 1)) {
     	PyUnicodeObject *w = _PyUnicode_New(length);
     	if (w == NULL)
     	    return -1;

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented May 26, 2006

    Logged In: YES
    user_id=1200846

    I reverted PyUnicode_Resize() patch for now, and recreated
    the patch as "mbcs.patch".

    You should probably find someone on python-dev with a
    multibyte version of Windows to look at the patch.

    OK, I will.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Jun 14, 2006

    Logged In: YES
    user_id=21627

    Thanks for the patch. Committed as r46945.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Jul 25, 2006

    Logged In: YES
    user_id=1200846

    Sorry, I reopened this issue because I found problem.

    With attached "mbcs.py" and "mbcs.txt", result file
    "hoge.txt" gets corrupted. This happens because Codec.decode
    is called incrementally, while default value for final in
    mbcs_decode() is True.

    I think the default value for final in mbcs_decode() should
    be true, so that the stateless decoder detects incomplete
    byte sequences too.

    Probably this is not true. I think "stateless" means codec
    doesn't have internal state for incremental usage, doesn't
    mean it is not called incrementally.

    # I hope attached "fix.patch" fixes the problem.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Jul 25, 2006

    Logged In: YES
    user_id=1200846

    >I think the default value for final in mbcs_decode() should
    >be true, so that the stateless decoder detects incomplete
    >byte sequences too.

    Probably this is not true.

    Sorry, I lied. Stateless decoder was exactly the thing
    you said.

      >>> import codecs
      >>> d = codecs.getdecoder("cp932")
      >>> d('\x82')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      UnicodeDecodeError: 'cp932' codec can't decode byte 0x82
      in position 0: incomplete multibyte sequence

    Problem was on StreamReader. StreamReader should treat
    "final" as false, but I didn't change this code,

      class StreamReader(Codec,codecs.StreamReader):
          pass

    so StreamReader wrongly treated "final" as True.

    I cloned routine from Lib/encoding/utf-8.py. I hope
    finally this meets requirement of codec system...

    @jwnmulder
    Copy link
    Mannequin

    jwnmulder mannequin commented Aug 1, 2006

    Logged In: YES
    user_id=770969

    could be related to bug report 1532726 ?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 2, 2006

    Logged In: YES
    user_id=21627

    jwnmulder: there is definitely no relationship.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 2, 2006

    Logged In: YES
    user_id=21627

    Thanks for the update. Committed as r51046. Please create
    new a new patch the next time you need to further changes.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants