patch for mbcs codecs #43070
Comments
Hello. I have noticed that the mbcs codec sometimes generates broken When I run the attached script "a.zip", the entry I think this happens because the string passed to I hope the attached patch "mbcs.patch" may fix the problem. |
Logged In: YES I forgot to mention this. "mbcs.patch" is for |
Logged In: YES As I understand your comment, the mbcs codec will have a Could you add a comment regarding this to the patch ?! I can't test the patch, since I don't have a Japanese |
Logged In: YES One more nit: the doc patch is missing. Please add a patch |
Logged In: YES Thank you for the reply. How about this? (I'm a newbie, I hope |
Logged In: YES Sorry, I found a problem when I tried a longer text file... |
Logged In: YES Sorry, I was stupid. MSDN
IsDBCSLeadByte was returning 1 for some trail byte (ex: "歴"[1]) The patch "mbcs3a.patch" worked for me, but _mbsbtype is The patch "mbcs3b.patch" also worked for me and it only uses |
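The observation above, that a lead-byte test can return true for a trail byte, can be reproduced without Windows. The following is a minimal sketch, not the patch's C code: it assumes Python 3, uses the cross-platform `cp932` codec as a stand-in for the Japanese `mbcs` code page, and `is_lead_byte` and `split_point` are hypothetical helpers that mimic a context-free range check and a scan from the start of the buffer.

```python
def is_lead_byte(value: int) -> bool:
    """Context-free check: is this byte value in a cp932 lead-byte range?"""
    return 0x81 <= value <= 0x9F or 0xE0 <= value <= 0xFC


def split_point(data: bytes) -> int:
    """Return how many leading bytes of data form complete cp932 characters.

    Scans from the start so that each byte is classified in context.
    """
    i = 0
    while i < len(data):
        step = 2 if is_lead_byte(data[i]) else 1
        if i + step > len(data):
            break  # dangling lead byte at the end of the buffer
        i += step
    return i


encoded = "歴".encode("cp932")  # two bytes: a lead byte followed by a trail byte
trail = encoded[1]

# The trail byte, looked at in isolation, also falls in a lead-byte range.
print(hex(trail), is_lead_byte(trail))

# Scanning from the start classifies it correctly as part of one character.
print(split_point(encoded))
print(split_point(encoded + encoded[:1]))
```

The design point is that only a scan that starts at a known character boundary can tell a lead byte from a trail byte, so checking the last byte of a buffer on its own is not reliable.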
Logged In: YES Hello. This is my final patch. (mbcs4.patch)
I hope this is stable enough to commit to the repository. Thank you. |
Logged In: YES This isn't a bugfix in the strictest sense, so IMHO this If the patch goes into 2.5, it would need the appropriate I realize that this patch might be hard to test, as results ocean-city, can you update the patch for the trunk and add |
Logged In: YES OK, I'll try. |
Logged In: YES I have reservations against this patch because of the |
Logged In: YES My real name is Hirokazu Yamamoto. But sorry, I don't have I'll attach the patch updated for trunk. And I'll attach the |
Logged In: YES I replaced tests. Probably this is better instead of |
Logged In: YES _buffer_decode() in the IncrementalDecoder ignores the final |
Logged In: YES You are right. I've updated the patch. (mbcs5.patch)
>>> import codecs
[20198 refs]
>>> d = codecs.getincrementaldecoder("mbcs")()
[20198 refs]
>>> d.decode('\x82\xa0\x82')
u'\u3042'
[20198 refs]
>>> d.decode('')
u''
[20198 refs]
>>> d.decode('', final=True)
u'\x00'
[20198 refs] |
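The session above is Python 2 on Windows. A rough Python 3 analogue, assuming the cross-platform `cp932` codec as a stand-in for `mbcs` (the two may differ in how they treat a dangling byte at the end of input), shows the buffering behaviour an incremental decoder is expected to have:

```python
import codecs

decoder = codecs.getincrementaldecoder("cp932")()

# One complete character followed by a dangling lead byte.
first = decoder.decode(b"\x82\xa0\x82")
# The trail byte that completes the pending character.
second = decoder.decode(b"\xa0")
print(repr(first), repr(second))

decoder.reset()
decoder.decode(b"\x82")  # lead byte only; nothing can be emitted yet

raised = False
try:
    decoder.decode(b"", final=True)  # signal that no more input is coming
except UnicodeDecodeError as exc:
    raised = True
    print("final=True with a pending byte:", exc.reason)
```

With `final=False` the decoder keeps the incomplete byte for the next call; only `final=True` forces it to decide what to do with leftover input.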
Logged In: YES I have sent contributor form via postal mail. Probably you |
Logged In: YES I think the default value for final in mbcs_decode() should |
Logged In: YES I updated the patch. (I copy and pasted "int final = 0" from And one more thing, I noticed "errors" is ignored now. We |
Logged In: YES I updated the patch.
PyUnicode_DecodeMBCS does not support size >= INT_MAX yet, This patch includes Patch#1494487. |
Logged In: YES
Done. Attached as "mbcs_win64_support.patch". Now, total summary...
environment. (I don't have such a machine, but I checked
originally reported by me. |
Logged In: YES The change to PyUnicode_Resize() should be reverted (or done Unfortunately I don't have a Windows where I can test the You should probably find someone on python-dev with a |
Logged In: YES
Sorry, how about this?
Index: Objects/unicodeobject.c
--- Objects/unicodeobject.c (revision 46417)
+++ Objects/unicodeobject.c (working copy)
@@ -326,7 +326,7 @@
return -1;
}
v = (PyUnicodeObject *)*unicode;
- if (v == NULL || !PyUnicode_Check(v) || v->ob_refcnt != 1 || length < 0) {
+ if (v == NULL || !PyUnicode_Check(v) || length < 0) {
PyErr_BadInternalCall();
return -1;
}
@@ -335,7 +335,7 @@
possible since these are being shared. We simply return a fresh
copy with the same Unicode content. */
if (v->length != length &&
- (v == unicode_empty || v->length == 1)) {
+ (v == unicode_empty || v->length == 1 || v->ob_refcnt != 1)) {
PyUnicodeObject *w = _PyUnicode_New(length);
if (w == NULL)
return -1; |
Logged In: YES I reverted PyUnicode_Resize() patch for now, and recreated
OK, I will. |
Logged In: YES Thanks for the patch. Committed as r46945. |
Logged In: YES Sorry, I reopened this issue because I found a problem. With the attached "mbcs.py" and "mbcs.txt", result file
Probably this is not true. I think "stateless" means codec # I hope the attached "fix.patch" fixes the problem. |
Logged In: YES
Sorry, I lied. Stateless decoder was exactly the thing
>>> import codecs
>>> d = codecs.getdecoder("cp932")
>>> d('\x82')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode byte 0x82 in position 0: incomplete multibyte sequence
The problem was in StreamReader. StreamReader should treat
class StreamReader(Codec,codecs.StreamReader):
    pass
so StreamReader wrongly treated "final" as True. I cloned the routine from Lib/encodings/utf_8.py. I hope |
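The StreamReader problem described above can be sketched in Python 3 as follows. This is an illustration under stated assumptions, not the committed fix: `cp932` stands in for `mbcs`, `partial_decode`, `NaiveReader`, and `BufferingReader` are hypothetical names, and `getstate()` on the multibyte incremental decoder requires Python 3.8 or later.

```python
import codecs
import io


def partial_decode(data, errors="strict", final=False):
    """Hypothetical helper: decode cp932 and report how many bytes were consumed."""
    decoder = codecs.getincrementaldecoder("cp932")(errors)
    text = decoder.decode(data, final)
    pending, _flags = decoder.getstate()  # bytes held back as an incomplete character
    return text, len(data) - len(pending)


class NaiveReader(codecs.StreamReader):
    # Stateless decode: every chunk is treated as if it were the last one.
    def decode(self, data, errors="strict"):
        return codecs.getdecoder("cp932")(data, errors)


class BufferingReader(codecs.StreamReader):
    # Never passes final=True, so a dangling lead byte stays in the byte buffer.
    def decode(self, data, errors="strict"):
        return partial_decode(data, errors, final=False)


payload = "あい".encode("cp932")  # 4 bytes; a 3-byte read splits the second character

text = BufferingReader(io.BytesIO(payload)).read(3)
print(repr(text))

try:
    NaiveReader(io.BytesIO(payload)).read(3)
    naive_failed = False
except UnicodeDecodeError:
    naive_failed = True
print("naive reader failed:", naive_failed)
```

`codecs.StreamReader.read` keeps any bytes that `decode` reports as unconsumed and prepends them to the next chunk, which is why returning an accurate consumed count is enough to make chunked reads safe.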
Logged In: YES Could this be related to bug report 1532726? |
Logged In: YES jwnmulder: there is definitely no relationship. |
Logged In: YES Thanks for the update. Committed as r51046. Please create |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.