This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Crash in the reset() method of StreamWriter of CJK codecs
Type: crash Stage: resolved
Components: Unicode Versions: Python 3.6, Python 3.4, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: vstinner Nosy List: Aaron1011, ezio.melotti, martin.panter, python-dev, serhiy.storchaka, vstinner
Priority: high Keywords: patch

Created on 2015-01-16 06:17 by martin.panter, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
cjk_codecs_reset.patch vstinner, 2015-07-15 23:00 review
fix-multibytecodec-segfault.patch Aaron1011, 2015-07-16 03:29 review
fix-multibytecodec-segfault-with-test.patch Aaron1011, 2015-07-16 16:47 review
Messages (13)
msg234112 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-16 06:17
$ python3 -c 'import codecs; from io import BytesIO; codecs.getwriter("big5")(BytesIO()).reset()'
Segmentation fault (core dumped)
[Exit 139]

Happens for all the multibyte codecs:
broken_stream_codecs = {
    "big5", "big5hkscs", "cp932", "cp949", "cp950",
    "euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr",
    "gb2312", "gbk", "gb18030", "hz",
    "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2", "iso2022_jp_2004",
    "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr",
    "johab", "shift_jis", "shift_jis_2004", "shift_jisx0213",
}

These codecs also share the property that their StreamReader.read() methods do not accept the second “chars” parameter:

>>> codecs.getreader("big5")(BytesIO()).read(1, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: read expected at most 1 arguments, got 2
msg246781 - (view) Author: Aaron Hill (Aaron1011) * Date: 2015-07-15 22:27
This is also present in the latest Python 3.6.

I'm going to work on providing a patch for this, unless someone else already is
msg246782 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-07-15 23:00
> Happens for all the multibyte codecs: (...)

All these codecs share the same C implementation: the "CJK" codecs.

The bug was introduced in Python 3.4 by my huge changeset d621bdaed7c3: Issue "#17693: CJK encoders now use the new Unicode API (PEP 393)".

It looks like there was no test for this specific bug :-/ Calling reset() just after creating a StreamReader object.

Attached patch should fix the crash.

@Aaron1011: Can you please try to write a patch adding a unit test to test_codecs reproducing the crash?
msg246783 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-07-15 23:02
> It looks like there was no test for this specific bug :-/ Calling reset() just after creating a StreamReader object.

Oops: StreamWriter, not StreamReader.
msg246784 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-07-16 00:56
Perhaps this test case proposed in my patch for Issue 13881 could be useful:

+def test_writer_reuse(self):
+    """StreamWriter should be reusable after reset"""

Looks like that is where my “broken_stream_codecs” list from the original post came from.
msg246785 - (view) Author: Aaron Hill (Aaron1011) * Date: 2015-07-16 02:57
The included patch fixes the issue, and modifies the existing unittest to prevent a future regression.

The patch corrects an issue where the 'pending' struct field was NULL, but was used as the input to multibytecodec_encode anyay.
msg246786 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-07-16 03:27
cjk_codecs_reset.patch LGTM.
msg246787 - (view) Author: Aaron Hill (Aaron1011) * Date: 2015-07-16 03:29
The patch didn't get attached for some reason. It's attached now.
msg246801 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-07-16 12:57
Aaron: Your version of the fix immediately returns None, while Victor’s tries to encode an empty string (if I understand it correctly). I imagine this shortcut could be slightly more efficient, but is it always correct?

In any case, Aaron’s test looks okay to me.
msg246802 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-07-16 13:01
> Aaron: Your version of the fix immediately returns None, while Victor’s tries to encode an empty string (if I understand it correctly).

Aaron patch is better. I misunderstood the code. reset() always return None, the empty string is unused in my patch.

fix-multibytecodec-segfault.patch: please write a _new_ test mentionning this issue to check for non regression.
msg246810 - (view) Author: Aaron Hill (Aaron1011) * Date: 2015-07-16 16:47
I've added a test case to exercise reset()
msg246818 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-07-16 20:22
New changeset 35a6fe0e2b27 by Victor Stinner in branch '3.4':
Closes #23247: Fix a crash in the StreamWriter.reset() of CJK codecs
https://hg.python.org/cpython/rev/35a6fe0e2b27
msg246819 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-07-16 20:25
@Aaron Hill: Thanks for your patch! I only kept the minimal test of your patch. If you want to enhance the test suite, you may write new test to test the behaviour of reset(). I prefer to only commit the minimum patch.

@Martin: Thanks for your report. The issue is now fixed.
History
Date User Action Args
2022-04-11 14:58:11adminsetgithub: 67436
2015-07-16 20:25:47vstinnersetmessages: + msg246819
2015-07-16 20:22:58python-devsetstatus: open -> closed

nosy: + python-dev
messages: + msg246818

resolution: fixed
stage: patch review -> resolved
2015-07-16 16:47:36Aaron1011setfiles: + fix-multibytecodec-segfault-with-test.patch

messages: + msg246810
2015-07-16 13:01:32vstinnersetmessages: + msg246802
2015-07-16 12:57:06martin.pantersetmessages: + msg246801
stage: test needed -> patch review
2015-07-16 03:29:46Aaron1011setfiles: + fix-multibytecodec-segfault.patch

messages: + msg246787
2015-07-16 03:27:43serhiy.storchakasetassignee: serhiy.storchaka -> vstinner
messages: + msg246786
stage: test needed
2015-07-16 02:57:43Aaron1011setmessages: + msg246785
2015-07-16 00:56:58martin.pantersetmessages: + msg246784
2015-07-15 23:02:01vstinnersetmessages: + msg246783
2015-07-15 23:00:54vstinnersetfiles: + cjk_codecs_reset.patch
versions: + Python 3.5
title: Multibyte codec StreamWriter.reset() crashes -> Crash in the reset() method of StreamWriter of CJK codecs
messages: + msg246782

keywords: + patch
2015-07-15 22:27:47Aaron1011setnosy: + Aaron1011

messages: + msg246781
versions: + Python 3.6
2015-02-28 13:34:48serhiy.storchakasetpriority: normal -> high
assignee: serhiy.storchaka

nosy: + serhiy.storchaka
2015-01-16 06:17:50martin.pantercreate