classification
Title: ISO-2022 seeking forgets state
Type: behavior
Stage: test needed
Components: Extension Modules, IO, Unicode
Versions: Python 3.6, Python 3.5, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, johnwalker, martin.panter, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2015-12-15 02:37 by martin.panter, last changed 2016-01-20 02:09 by martin.panter.

Files
File name Uploaded Description
25863-unittest.patch johnwalker, 2015-12-30 05:27
Messages (4)
msg256431 - Author: Martin Panter (martin.panter) (Python committer) Date: 2015-12-15 02:37
>>> from io import *
>>> text = TextIOWrapper(BytesIO(), "iso-2022-jp")
>>> text.write(u"P")
1
>>> text.tell()
1
>>> text.write(u"anter 正")
7
>>> text.tell()
12
>>> text.write(u"孝")
1
>>> text.seek(12)
12
>>> text.read()  # Should return 孝, not ASCII
"9'"
>>> text.buffer.getvalue()
b"Panter \x1b$B@59'"
>>> text.seek(1)
1
>>> text.read(7)
'anter 正'
>>> text.tell()  # Another bug?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position 2-3: illegal multibyte sequence
msg257144 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2015-12-28 22:43
I confirmed the problem on default (3.6) and verified that it works as expected using utf-8 instead of iso-2022-jp.
The code in the above message should be converted into a unittest, and related codecs should be checked as well.
The problem is probably in Modules/cjkcodecs/_codecs_iso2022.c
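As a rough illustration of how the session above could be reduced to a reusable check (this is not the attached patch, and the helper name read_back_after_seek is made up for this sketch), the same write/tell/write/seek/read sequence can be parameterized by encoding. It uses the cookie returned by tell() rather than a hard-coded offset, since the offset differs between encodings:

```python
import io


def read_back_after_seek(encoding):
    """Write text, remember tell(), write more, then seek back and read.

    Hypothetical helper for illustration only.
    """
    text = io.TextIOWrapper(io.BytesIO(), encoding)
    text.write("Panter 正")
    position = text.tell()   # opaque cookie; not always a plain byte offset
    text.write("孝")
    text.seek(position)      # return to just before the last character
    return text.read()       # ideally "孝" for every encoding
```

With "utf-8" this returns '孝', matching the observation above. For the iso-2022 family the result can be compared against the same expected value to see whether the shift state survived the seek.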
msg257227 - Author: John Walker (johnwalker) Date: 2015-12-30 05:27
Here is Martin's message as a unit test. It checks utf-8 and the iso-2022 family except iso-2022-cn and iso-2022-cn-ext because they are not supported. The errors occur with all iso-2022 charsets.
msg258635 - Author: Martin Panter (martin.panter) (Python committer) Date: 2016-01-20 02:09
After thinking about Issue 26158, I realize the seek() magic numbers don’t store any _encoder_ state, only _decoder_ state. That would explain the first bug (write, seek, then read). Though for this codec I suspect the decoder state is not recorded either, hence the bug with tell().
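To make the first bug concrete, here is a small sketch using the bytes from text.buffer.getvalue() in the original report. It only shows that the bytes at offset 12 are ambiguous without the shift state; it does not touch TextIOWrapper at all. The constant names are chosen for this example:

```python
# Bytes taken from text.buffer.getvalue() in the report.
data = b"Panter \x1b$B@59'"

ESC_JISX0208 = b"\x1b$B"   # escape sequence: switch to two-byte JIS X 0208 mode
ESC_ASCII = b"\x1b(B"      # escape sequence: switch back to ASCII

tail = data[12:]           # what a plain byte seek to offset 12 sees

# Decoded from the initial (ASCII) state, the two bytes are plain ASCII.
as_ascii = tail.decode("iso-2022-jp")

# Decoded with the shift state restored, the same bytes form one kanji.
as_kanji = (ESC_JISX0208 + tail + ESC_ASCII).decode("iso-2022-jp")

print(repr(as_ascii), repr(as_kanji))
```

The first result is the "9'" seen in the report and the second is 孝, so a seek cookie that carries only a byte offset cannot tell the two cases apart.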

Personally I don’t care much for seeking text files. But if someone wanted to fix the second bug, that might require fixing the incremental decoder’s getstate() implementation.
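A quick way to probe whether the incremental decoder's getstate() records the shift state is to hand its result to a fresh decoder and continue decoding. This is only a sketch of such a probe, not a fix, and the outcome is expected to depend on the Python version in use:

```python
import codecs

Decoder = codecs.getincrementaldecoder("iso-2022-jp")

first = Decoder()
prefix = first.decode(b"Panter \x1b$B@5")  # leaves the decoder in JIS X 0208 mode
state = first.getstate()                    # what tell() would have to record

second = Decoder()
second.setstate(state)                      # what seek() would have to restore
resumed = second.decode(b"9'")              # the bytes that encode 孝 in that mode

print(repr(prefix), state, repr(resumed))
```

If the state round-trips, resumed is '孝'. If getstate() does not capture the shift state, the fresh decoder starts in ASCII mode and resumed is "9'", which is the same symptom as in the original report.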
History
Date User Action Args
2016-01-20 02:09:55 martin.panter set messages: + msg258635
2015-12-30 05:27:05 johnwalker set files: + 25863-unittest.patch; keywords: + patch; messages: + msg257227
2015-12-30 00:11:10 johnwalker set nosy: + johnwalker
2015-12-28 22:43:44 ezio.melotti set messages: + msg257144; components: + Extension Modules; stage: needs patch -> test needed
2015-12-18 18:34:21 serhiy.storchaka set nosy: + serhiy.storchaka; stage: needs patch
2015-12-15 02:37:24 martin.panter create