Message 317106 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	libcthorne
Recipients	libcthorne, martin.panter
Date	2018-05-19.07:00:05
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1526713206.06.0.682650639539.issue33578@psf.upfronthosting.co.za>
In-reply-to

Content
The cjkcodecs module provides support for various Chinese/Japanese/Korean codecs through its MultibyteIncrementalDecoder and MultibyteIncrementalEncoder classes. While working with one of these cjkcodecs (euc-jp), I came across an issue where calling TextIOWrapper.tell() was causing a decode error on a valid euc-jp file: >>> with open("/tmp/test", "w", encoding="euc-jp") as f: ... f.write("AB\nうえ\n") ... 6 >>> with open("/tmp/test", "r", encoding="euc-jp") as f: ... f.readline() ... f.tell() ... f.readline() ... f.readline() ... 'AB\n' 4 'うえ\n' Traceback (most recent call last): File "<stdin>", line 5, in <module> UnicodeDecodeError: 'euc_jp' codec can't decode byte 0xa4 in position 0: incomplete multibyte sequence Without the tell(), there is no problem: >>> with open("/tmp/test", "w", encoding="euc-jp") as f: ... f.write("AB\nうえ\n") ... 6 >>> with open("/tmp/test", "r", encoding="euc-jp") as f: ... f.readline() ... f.readline() ... f.readline() ... 'AB\n' 'うえ\n' '' After looking at _io_TextIOWrapper_tell_impl in textio.c, I understood that tell() is not as trivial as one might expect as it is using a "reconstruction algorithm" [1] to account for buffered chunk reads. By default, TextIOWrapper reads from its underlying stream in chunks of 8092 bytes and then decodes and buffers this data for speed efficiency. As a result, TextIOWrapper.tell() can't just call tell() on the underlying stream because the stream's tell() will usually be too far ahead. Thus, a reconstruction algorithm is used to calculate how many bytes of the buffered chunk have actually been read so far by the user. The reconstruction algorithm could be trivial for single byte codecs - just track the number of characters read so far, e.g. each time read() is called. However, for multibyte codecs where the number of bytes representing a character is not uniform nor reported by the decoder, the character count alone isn't sufficient. What CPython does for this is jump to a "snapshot" point where the decoder is in a clean state (i.e. not halfway through a multibyte read) and feeds it bytes to generate characters, counting as it goes, until the number of characters it tracked from user reads are generated. From this, it knows the number of bytes it took to reach the point the user is at and can calculate the correct tell() value. Now this is where the problem comes in: the reconstruction algorithm depends on a reliable way to detect when the decoder is in a "clean state". The getstate method [2], which returns any pending data the decoder has, is used for this. However, in the case of cjkcodecs, the getstate implementation is missing. This can be seen in the following example: >>> import codecs >>> decoder = codecs.getincrementaldecoder('euc_jp')() ... decoder.decode(b'\xa4\xa6') # Decode a complete input sequence 'う' >>> decoder.getstate() # There should be no state here (working) (b'', 0) >>> decoder.decode(b'\xa4') # Decode first half of a partial input sequence '' >>> decoder.getstate() # There should be state here (not working) (b'', 0) Internally, however, the cjkcodecs do hold state. They just don't expose it. This leads to unexpected results in the reconstruction algorithm, such as the tell() bug demonstrated above. It appears decoder.setstate [3], encoder.getstate [4], encoder.setstate [5], are also not implemented for the cjkcodecs. This leads to other issues in code that assumes their implementaton is present (e.g. TextIOWrapper.seek). I will attach a PR shortly with some tests and proposed implementations. This is my first time working with the CPython codebase, so apologies if I've overlooked anything. Finally, here is a somewhat related issue with a mention of the same tell() bug at the end: https://bugs.python.org/issue25863 However, the main problem identified in that report requires further changes than just adding getstate as it's caused by the more fundamental issue of encoder and decoder state not being kept in sync. (Also, I have added the reporter of that issue to the nosy list for this one as I assume they have some familiarity with this bug) [1] https://docs.python.org/3/library/io.html#id3 [2] https://docs.python.org/3/library/codecs.html#codecs.IncrementalDecoder.getstate [3] https://docs.python.org/3/library/codecs.html#codecs.IncrementalDecoder.setstate [4] https://docs.python.org/3/library/codecs.html#codecs.IncrementalEncoder.getstate [5] https://docs.python.org/3/library/codecs.html#codecs.IncrementalEncoder.setstate

The cjkcodecs module provides support for various Chinese/Japanese/Korean codecs through its MultibyteIncrementalDecoder and MultibyteIncrementalEncoder classes.

While working with one of these cjkcodecs (euc-jp), I came across an issue where calling TextIOWrapper.tell() was causing a decode error on a valid euc-jp file:

>>> with open("/tmp/test", "w", encoding="euc-jp") as f:
...     f.write("AB\nうえ\n")
...
6
>>> with open("/tmp/test", "r", encoding="euc-jp") as f:
...     f.readline()
...     f.tell()
...     f.readline()
...     f.readline()
...
'AB\n'
4
'うえ\n'
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
UnicodeDecodeError: 'euc_jp' codec can't decode byte 0xa4 in position 0: incomplete multibyte sequence

Without the tell(), there is no problem:

>>> with open("/tmp/test", "w", encoding="euc-jp") as f:
...     f.write("AB\nうえ\n")
...
6
>>> with open("/tmp/test", "r", encoding="euc-jp") as f:
...     f.readline()
...     f.readline()
...     f.readline()
...
'AB\n'
'うえ\n'
''

After looking at _io_TextIOWrapper_tell_impl in textio.c, I understood that tell() is not as trivial as one might expect as it is using a "reconstruction algorithm" [1] to account for buffered chunk reads. By default, TextIOWrapper reads from its underlying stream in chunks of 8092 bytes and then decodes and buffers this data for speed efficiency. As a result, TextIOWrapper.tell() can't just call tell() on the underlying stream because the stream's tell() will usually be too far ahead. Thus, a reconstruction algorithm is used to calculate how many bytes of the buffered chunk have actually been read so far by the user.

The reconstruction algorithm could be trivial for single byte codecs - just track the number of characters read so far, e.g. each time read() is called. However, for multibyte codecs where the number of bytes representing a character is not uniform nor reported by the decoder, the character count alone isn't sufficient. What CPython does for this is jump to a "snapshot" point where the decoder is in a clean state (i.e. not halfway through a multibyte read) and feeds it bytes to generate characters, counting as it goes, until the number of characters it tracked from user reads are generated. From this, it knows the number of bytes it took to reach the point the user is at and can calculate the correct tell() value.

Now this is where the problem comes in: the reconstruction algorithm depends on a reliable way to detect when the decoder is in a "clean state". The getstate method [2], which returns any pending data the decoder has, is used for this. However, in the case of cjkcodecs, the getstate implementation is missing. This can be seen in the following example:

>>> import codecs
>>> decoder = codecs.getincrementaldecoder('euc_jp')()
... decoder.decode(b'\xa4\xa6') # Decode a complete input sequence
'う'
>>> decoder.getstate() # There should be no state here (working)
(b'', 0)
>>> decoder.decode(b'\xa4') # Decode first half of a partial input sequence
''
>>> decoder.getstate() # There should be state here (not working)
(b'', 0)

Internally, however, the cjkcodecs do hold state. They just don't expose it.

This leads to unexpected results in the reconstruction algorithm, such as the tell() bug demonstrated above.

It appears decoder.setstate [3], encoder.getstate [4], encoder.setstate [5], are also not implemented for the cjkcodecs. This leads to other issues in code that assumes their implementaton is present (e.g. TextIOWrapper.seek).

I will attach a PR shortly with some tests and proposed implementations. This is my first time working with the CPython codebase, so apologies if I've overlooked anything.

Finally, here is a somewhat related issue with a mention of the same tell() bug at the end: https://bugs.python.org/issue25863
However, the main problem identified in that report requires further changes than just adding getstate as it's caused by the more fundamental issue of encoder and decoder state not being kept in sync.
(Also, I have added the reporter of that issue to the nosy list for this one as I assume they have some familiarity with this bug)

[1] https://docs.python.org/3/library/io.html#id3
[2] https://docs.python.org/3/library/codecs.html#codecs.IncrementalDecoder.getstate
[3] https://docs.python.org/3/library/codecs.html#codecs.IncrementalDecoder.setstate
[4] https://docs.python.org/3/library/codecs.html#codecs.IncrementalEncoder.getstate
[5] https://docs.python.org/3/library/codecs.html#codecs.IncrementalEncoder.setstate

History
Date	User	Action	Args
2018-05-19 07:00:06	libcthorne	set	recipients: + libcthorne, martin.panter
2018-05-19 07:00:06	libcthorne	set	messageid: <1526713206.06.0.682650639539.issue33578@psf.upfronthosting.co.za>
2018-05-19 07:00:05	libcthorne	link	issue33578 messages
2018-05-19 07:00:05	libcthorne	create