Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cjkcodecs missing getstate and setstate implementations #77759

Closed
libcthorne mannequin opened this issue May 19, 2018 · 7 comments
Closed

cjkcodecs missing getstate and setstate implementations #77759

libcthorne mannequin opened this issue May 19, 2018 · 7 comments
Labels
3.8 only security fixes tests Tests in the Lib/test dir topic-IO type-feature A feature request or enhancement

Comments

@libcthorne
Copy link
Mannequin

libcthorne mannequin commented May 19, 2018

BPO 33578
Nosy @methane, @vadmium, @miss-islington, @libcthorne
PRs
  • bpo-33578: Add getstate/setstate for CJK codec #6984
  • bpo-33578: Fix getstate/setstate for CJK decoder #10290
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-11-06.07:05:53.693>
    created_at = <Date 2018-05-19.07:00:05.994>
    labels = ['3.8', 'type-feature', 'tests', 'expert-IO']
    title = 'cjkcodecs missing getstate and setstate implementations'
    updated_at = <Date 2018-11-06.07:05:53.692>
    user = 'https://github.com/libcthorne'

    bugs.python.org fields:

    activity = <Date 2018-11-06.07:05:53.692>
    actor = 'methane'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-11-06.07:05:53.693>
    closer = 'methane'
    components = ['Tests', 'IO']
    creation = <Date 2018-05-19.07:00:05.994>
    creator = 'libcthorne'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 33578
    keywords = ['patch']
    message_count = 7.0
    messages = ['317106', '318726', '318742', '318801', '318987', '329053', '329103']
    nosy_count = 4.0
    nosy_names = ['methane', 'martin.panter', 'miss-islington', 'libcthorne']
    pr_nums = ['6984', '10290']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue33578'
    versions = ['Python 3.8']

    @libcthorne
    Copy link
    Mannequin Author

    libcthorne mannequin commented May 19, 2018

    The cjkcodecs module provides support for various Chinese/Japanese/Korean codecs through its MultibyteIncrementalDecoder and MultibyteIncrementalEncoder classes.

    While working with one of these cjkcodecs (euc-jp), I came across an issue where calling TextIOWrapper.tell() was causing a decode error on a valid euc-jp file:

    >>> with open("/tmp/test", "w", encoding="euc-jp") as f:
    ...     f.write("AB\nうえ\n")
    ...
    6
    >>> with open("/tmp/test", "r", encoding="euc-jp") as f:
    ...     f.readline()
    ...     f.tell()
    ...     f.readline()
    ...     f.readline()
    ...
    'AB\n'
    4
    'うえ\n'
    Traceback (most recent call last):
      File "<stdin>", line 5, in <module>
    UnicodeDecodeError: 'euc_jp' codec can't decode byte 0xa4 in position 0: incomplete multibyte sequence

    Without the tell(), there is no problem:

    >>> with open("/tmp/test", "w", encoding="euc-jp") as f:
    ...     f.write("AB\nうえ\n")
    ...
    6
    >>> with open("/tmp/test", "r", encoding="euc-jp") as f:
    ...     f.readline()
    ...     f.readline()
    ...     f.readline()
    ...
    'AB\n'
    'うえ\n'
    ''

    After looking at _io_TextIOWrapper_tell_impl in textio.c, I understood that tell() is not as trivial as one might expect as it is using a "reconstruction algorithm" [1] to account for buffered chunk reads. By default, TextIOWrapper reads from its underlying stream in chunks of 8092 bytes and then decodes and buffers this data for speed efficiency. As a result, TextIOWrapper.tell() can't just call tell() on the underlying stream because the stream's tell() will usually be too far ahead. Thus, a reconstruction algorithm is used to calculate how many bytes of the buffered chunk have actually been read so far by the user.

    The reconstruction algorithm could be trivial for single byte codecs - just track the number of characters read so far, e.g. each time read() is called. However, for multibyte codecs where the number of bytes representing a character is not uniform nor reported by the decoder, the character count alone isn't sufficient. What CPython does for this is jump to a "snapshot" point where the decoder is in a clean state (i.e. not halfway through a multibyte read) and feeds it bytes to generate characters, counting as it goes, until the number of characters it tracked from user reads are generated. From this, it knows the number of bytes it took to reach the point the user is at and can calculate the correct tell() value.

    Now this is where the problem comes in: the reconstruction algorithm depends on a reliable way to detect when the decoder is in a "clean state". The getstate method [2], which returns any pending data the decoder has, is used for this. However, in the case of cjkcodecs, the getstate implementation is missing. This can be seen in the following example:

    >>> import codecs
    >>> decoder = codecs.getincrementaldecoder('euc_jp')()
    ... decoder.decode(b'\xa4\xa6') # Decode a complete input sequence
    'う'
    >>> decoder.getstate() # There should be no state here (working)
    (b'', 0)
    >>> decoder.decode(b'\xa4') # Decode first half of a partial input sequence
    ''
    >>> decoder.getstate() # There should be state here (not working)
    (b'', 0)

    Internally, however, the cjkcodecs do hold state. They just don't expose it.

    This leads to unexpected results in the reconstruction algorithm, such as the tell() bug demonstrated above.

    It appears decoder.setstate [3], encoder.getstate [4], encoder.setstate [5], are also not implemented for the cjkcodecs. This leads to other issues in code that assumes their implementaton is present (e.g. TextIOWrapper.seek).

    I will attach a PR shortly with some tests and proposed implementations. This is my first time working with the CPython codebase, so apologies if I've overlooked anything.

    Finally, here is a somewhat related issue with a mention of the same tell() bug at the end: https://bugs.python.org/issue25863
    However, the main problem identified in that report requires further changes than just adding getstate as it's caused by the more fundamental issue of encoder and decoder state not being kept in sync.
    (Also, I have added the reporter of that issue to the nosy list for this one as I assume they have some familiarity with this bug)

    [1] https://docs.python.org/3/library/io.html#id3
    [2] https://docs.python.org/3/library/codecs.html#codecs.IncrementalDecoder.getstate
    [3] https://docs.python.org/3/library/codecs.html#codecs.IncrementalDecoder.setstate
    [4] https://docs.python.org/3/library/codecs.html#codecs.IncrementalEncoder.getstate
    [5] https://docs.python.org/3/library/codecs.html#codecs.IncrementalEncoder.setstate

    @libcthorne libcthorne mannequin added type-bug An unexpected behavior, bug, or error 3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir topic-IO labels May 19, 2018
    @methane
    Copy link
    Member

    methane commented Jun 5, 2018

    ISO-2022-* is "stateful" encoding. It uses escape sequence to change state.
    So keeping only pending is not enough.

    >>> enc.encode("abcあいう")
    b'abc\x1b$B$"$$$&'
    >>> enc.getstate()
    0
    >>> enc.encode("abc")
    b'\x1b(Babc'
    
    >>> enc.encode("abcあいう")
    b'abc\x1b$B$"$$$&'
    >>> enc.getstate()
    0
    >>> enc.setstate(0)
    >>> enc.encode("abc")
    b'abc'

    I don't know much about other encodings.

    @libcthorne
    Copy link
    Mannequin Author

    libcthorne mannequin commented Jun 5, 2018

    Ah, good find. I suppose that means MultibyteCodec_State and pending are both needed to fully capture state, as is done in decoder.getstate/setstate by returning a tuple of both. Unfortunately encoder.getstate is defined to return an integer, and because MultibyteCodec_State can occupy 8 bytes, and pending can occupy 2 bytes (MAXENCPENDING), we get a total of 10 bytes which I think exceeds what a PyLong can represent.

    Returning either pending or MultibyteCodec_State seems infeasible because setstate will not know how to process it, and both may be needed together.

    Some alternatives could be:

    1. If we are restricted to returning an integer, perhaps this integer could be an index that references a state in a pool of encoder states stored internally (effectively a pointer). Managing this state pool seems quite complex.

    2. encoder.getstate could be redefined to return a tuple, but obviously this is a breaking change. Backwards compatibility could be somewhat preserved by allowing setstate to accept either an integer or tuple.

    3. Remove PyObject *pending from MultibyteStatefulEncoderContext and change encoders to only use MultibyteCodec_State. Not sure how feasible this is.

    I think approach 2 would be simplest and matches the decoder interface.

    Does anyone have any opinions or further alternatives?

    @methane
    Copy link
    Member

    methane commented Jun 6, 2018

    MultibyteCodec_State can occupy 8 bytes, and pending can occupy 2 bytes (MAXENCPENDING), we get a total of 10 bytes which I think exceeds what a PyLong can represent.

    PyLong is "long integer", aka "big integer", not C's long type.
    https://docs.python.org/3.6/c-api/long.html

    You can encode 12 bytes into single long object.

    @libcthorne
    Copy link
    Mannequin Author

    libcthorne mannequin commented Jun 7, 2018

    Thanks Naoki, that simplifies things a lot.

    I've updated the PR with a new test case for ISO-2022-JP and a corresponding implementation for the encoder state methods.

    @methane methane added type-feature A feature request or enhancement and removed 3.7 (EOL) end of life type-bug An unexpected behavior, bug, or error labels Jun 11, 2018
    @miss-islington
    Copy link
    Contributor

    New changeset ac22f6a by Miss Islington (bot) (Christopher Thorne) in branch 'master':
    bpo-33578: Add getstate/setstate for CJK codec (GH-6984)
    ac22f6a

    @methane
    Copy link
    Member

    methane commented Nov 2, 2018

    New changeset 488c0a6 by INADA Naoki (Christopher Thorne) in branch 'master':
    bpo-33578: Fix getstate/setstate for CJK decoder (GH-10290)
    488c0a6

    @methane methane closed this as completed Nov 6, 2018
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes tests Tests in the Lib/test dir topic-IO type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants