Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream encoder for zlib_codec doesn't use the incremental encoder #58089

Open
amcnabb mannequin opened this issue Jan 26, 2012 · 13 comments
Open

Stream encoder for zlib_codec doesn't use the incremental encoder #58089

amcnabb mannequin opened this issue Jan 26, 2012 · 13 comments
Labels
topic-IO type-bug An unexpected behavior, bug, or error

Comments

@amcnabb
Copy link
Mannequin

amcnabb mannequin commented Jan 26, 2012

BPO 13881
Nosy @malemburg, @loewis, @doerwalter, @jcea, @vstinner, @vadmium
Dependencies
  • bpo-23231: Fix codecs.iterencode/decode() by allowing data parameter to be omitted
  • Files
  • zlib-bz2-writer.patch: Depends on Issue 23231
  • zlib-bz2-writer.v2.patch: Depends on Issue 23231
  • zlib-bz2-writer.v3.patch: Depends on Issue 23231
  • zlib-bz2-writer.v4.patch: Depends on Issue 23231
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2012-01-26.19:40:48.508>
    labels = ['type-bug', 'expert-IO']
    title = "Stream encoder for zlib_codec doesn't use the incremental encoder"
    updated_at = <Date 2015-07-16.02:08:14.288>
    user = 'https://bugs.python.org/amcnabb'

    bugs.python.org fields:

    activity = <Date 2015-07-16.02:08:14.288>
    actor = 'martin.panter'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['IO']
    creation = <Date 2012-01-26.19:40:48.508>
    creator = 'amcnabb'
    dependencies = ['23231']
    files = ['37702', '37710', '37719', '37847']
    hgrepos = []
    issue_num = 13881
    keywords = ['patch']
    message_count = 13.0
    messages = ['152029', '153142', '153358', '153372', '207853', '226680', '226681', '233983', '234008', '234048', '234067', '234101', '234653']
    nosy_count = 7.0
    nosy_names = ['lemburg', 'loewis', 'doerwalter', 'jcea', 'amcnabb', 'vstinner', 'martin.panter']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue13881'
    versions = ['Python 3.3', 'Python 3.5']

    @amcnabb
    Copy link
    Mannequin Author

    amcnabb mannequin commented Jan 26, 2012

    The stream encoder for the zlib_codec doesn't use the incremental encoder, so it has limited usefulness in practice. This is easiest to show with an example.

    Here is the behavior with the stream encoder:

    >>> filelike = io.BytesIO()
    >>> wrapped = codecs.getwriter('zlib_codec')(filelike)
    >>> wrapped.write(b'hello')
    >>> filelike.getvalue()
    b'x\x9c\xab\x00\x00\x00y\x00y'
    >>> wrapped.write(b'x')
    >>> filelike.getvalue()
    b'x\x9c\xab\x00\x00\x00y\x00yx\x9c\xab\x00\x00\x00y\x00y'
    >>>

    However, this is the behavior of the incremental encoder:

    >>> ienc = codecs.getincrementalencoder('zlib_codec')()
    >>> ienc.encode(b'x')
    b'x\x9c'
    >>> ienc.encode(b'x', final=True)
    b'\xab\xa8\x00\x00\x01j\x00\xf1'
    >>>

    The stream encoder is apparently encoding each write as an individual block, but the incremental encoder buffers until it gets a block that's large enough to be meaningfully compressed.

    Fixing this may require addressing a separate issue with stream encoders. Unlike with the GzipFile module, closing a stream encoder closes the underlying file. If this underlying file is a BytesIO, then closing makes it free its buffer, making it impossible to get at the completed file.

    @amcnabb amcnabb mannequin added topic-IO type-bug An unexpected behavior, bug, or error labels Jan 26, 2012
    @jcea
    Copy link
    Member

    jcea commented Feb 11, 2012

    Andrew, could you possibly write a patch and a test for 3.3?

    @amcnabb
    Copy link
    Mannequin Author

    amcnabb mannequin commented Feb 14, 2012

    It looks like encodings/zlib_codec.py defines a custom IncrementalEncoder and IncrementalDecoder, but its StreamWriter and StreamReader rely on the standard implementation of codecs.StreamWriter and codecs.StreamReader.

    One solution might be to have zlib_codec.StreamWriter inherit from zlib_codec.IncrementalEncoder instead of from zlib_encoder.Codec. I'm not familiar enough with the codecs library to know whether this is the best approach.

    Unfortunately, there are 120 codec files in the encodings directory, and it's unclear how many of them would need to be modified. Based on the number of them that implement StreamWriter as "class StreamWriter(Codec,codecs.StreamWriter)", it looks like it might be a lot of them. Was each of these 120 files hand-written?

    @vstinner
    Copy link
    Member

    See also issue bpo-7475.

    @doerwalter
    Copy link
    Contributor

    The stream part of the codecs isn't used that much in Python 3 any more, so I'm not sure if this is worth fixing.

    @vadmium
    Copy link
    Member

    vadmium commented Sep 10, 2014

    The corresponding stream reader has a related issue which I mentioned in bpo-20132

    @vadmium
    Copy link
    Member

    vadmium commented Sep 10, 2014

    Even if all these issues aren’t worth fixing, I think the documentation should at least say which codecs work fully (e.g. most text encodings), which ones work suboptimally (e.g. UTF-7), and which ones only work for single-shot encoding and decoding.

    @vadmium
    Copy link
    Member

    vadmium commented Jan 14, 2015

    See bpo-23231 for a proposal which should make the incremental codec API compatible with a generic StreamReader/Writer class.

    I discovered that many of the codec files are generated by gencodec.py, not hand-written. However when I tried regenerating them, I found a handful had diverged from the generated code, but others were more or less true to the generated code.

    @vadmium
    Copy link
    Member

    vadmium commented Jan 14, 2015

    Here is a patch to implement the zlib-codec and bz2-codec StreamWriter classes based on their IncrementalEncoder classes. It depends on my patch for bpo-23231, though I guess it could be tweaked to work around that if desired.

    @vadmium
    Copy link
    Member

    vadmium commented Jan 15, 2015

    New patch that also fixes StreamWriter.writelines() in general for the byte codecs

    @malemburg
    Copy link
    Member

    On 15.01.2015 05:43, Martin Panter wrote:

    New patch that also fixes StreamWriter.writelines() in general for the byte codecs

    Could you explain this new undocumented class ?

    +class _IncrementalBasedWriter(StreamWriter):
    + """Generic StreamWriter implementation.
    +
    + The _EncoderClass attribute must be set to an IncrementalEncoder
    + class to use.
    + """
    +
    + def __init__(self, stream, errors='strict'):
    + super().__init__(stream, errors)
    + self._encoder = self._Encoder(errors)
    +
    + def write(self, object):
    + self.stream.write(self._encoder.encode(object))
    +
    + def reset(self):
    + self.stream.write(self._encoder.encode(final=True))
    +

    Note that the doc-string mentions a non-existing attribute and there
    are doc-string missing for the other methods.

    The purpose appears to be a StreamWriter which works with
    an IncrementalEncoder. A proper name would thus be
    IncrementalStreamWriter which provides an .encode()
    method which adapts the signature of the incremental encoder
    to the one expected for StreamWriters and Codecs.

    @vadmium
    Copy link
    Member

    vadmium commented Jan 15, 2015

    Sorry, I changed the name of the attribute and forgot to update the doc string. Its new name was _Encoder.

    Your description was fairly accurate. I am adding patch v3, with an expanded the doc string. Hopefully that explains it a bit better. Since it is just implementing the documented StreamWriter API, I only added brief descriptions of the methods pointing back to the StreamWriter methods.

    @vadmium
    Copy link
    Member

    vadmium commented Jan 25, 2015

    Here is patch v4. The stream writer is now automatically generated by default by the CodecInfo constructor if no custom stream writer parameter is supplied. The base64 and quopri codecs are adjusted to also use this default stream writer to help with bpo-20132.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    topic-IO type-bug An unexpected behavior, bug, or error
    Projects
    Status: No status
    Development

    No branches or pull requests

    5 participants