Fix codecs.iterencode/decode() by allowing data parameter to be omitted #67420

Closed
vadmium opened this issue Jan 13, 2015 · 8 comments
Labels: 3.7 (EOL) (end of life), docs (Documentation in the Doc dir), topic-unicode, type-bug (An unexpected behavior, bug, or error)

Comments

@vadmium
Member

vadmium commented Jan 13, 2015

BPO 23231
Nosy @malemburg, @doerwalter, @vstinner, @ezio-melotti, @bitdancer, @vadmium, @serhiy-storchaka
Files
  • final-no-object.patch
  • final-no-object.ignore-space.diff: diff --ignore-all-space
  • iter-unsupported.patch
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/vadmium'
    closed_at = <Date 2016-10-15.01:37:36.979>
    created_at = <Date 2015-01-13.12:48:19.265>
    labels = ['type-bug', '3.7', 'expert-unicode', 'docs']
    title = 'Fix codecs.iterencode/decode() by allowing data parameter to be omitted'
    updated_at = <Date 2016-10-15.01:37:36.977>
    user = 'https://github.com/vadmium'

    bugs.python.org fields:

    activity = <Date 2016-10-15.01:37:36.977>
    actor = 'martin.panter'
    assignee = 'martin.panter'
    closed = True
    closed_date = <Date 2016-10-15.01:37:36.979>
    closer = 'martin.panter'
    components = ['Documentation', 'Unicode']
    creation = <Date 2015-01-13.12:48:19.265>
    creator = 'martin.panter'
    dependencies = []
    files = ['37691', '37692', '44164']
    hgrepos = []
    issue_num = 23231
    keywords = ['patch']
    message_count = 8.0
    messages = ['233932', '233933', '234206', '256746', '273101', '273198', '273203', '278678']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'doerwalter', 'vstinner', 'ezio.melotti', 'r.david.murray', 'python-dev', 'martin.panter', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue23231'
    versions = ['Python 3.5', 'Python 3.6', 'Python 3.7']

    @vadmium
    Member Author

    vadmium commented Jan 13, 2015

    As mentioned in bpo-20132, iterencode() and iterdecode() only work on text-to-byte codecs, because they assume particular data types when finalizing the incremental codecs. This patch changes the signature of the IncrementalEncoder and IncrementalDecoder methods from

    IncrementalEncoder.encode(object[, final])
    IncrementalDecoder.decode(object[, final])

    to

    IncrementalEncoder.encode([object,] [final])
    IncrementalDecoder.decode([object,] [final])

    so that iterencode() and iterdecode(), and perhaps in the future StreamWriter and StreamReader, can drive the incremental codec without knowing what kind of data it processes.
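A minimal sketch of the underlying problem, using only the existing incremental codec API: the finalizing call needs an empty object of the right type, so the caller has to know in advance whether the codec takes str or bytes. The codec names and chunk values here are illustrative choices, and `finalize_with_str` is a hypothetical helper, not part of the patch.

```python
import codecs

# Text-to-bytes codec: finalizing with an empty str works.
utf8_encoder = codecs.getincrementalencoder("utf-8")()
utf8_chunks = [utf8_encoder.encode("abc"), utf8_encoder.encode("", True)]

# Bytes-to-bytes codec: the empty value has to be bytes instead.
b64_encoder = codecs.getincrementalencoder("base64")()
b64_chunk = b64_encoder.encode(b"abc")
b64_tail = b64_encoder.encode(b"", True)


def finalize_with_str(encoding):
    """Return True if finalizing with "" raises TypeError.

    Hypothetical helper: it mimics the empty-str finalization that
    iterencode() performs, without feeding any real data first.
    """
    encoder = codecs.getincrementalencoder(encoding)()
    try:
        encoder.encode("", True)
    except TypeError:
        return True
    return False
```

With an optional `object` argument, as proposed, the final call would not need any typed empty value at all.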

    @vadmium added the stdlib (Python modules in the Lib dir), topic-unicode and type-bug (An unexpected behavior, bug, or error) labels Jan 13, 2015
    @vadmium
    Member Author

    vadmium commented Jan 13, 2015

    The original patch has lots of whitespace changes, probably because the generated codec code has not been regenerated for a long time. This diff ignores the whitespace changes, so it should be easier to review.

    @vadmium
    Member Author

    vadmium commented Jan 18, 2015

    Another idea that doesn’t involve changing the incremental codec APIs is kind of described in <https://bugs.python.org/issue7475#msg145986>: to add format parameters to iterencode() and iterdecode(), which would allow them to determine the right data type to finalize the codecs with.
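One way to picture this idea, as a rough sketch only: the caller tells the helper which empty value to finalize with. The linked message may describe a different parameter shape; `iterencode_typed` and its `empty` parameter are hypothetical names and not an existing or proposed codecs API.

```python
import codecs


def iterencode_typed(iterator, encoding, empty, errors="strict"):
    """Hypothetical variant of codecs.iterencode().

    `empty` is the empty object ("" or b"") used for the finalizing call,
    so the helper does not have to guess the codec's input type.
    """
    encoder = codecs.getincrementalencoder(encoding)(errors)
    for chunk in iterator:
        output = encoder.encode(chunk)
        if output:
            yield output
    # Finalize with the caller-supplied empty value.
    output = encoder.encode(empty, True)
    if output:
        yield output


base64_chunks = list(iterencode_typed([b"abc", b"def"], "base64", b""))
bom_only = list(iterencode_typed([], "utf-8-sig", ""))
```

Because the finalizing call always happens, a codec that emits output for empty input still gets the chance to do so.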

    @serhiy-storchaka serhiy-storchaka self-assigned this Feb 28, 2015
    @serhiy-storchaka
    Member

    The patch changes a public interface. This breaks compatibility with third-party codecs that implement it.

    We can find other solutions to the iterencode/iterdecode problem. For example, we can buffer the iterated values and encode with a one-step delay:

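        import codecs

        # Hypothetical function name, wrapping the loop so it can run.
        def iterencode_delayed(iterator, encoding, errors='strict'):
            encoder = codecs.getincrementalencoder(encoding)(errors)
            # Hold each item back one step so that the last one can be
            # encoded with final=True.
            prev = sentinel = object()
            for input in iterator:
                if prev is not sentinel:
                    output = encoder.encode(prev)
                    if output:
                        yield output
                prev = input
            if prev is not sentinel:
                output = encoder.encode(prev, True)
                if output:
                    yield output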

    Or remember the previous value and use it to calculate the empty value at the end (works only if input type supports slicing):

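        import codecs

        # Hypothetical function name, wrapping the loop so it can run.
        def iterencode_sliced(iterator, encoding, errors='strict'):
            encoder = codecs.getincrementalencoder(encoding)(errors)
            prev = sentinel = object()
            for input in iterator:
                output = encoder.encode(input)
                if output:
                    yield output
                prev = input
            if prev is not sentinel:
                # prev[:0] is an empty object of the same type as the input.
                output = encoder.encode(prev[:0], True)
                if output:
                    yield output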

    @vadmium
    Member Author

    vadmium commented Aug 19, 2016

    Serhiy’s two proposals won’t work for codecs that produce non-empty output for empty input:

    >>> from codecs import encode, iterencode
    >>> tuple(iterencode((), "utf-8-sig"))
    (b'\xef\xbb\xbf',)
    >>> encode(b"", "uu")
    b'begin 666 <data>\n \nend\n'
    >>> encode(b"", "zlib")
    b'x\x9c\x03\x00\x00\x00\x00\x01'
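To make the failure concrete, here is a sketch that restates the one-step-delay loop as a function and compares it with codecs.iterencode() on an empty iterator. `iterencode_delayed` is a hypothetical name; with no items, the loop never reaches a finalizing call, so the BOM is lost.

```python
import codecs


def iterencode_delayed(chunks, encoding, errors="strict"):
    """One-step-delay sketch: the last chunk is encoded with final=True."""
    encoder = codecs.getincrementalencoder(encoding)(errors)
    prev = sentinel = object()
    for chunk in chunks:
        if prev is not sentinel:
            output = encoder.encode(prev)
            if output:
                yield output
        prev = chunk
    # With an empty iterator, prev is still the sentinel here, so the
    # encoder is never finalized and emits nothing.
    if prev is not sentinel:
        output = encoder.encode(prev, True)
        if output:
            yield output


stdlib_result = tuple(codecs.iterencode((), "utf-8-sig"))
delayed_result = tuple(iterencode_delayed((), "utf-8-sig"))
nonempty_result = b"".join(iterencode_delayed(["a", "b"], "utf-8-sig"))
```

For non-empty input the two approaches agree; only the empty case differs.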

    However, I agree that changing the incremental codec APIs is not ideal. Since nobody seems to care that much, it might be simpler to document that:

    • iterencode() only works where text str objects can be encoded, so base64-codec is not supported, but rot13-codec is supported
    • iterdecode() only works where bytes objects can be decoded, so rot13-codec is not supported, but base64-codec should be supported (pending other aspects of bpo-20132)
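The limitations in the two bullets above can be checked with a short sketch. It assumes the stdlib "rot_13" and "base64" codecs behave as described in this thread; `raises_type_error` is a hypothetical helper used only to capture the failure.

```python
import codecs


def raises_type_error(make_iterator):
    """Return True if consuming the iterator raises TypeError."""
    try:
        list(make_iterator())
    except TypeError:
        return True
    return False


# str-to-str codec: iterencode() works because it finalizes with "".
rot13_encoded = list(codecs.iterencode(["Hello"], "rot_13"))

# bytes-to-bytes codec: iterencode() fails when it finalizes with "".
base64_encode_fails = raises_type_error(
    lambda: codecs.iterencode([b"abc"], "base64"))

# str-to-str codec: iterdecode() fails when it finalizes with b"".
rot13_decode_fails = raises_type_error(
    lambda: codecs.iterdecode(["Uryyb"], "rot_13"))
```

Note that the failing cases raise only at the finalizing step, after the real chunks have already been yielded.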

    @vadmium
    Member Author

    vadmium commented Aug 20, 2016

    Here is my documentation proposal.

    @vadmium added the docs (Documentation in the Doc dir) label and removed the stdlib (Python modules in the Lib dir) label Aug 20, 2016
    @serhiy-storchaka
    Member

    > it might be simpler to document that

    Agreed.

    @python-dev
    Mannequin

    python-dev mannequin commented Oct 15, 2016

    New changeset 402eba63650c by Martin Panter in branch '3.5':
    Issue bpo-23231: Document codecs.iterencode(), iterdecode() shortcomings
    https://hg.python.org/cpython/rev/402eba63650c

    New changeset 0837940bcb9f by Martin Panter in branch '3.6':
    Issue bpo-23231: Merge codecs doc from 3.5 into 3.6
    https://hg.python.org/cpython/rev/0837940bcb9f

    New changeset 1955dcc27332 by Martin Panter in branch 'default':
    Issue bpo-23231: Merge codecs doc from 3.6
    https://hg.python.org/cpython/rev/1955dcc27332

    @vadmium vadmium added the 3.7 (EOL) end of life label Oct 15, 2016
    @vadmium vadmium closed this as completed Oct 15, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022