Fix codecs.iterencode/decode() by allowing data parameter to be omitted #67420
Comments
As mentioned in bpo-20132, iterencode() and iterdecode() only work on text-to-byte codecs, because they assume particular data types when finalizing the incremental codecs. This patch changes the signatures of the IncrementalEncoder and IncrementalDecoder methods from

    IncrementalEncoder.encode(object[, final])
    IncrementalDecoder.decode(object[, final])

to

    IncrementalEncoder.encode([object,] [final])
    IncrementalDecoder.decode([object,] [final])

so that iterencode() and iterdecode(), and perhaps in the future StreamWriter and StreamReader, can operate the incremental codec without knowing what kind of data should be processed. |
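
To make the proposed signature concrete, here is a minimal sketch of an incremental encoder whose data argument is optional. `OptionalInputZlibEncoder`, `iterencode_generic`, and `stock_iterencode_fails` are hypothetical names invented for illustration, and this is not the code from the patch; it only wraps the existing `encodings.zlib_codec.IncrementalEncoder` so that `encode(final=True)` can be called without data.

```python
import codecs
import zlib
from encodings import zlib_codec


class OptionalInputZlibEncoder(zlib_codec.IncrementalEncoder):
    """Hypothetical encoder: encode() may be called without data."""

    def encode(self, input=b"", final=False):
        # The codec itself supplies the correctly typed empty value.
        return super().encode(input, final)


def iterencode_generic(chunks, encoder):
    """Drive an encoder with the proposed signature; no data type is assumed."""
    for chunk in chunks:
        output = encoder.encode(chunk)
        if output:
            yield output
    # Finalize without passing "" or b"".
    output = encoder.encode(final=True)
    if output:
        yield output


def stock_iterencode_fails():
    """Return True if codecs.iterencode() raises TypeError for the zlib codec."""
    try:
        list(codecs.iterencode([b"abc", b"def"], "zlib"))
    except TypeError:
        return True
    return False


compressed = b"".join(
    iterencode_generic([b"abc", b"def"], OptionalInputZlibEncoder())
)
print(zlib.decompress(compressed))  # b'abcdef'
print(stock_iterencode_fails())
```

The stock `codecs.iterencode()` is expected to fail here because it finalizes the encoder with a `str`, which the bytes-to-bytes zlib encoder cannot compress.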
The original patch has lots of whitespace changes, probably because the generated codec code had not been regenerated for a long time. This diff ignores the whitespace changes, so it should be easier to review. |
Another idea that doesn’t involve changing the incremental codec APIs is described in <https://bugs.python.org/issue7475#msg145986>: add format parameters to iterencode() and iterdecode(), which would allow them to determine the right data type to finalize the codecs with. |
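
As a rough illustration of that alternative, the sketch below adds a parameter that tells the helper which empty value to finalize with. The function name `iterencode_typed` and its `input_type` parameter are assumptions made for this example; neither exists in the `codecs` module, and the linked message may have had a different parameter shape in mind.

```python
import codecs
import zlib


def iterencode_typed(iterator, encoding, input_type=str, errors="strict"):
    """Like codecs.iterencode(), but finalize with an empty input_type value.

    input_type is a hypothetical parameter, not part of the codecs module.
    """
    encoder = codecs.getincrementalencoder(encoding)(errors)
    for chunk in iterator:
        output = encoder.encode(chunk)
        if output:
            yield output
    # str() is "" and bytes() is b"", so the final call matches the codec's input.
    output = encoder.encode(input_type(), True)
    if output:
        yield output


text_result = b"".join(iterencode_typed(["ab", "c"], "utf-8"))
bytes_result = b"".join(
    iterencode_typed([b"abc", b"def"], "zlib", input_type=bytes)
)
print(text_result)                    # b'abc'
print(zlib.decompress(bytes_result))  # b'abcdef'
```

The incremental codec API stays untouched in this variant; the caller carries the burden of knowing the input type.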
The patch changes a public interface, which breaks compatibility with third-party codecs that implement it. There are other solutions to the iterencode/iterdecode problem. For example, we can buffer the iterated values and encode with a one-step delay:

    prev = sentinel = object()
    for input in iterator:
        if prev is not sentinel:
            output = encoder.encode(prev)
            if output:
                yield output
        prev = input
    if prev is not sentinel:
        output = encoder.encode(prev, True)
        if output:
            yield output

Or remember the previous value and use it to compute the empty value at the end (this works only if the input type supports slicing):

    prev = sentinel = object()
    for input in iterator:
        output = encoder.encode(input)
        if output:
            yield output
        prev = input
    if prev is not sentinel:
        output = encoder.encode(prev[:0], True)
        if output:
            yield output
|
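
For experimenting with the two loops above, they can be wrapped into generator functions. This is only a sketch: the names `iterencode_delayed` and `iterencode_sliced` are made up here, and the encoder is obtained with `codecs.getincrementalencoder()`, which the snippets above leave implicit.

```python
import codecs
import zlib

_SENTINEL = object()


def iterencode_delayed(iterator, encoding):
    """One-step delay: the last real chunk is encoded with final=True."""
    encoder = codecs.getincrementalencoder(encoding)()
    prev = _SENTINEL
    for chunk in iterator:
        if prev is not _SENTINEL:
            output = encoder.encode(prev)
            if output:
                yield output
        prev = chunk
    if prev is not _SENTINEL:
        output = encoder.encode(prev, True)
        if output:
            yield output


def iterencode_sliced(iterator, encoding):
    """Finalize with prev[:0], an empty value of the same type as the input."""
    encoder = codecs.getincrementalencoder(encoding)()
    prev = _SENTINEL
    for chunk in iterator:
        output = encoder.encode(chunk)
        if output:
            yield output
        prev = chunk
    if prev is not _SENTINEL:
        output = encoder.encode(prev[:0], True)
        if output:
            yield output


chunks = [b"abc", b"def"]
delayed = b"".join(iterencode_delayed(chunks, "zlib"))
sliced = b"".join(iterencode_sliced(chunks, "zlib"))
print(zlib.decompress(delayed))  # b'abcdef'
print(zlib.decompress(sliced))   # b'abcdef'
```

Both variants handle the bytes-to-bytes zlib codec as long as the iterator yields at least one chunk.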
Serhiy’s two proposals won’t work for codecs that produce non-empty output for empty input:

    >>> from codecs import encode, iterencode
    >>> tuple(iterencode((), "utf-8-sig"))
    (b'\xef\xbb\xbf',)
    >>> encode(b"", "uu")
    b'begin 666 <data>\n \nend\n'
    >>> encode(b"", "zlib")
    b'x\x9c\x03\x00\x00\x00\x00\x01'

However, I agree that changing the incremental codec APIs is not ideal. Since nobody seems to care that much, it might be simpler to document that:
|
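
The limitation with empty input can be reproduced with a compact version of the delayed loop. The helper name `iterencode_delayed` is hypothetical; the point of the sketch is only that an empty iterator never reaches the finalizing call, so the BOM or zlib framing is never emitted.

```python
import codecs
import zlib


def iterencode_delayed(iterator, encoding):
    """Hypothetical helper: encode with a one-step delay (see discussion above)."""
    encoder = codecs.getincrementalencoder(encoding)()
    prev = sentinel = object()
    for chunk in iterator:
        if prev is not sentinel:
            output = encoder.encode(prev)
            if output:
                yield output
        prev = chunk
    if prev is not sentinel:
        output = encoder.encode(prev, True)
        if output:
            yield output


# With no chunks at all, the encoder is never finalized, so nothing is produced.
empty_via_delay = b"".join(iterencode_delayed(iter(()), "utf-8-sig"))

# One-shot encoding of empty input still produces output for these codecs.
bom_only = codecs.encode("", "utf-8-sig")
empty_zlib = codecs.encode(b"", "zlib")

print(empty_via_delay)               # b''
print(bom_only == codecs.BOM_UTF8)   # True
print(zlib.decompress(empty_zlib))   # b''
```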
Here is my documentation proposal. |
Agreed. |
New changeset 402eba63650c by Martin Panter in branch '3.5':
New changeset 0837940bcb9f by Martin Panter in branch '3.6':
New changeset 1955dcc27332 by Martin Panter in branch 'default': |