Message207374
Many of the incremental codecs do not handle fragmented data very well. In the past I think I was interested in using the Base-64 and Quoted-printable codecs, and playing with other codecs today revealed many more issues. Many of these issues come down to missing functionality, so the simplest solution might be to document which codecs do not work incrementally.
Incremental decoding issues:
>>> str().join(codecs.iterdecode(iter((b"\\", b"u2013")), "unicode-escape"))
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 0: \ at end of string
# Same deal for raw-unicode-escape.
>>> bytes().join(codecs.iterdecode(iter((b"3", b"3")), "hex-codec"))
binascii.Error: Odd-length string
>>> bytes().join(codecs.iterdecode(iter((b"A", b"A==")), "base64-codec"))
binascii.Error: Incorrect padding
>>> bytes().join(codecs.iterdecode(iter((b"=", b"3D")), "quopri-codec"))
b'3D' # Should return b"="
>>> codecs.getincrementaldecoder("uu-codec")().decode(b"begin ")
ValueError: Truncated input data
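For the simpler encodings, fragment handling could be added by holding back incomplete input between calls. The following is only a minimal sketch for hex decoding, built on codecs.BufferedIncrementalDecoder. The class name BufferedHexDecoder is hypothetical, and this is not how the stdlib hex-codec is implemented; it also assumes the input contains no whitespace.

```python
import binascii
import codecs


class BufferedHexDecoder(codecs.BufferedIncrementalDecoder):
    """Hypothetical hex decoder that tolerates input split mid-byte."""

    def _buffer_decode(self, input, errors, final):
        # Two hex digits make one byte. Unless this is the final call,
        # hold back a trailing odd digit; the base class stores the
        # unconsumed tail in self.buffer and prepends it next time.
        usable = len(input) if final else len(input) - len(input) % 2
        # On the final call an odd-length remainder still raises
        # binascii.Error, which is the desired behaviour for bad input.
        return binascii.unhexlify(input[:usable]), usable


decoder = BufferedHexDecoder()
result = decoder.decode(b"3") + decoder.decode(b"3", final=True)
# result is b"3" (the single byte 0x33)
```

The same buffering idea would apply to Base-64 (groups of four characters) and to a trailing "=" in Quoted-printable, with different boundary rules.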
Incremental encoding issues:
>>> e = codecs.getincrementalencoder("base64-codec")(); codecs.decode(e.encode(b"1") + e.encode(b"2", final=True), "base64-codec")
b'1' # Should be b"12"
>>> e = codecs.getincrementalencoder("quopri-codec")(); e.encode(b"1" * 50) + e.encode(b"2" * 50, final=True)
b'1111111111111111111111111111111111111111111111111122222222222222222222222222222222222222222222222222'
# I suspect the line should have been split in two
>>> e = codecs.getincrementalencoder("uu-codec")(); codecs.decode(e.encode(b"1") + e.encode(b"2", final=True), "uu-codec")
b'1' # Should be b"12"
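One way the Base-64 encoder could keep state is to carry over input that does not fill a complete 3-byte group. This is a rough sketch under my own assumptions: the class name BufferedBase64Encoder is hypothetical, and unlike base64-codec it uses base64.b64encode(), so it does not insert line breaks.

```python
import base64
import codecs


class BufferedBase64Encoder(codecs.IncrementalEncoder):
    """Hypothetical Base-64 encoder that only pads on the final call."""

    def __init__(self, errors="strict"):
        super().__init__(errors)
        self._pending = b""  # bytes not yet forming a full 3-byte group

    def encode(self, input, final=False):
        data = self._pending + bytes(input)
        # Encode only whole 3-byte groups until the final call, so that
        # "=" padding can appear only at the very end of the stream.
        usable = len(data) if final else len(data) - len(data) % 3
        self._pending = data[usable:]
        return base64.b64encode(data[:usable])

    def reset(self):
        super().reset()
        self._pending = b""


encoder = BufferedBase64Encoder()
encoded = encoder.encode(b"1") + encoder.encode(b"2", final=True)
decoded = base64.b64decode(encoded)
# decoded is b"12", rather than the b"1" shown above
```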
I also noticed iterdecode() does not work with “uu-codec”:
>>> bytes().join(codecs.iterdecode(iter((b"begin 666 <data>\n \nend\n",)), "uu-codec"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.3/codecs.py", line 1032, in iterdecode
output = decoder.decode(b"", True)
File "/usr/lib/python3.3/encodings/uu_codec.py", line 80, in decode
return uu_decode(input, self.errors)[0]
File "/usr/lib/python3.3/encodings/uu_codec.py", line 45, in uu_decode
raise ValueError('Missing "begin" line in input data')
ValueError: Missing "begin" line in input data
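Until the codec keeps state, the only workaround I can see is to give up on incremental decoding and join the fragments first. A small sketch, where decode_whole is a hypothetical helper name:

```python
import codecs


def decode_whole(chunks, encoding, errors="strict"):
    """Join all byte fragments and decode them in a single call.

    This avoids the stateless incremental decoder entirely, at the
    cost of holding the complete input in memory.
    """
    return codecs.decode(b"".join(chunks), encoding, errors)


# The fragmented hex input from earlier decodes correctly this way.
whole = decode_whole(iter((b"3", b"3")), "hex-codec")
```

This obviously defeats the purpose of an incremental interface, so it is only a stopgap.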
And iterencode() does not work with any of the byte encoders, because it does not know what kind of empty string to pass to IncrementalEncoder.encode(final=True):
>>> bytes().join(codecs.iterencode(iter(()), "base64-codec"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.3/codecs.py", line 1014, in iterencode
output = encoder.encode("", True)
File "/usr/lib/python3.3/encodings/base64_codec.py", line 31, in encode
return base64.encodebytes(input)
File "/usr/lib/python3.3/base64.py", line 343, in encodebytes
raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str
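For comparison, a bytes-aware variant of iterencode() only needs to flush with b"" instead of "". The sketch below uses a hypothetical helper name, iterencode_bytes, and assumes every input chunk is a bytes object. It avoids the TypeError, but it does not fix the stateless encoders themselves, so fragmented input can still be encoded incorrectly.

```python
import codecs


def iterencode_bytes(chunks, encoding, errors="strict"):
    """Like codecs.iterencode(), but for bytes-to-bytes encoders."""
    encoder = codecs.getincrementalencoder(encoding)(errors)
    for chunk in chunks:
        output = encoder.encode(chunk)
        if output:
            yield output
    # Flush with an empty bytes object rather than an empty str.
    output = encoder.encode(b"", True)
    if output:
        yield output


# An empty input no longer raises TypeError.
empty = b"".join(iterencode_bytes(iter(()), "base64-codec"))
```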
Finally, incremental UTF-7 encoding is suboptimal, and the decoder seems to buffer unlimited data, both defeating the purpose of using an incremental codec:
>>> bytes().join(codecs.iterencode("\xA9" * 2, "utf-7"))
b'+AKk-+AKk-' # b"+AKkAqQ-" would be better
>>> d = codecs.getincrementaldecoder("utf-7")()
>>> d.decode(b"+")
''
>>> any(d.decode(b"AAAAAAAA" * 100000) for _ in range(10))
False # No data returned: everything must be buffered
>>> d.decode(b"-") == "\x00" * 3000000
True # Returned all buffered data in one go
Date | User | Action | Args
2014-01-05 13:48:58 | martin.panter | set | recipients: + martin.panter
2014-01-05 13:48:58 | martin.panter | set | messageid: <1388929738.44.0.851837204966.issue20132@psf.upfronthosting.co.za>
2014-01-05 13:48:58 | martin.panter | link | issue20132 messages
2014-01-05 13:48:57 | martin.panter | create |