Message 172558 - Python tracker


Author	doerwalter
Recipients	Marcus.Gröber, doerwalter, ezio.melotti, lovelylain, serhiy.storchaka, vstinner
Date	2012-10-10.09:49:29
Message-id	<1349862569.84.0.138655370126.issue15278@psf.upfronthosting.co.za>
In-reply-to

Content
> >>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2])
> ('', 0)
>
> Oh... codecs.CODEC_decode are incremental decoders? I completely misunderstood this.

No, those functions are not decoders; they're just helper functions used to implement the real incremental decoders. That's why they're undocumented.
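To illustrate how such a helper gets wired into a real incremental decoder, here is a minimal sketch. The class name SketchUTF8Decoder is made up for this example; it assumes the codecs.BufferedIncrementalDecoder base class, which keeps unconsumed bytes in a buffer and delegates the actual decoding to _buffer_decode(). The helper returns a (decoded_text, bytes_consumed) pair, as the ('', 0) results in this message show.

```python
import codecs


class SketchUTF8Decoder(codecs.BufferedIncrementalDecoder):
    """Hypothetical decoder, for illustration only."""

    def _buffer_decode(self, input, errors, final):
        # Delegate to the undocumented helper. It returns a tuple of
        # (decoded_text, number_of_bytes_consumed); the base class keeps
        # any unconsumed bytes and prepends them to the next chunk.
        return codecs.utf_8_decode(input, errors, final)


euro = '\u20ac'.encode('utf-8')      # three bytes: b'\xe2\x82\xac'

decoder = SketchUTF8Decoder()
part1 = decoder.decode(euro[:2])     # incomplete sequence, nothing decoded yet
part2 = decoder.decode(euro[2:])     # buffered bytes + last byte form the character
```

Here part1 is the empty string and part2 is the euro sign, because the first two bytes were held back until the sequence was complete.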

Whether codecs.utf_8_decode() returns partial results or raises an exception depends on the final argument::

    >>> s = '\u20ac'.encode('utf8')[:2]
    >>> codecs.utf_8_decode(s, 'strict')
    ('', 0)
    >>> codecs.utf_8_decode(s, 'strict', False)
    ('', 0)
    >>> codecs.utf_8_decode(s, 'strict', True)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

If you look at encodings/utf_8.py, you see that the stateless decoder calls codecs.utf_8_decode() with final=True::

    def decode(input, errors='strict'):
        return codecs.utf_8_decode(input, errors, True)

so the stateless decoder *will* raise exceptions for incomplete input. The incremental decoder simply passes on the final argument given to its decode() method.
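The difference can be seen through the public API as well. The following is a small sketch, not a test from the tracker; the variable names are chosen for this example, and it assumes only codecs.decode() and codecs.getincrementaldecoder().

```python
import codecs

truncated = '\u20ac'.encode('utf-8')[:2]   # first two of three bytes

# Stateless decoding always behaves as if final=True, so truncated
# input is an error.
try:
    codecs.decode(truncated, 'utf-8')
    stateless_raised = False
except UnicodeDecodeError:
    stateless_raised = True

# Incremental decoding with the default final=False keeps the
# incomplete bytes and returns what it could decode so far.
incremental = codecs.getincrementaldecoder('utf-8')()
partial = incremental.decode(truncated)

# Passing final=True tells the decoder no more input will follow,
# so the same truncated data is now rejected.
strict_final = codecs.getincrementaldecoder('utf-8')()
try:
    strict_final.decode(truncated, final=True)
    final_raised = False
except UnicodeDecodeError:
    final_raised = True
```

Both stateless_raised and final_raised end up True, while partial is the empty string.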
