Message 145894 - Python tracker

Author	vstinner
Recipients	amaury.forgeotdarc, loewis, vstinner
Date	2011-10-19.08:15:55
Message-id	<2373475.cDgd2V1OPN@dsk000552>
In-reply-to	<1319010824.1.0.189730635035.issue13216@psf.upfronthosting.co.za>

Content
> We shouldn't use the MS codec if we have our own, as they may differ.

Ok, I agree. The MS codec has a nice replacement behaviour (it searches for a
similar glyph): cp1252 encodes Ł to b'L', for example. Our codec raises a
UnicodeEncodeError on u'\u0141'.encode('cp1252').
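
For illustration, here is a minimal sketch of the behaviour of Python's own cp1252 codec described above. It only exercises the built-in codec; the variable names are hypothetical, and the MS best-fit mapping to b'L' is not reproduced here because it requires the Windows API.

```python
# U+0141 (Ł) has no mapping in Python's built-in cp1252 codec.
text = '\u0141'

# With the default 'strict' handler the codec raises instead of
# substituting a similar glyph.
try:
    text.encode('cp1252')
    strict_result = 'encoded'
except UnicodeEncodeError:
    strict_result = 'UnicodeEncodeError'

# The 'replace' handler substitutes a question mark, not b'L'.
replaced = text.encode('cp1252', 'replace')

print(strict_result, replaced)
```

The difference from the MS codec is that any substitution here comes from the Python error handler, not from a glyph-similarity table.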

> As for the 65001 bug: is that actually solved by this codec?

Sorry, which bug?

See the tests using CP_UTF8 in test_codecs. Depending on the Windows version,
you don't get the same behaviour on surrogates. Before Windows Vista,
surrogates were always encoded, whereas on Vista and later you can choose the
behaviour using the Python error handler:

        if self.vista_or_later():
            tests.append(('\udc80', 'strict', None)) # None=UnicodeEncodeError
            tests.append(('\udc80', 'ignore', b''))
            tests.append(('\udc80', 'replace', b'?'))
        else:
            tests.append(('\udc80', 'strict', b'\xed\xb2\x80'))
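
As a rough, platform-independent comparison, the same lone surrogate can be run through Python's built-in 'utf-8' codec. This is only a sketch: it does not call the Windows CP_UTF8 code page, and the helper name encode_or_none is hypothetical. The 'surrogatepass' handler produces the same bytes as the pre-Vista branch above.

```python
def encode_or_none(text, errors):
    """Encode text as UTF-8; return None if UnicodeEncodeError is raised."""
    try:
        return text.encode('utf-8', errors)
    except UnicodeEncodeError:
        return None

# Lone low surrogate, as in the test cases above.
surrogate = '\udc80'

results = {
    errors: encode_or_none(surrogate, errors)
    for errors in ('strict', 'ignore', 'replace', 'surrogatepass')
}

for errors, encoded in results.items():
    print(errors, encoded)
```

With the built-in codec, 'strict' fails, 'ignore' and 'replace' behave like the Vista-or-later cases, and 'surrogatepass' yields b'\xed\xb2\x80'.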

History
Date	User	Action	Args
2011-10-19 08:15:56	vstinner	set	recipients: + vstinner, loewis, amaury.forgeotdarc
2011-10-19 08:15:55	vstinner	link	issue13216 messages
2011-10-19 08:15:55	vstinner	create