Message152310
Attached patch does the following beyond what the patch from haypo does:
* call the error handler
* reject 0xd800~0xdfff when decoding utf-32
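The second bullet can be illustrated with a minimal pure-Python sketch. The function name `decode_utf32_be_strict` is hypothetical and is not part of the patch; it only shows the kind of check a strict UTF-32 decoder is expected to make, assuming big-endian input without a BOM.

```python
import struct


def decode_utf32_be_strict(data: bytes) -> str:
    """Hypothetical illustration: decode big-endian UTF-32, rejecting surrogates."""
    if len(data) % 4:
        raise ValueError("truncated data")
    chars = []
    for (code_point,) in struct.iter_unpack(">I", data):
        # Surrogate code points are reserved for UTF-16 and are not valid scalars.
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"surrogate U+{code_point:04X} not allowed in UTF-32")
        if code_point > 0x10FFFF:
            raise ValueError(f"code point {code_point:#x} out of range")
        chars.append(chr(code_point))
    return "".join(chars)


ok = decode_utf32_be_strict(b"\x00\x00\x00A")  # plain "A" decodes fine
try:
    decode_utf32_be_strict(b"\x00\x00\xdc\x80")  # lone surrogate U+DC80
    rejected = False
except ValueError:
    rejected = True
```

The real codec would report the problem through the registered error handler rather than raising `ValueError` directly; that part is left out of this sketch.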
The following are on my TODO list, although this patch doesn't depend on any of them and can be reviewed and landed separately:
* make the surrogatepass error handler work for utf-16 and utf-32. (I should be able to finish this by today)
* fix an error in the error handler for utf-16-le. (In Python 3.2, b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A" for some reason)
* make unicode_encode_call_errorhandler return bytes so that we can simplify this patch. (This arguably belongs to a separate bug so I'll file it when needed)
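For reference, a short sketch of the behaviour the TODO items are aiming at. The UTF-8 part relies on the existing surrogatepass support mentioned in the discussion; the final call only prints whatever the running interpreter returns for the utf-16-be example, since the result depends on the Python version. The variable names are illustrative only.

```python
lone = "\udc80"

# Strict UTF-8 encoding refuses a lone surrogate.
try:
    lone.encode("utf-8")
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

# surrogatepass lets it through and round-trips it.
utf8_bytes = lone.encode("utf-8", "surrogatepass")
round_tripped = utf8_bytes.decode("utf-8", "surrogatepass")

# The utf-16-be example from the TODO list: a lone low surrogate (0xDC80)
# followed by "A" (0x0041). With 'ignore', the expected result is "A".
data = b"\xdc\x80\x00\x41"
print(repr(data.decode("utf-16-be", "ignore")))
```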
> All UTF codecs should reject lone surrogates in strict error mode,
Should we really reject lone surrogates for UTF-7? There's a test in test_codecs.py that expects "\udc80" to be encoded as b"+3IA-". Given that UTF-7 is not really part of the Unicode Standard and seems to me more like a "data encoding" than a "text encoding", I am not sure it's a good idea.
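To show where b"+3IA-" comes from, here is a small sketch that builds it by hand: UTF-7 writes the 16-bit code unit as unpadded base64 between "+" and "-". The names `raw`, `b64` and `by_hand` are illustrative only, and the comparison with the built-in codec is printed rather than assumed.

```python
import base64

code_unit = 0xDC80                       # the lone surrogate from the test
raw = code_unit.to_bytes(2, "big")       # UTF-7 base64-encodes UTF-16-BE code units
b64 = base64.b64encode(raw).rstrip(b"=") # drop the "=" padding
by_hand = b"+" + b64 + b"-"

print(by_hand)
# Compare with what the running interpreter's utf-7 codec produces.
print("\udc80".encode("utf-7") == by_hand)
```

Nothing in this transformation looks at whether the code unit is a surrogate, which is the sense in which UTF-7 behaves like a data encoding here.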
> but let them pass using the surrogatepass error handler (the UTF-8
> codec already does) and apply the usual error handling for ignore
> and replace.
For 'replace', the patch now emits b"\x00?" instead of b"?" so that the UTF-16 stream doesn't get corrupted. This is not "usual" and does not match
# Implements the ``replace`` error handling: malformed data is replaced
# with a suitable replacement character such as ``'?'`` in bytestrings
# and ``'\ufffd'`` in Unicode strings.
in the documentation. What do we do? Are there other encodings that are not ASCII compatible besides UTF-7, UTF-16 and UTF-32 that Python supports? I think it would be better to use encoded U+fffd whenever possible and fall back to '?'. What do you think?
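A minimal sketch of why a bare b"?" is a problem in a UTF-16 stream. The helper `splice` is hypothetical; it simply places a replacement byte string between the encoded forms of "A" and "B", assuming utf-16-be.

```python
def splice(replacement: bytes) -> bytes:
    """Hypothetical helper: put `replacement` between UTF-16-BE "A" and "B"."""
    return "A".encode("utf-16-be") + replacement + "B".encode("utf-16-be")


# A two-byte replacement keeps the code units aligned.
aligned = splice(b"\x00?").decode("utf-16-be")

# A one-byte b"?" shifts everything after it by one byte, so the stream
# no longer has a whole number of 16-bit code units.
try:
    splice(b"?").decode("utf-16-be")
    corrupted = False
except UnicodeDecodeError:
    corrupted = True

# The alternative suggested above: use the encoded form of U+FFFD.
with_fffd = splice("\ufffd".encode("utf-16-be")).decode("utf-16-be")
```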
Some other self comments on my patch:
* In the STORECHAR macro for utf-16 and utf-32, I changed all instances of "ch & 0xFF" to "(unsigned char) ch". I don't have enough C knowledge to know if this is actually better or if it makes any difference at all.
* The code for utf-16 and utf-32 duplicates the utf-8 code, whose complexity comes from issue #8092. I'm not sure whether there are ways to simplify it. For example, are there suitable functions so that we don't need to check for integer overflow in these places?
Date | User | Action | Args
2012-01-30 07:30:01 | kennyluck | set | recipients: + kennyluck, lemburg, gvanrossum, loewis, vstinner, ezio.melotti, tchrist
2012-01-30 07:30:00 | kennyluck | set | messageid: <1327908600.25.0.550432988702.issue12892@psf.upfronthosting.co.za>
2012-01-30 07:29:59 | kennyluck | link | issue12892 messages
2012-01-30 07:29:58 | kennyluck | create |