Message 152310 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	kennyluck
Recipients	ezio.melotti, gvanrossum, kennyluck, lemburg, loewis, tchrist, vstinner
Date	2012-01-30.07:29:57
SpamBayes Score	2.7755576e-16
Marked as misclassified	No
Message-id	<1327908600.25.0.550432988702.issue12892@psf.upfronthosting.co.za>
In-reply-to

Content
Attached patch does the following beyond what the patch from haypo does: * call the error handler * reject 0xd800~0xdfff when decoding utf-32 The followings are on my TODO list, although this patch doesn't depend on any of these and can be reviewed and landed separately: * make the surrogatepass error handler work for utf-16 and utf-32. (I should be able to finish this by today) * fix an error in the error handler for utf-16-le. (In, Python3.2 b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A" for some reason) * make unicode_encode_call_errorhandler return bytes so that we can simplify this patch. (This arguably belongs to a separate bug so I'll file it when needed) > All UTF codecs should reject lone surrogates in strict error mode, Should we really reject lone surrogates for UTF-7? There's a test in test_codecs.py that tests "\udc80" to be encoded b"+3IA-" (. Given that UTF-7 is not really part of the Unicode Standard and it is more like a "data encoding" than a "text encoding" to me, I am not sure it's a good idea. > but let them pass using the surrogatepass error handler (the UTF-8 > codec already does) and apply the usual error handling for ignore > and replace. For 'replace', the patch now emits b"\x00?" instead of b"?" so that UTF-16 stream doesn't get corrupted. It is not "usual" and not matching # Implements the ``replace`` error handling: malformed data is replaced # with a suitable replacement character such as ``'?'`` in bytestrings # and ``'\ufffd'`` in Unicode strings. in the documentation. What do we do? Are there other encodings that are not ASCII compatible besides UTF-7, UTF-16 and UTF-32 that Python supports? I think it would be better to use encoded U+fffd whenever possible and fall back to '?'. What do you think? Some other self comments on my patch: * In the STORECHAR macro for utf-16 and utf-32, I change all instances of "ch & 0xFF" to (unsigned char) ch. I don't have enough C knowledge to know if this is actually better or if this makes any difference at all. * The code for utf-16 and utf-32 are duplicates of the uft-8 one. That one's complexity comes from issue #8092 . Not sure if there are ways to simplify these. For example, are there suitable functions there so that we don't need to check integer overflow at these places?

Attached patch does the following beyond what the patch from haypo does:
  * call the error handler
  * reject 0xd800~0xdfff when decoding utf-32

The followings are on my TODO list, although this patch doesn't depend on any of these and can be reviewed and landed separately:
  * make the surrogatepass error handler work for utf-16 and utf-32. (I should be able to finish this by today)
  * fix an error in the error handler for utf-16-le. (In, Python3.2 b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A" for some reason)
  * make unicode_encode_call_errorhandler return bytes so that we can simplify this patch. (This arguably belongs to a separate bug so I'll file it when needed)

> All UTF codecs should reject lone surrogates in strict error mode,

Should we really reject lone surrogates for UTF-7? There's a test in test_codecs.py that tests "\udc80" to be encoded b"+3IA-" (. Given that UTF-7 is not really part of the Unicode Standard and it is more like a "data encoding" than a "text encoding" to me, I am not sure it's a good idea.

> but let them pass using the surrogatepass error handler (the UTF-8
> codec already does) and apply the usual error handling for ignore
> and replace.

For 'replace', the patch now emits b"\x00?" instead of b"?" so that UTF-16 stream doesn't get corrupted. It is not "usual" and not matching

  # Implements the ``replace`` error handling: malformed data is replaced
  # with a suitable replacement character such as ``'?'`` in bytestrings 
  # and ``'\ufffd'`` in Unicode strings.

in the documentation. What do we do? Are there other encodings that are not ASCII compatible besides UTF-7, UTF-16 and UTF-32 that Python supports? I think it would be better to use encoded U+fffd whenever possible and fall back to '?'. What do you think?

Some other self comments on my patch:
  * In the STORECHAR macro for utf-16 and utf-32, I change all instances of "ch & 0xFF" to (unsigned char) ch. I don't have enough C knowledge to know if this is actually better or if this makes any difference at all.
  * The code for utf-16 and utf-32 are duplicates of the uft-8 one. That one's complexity comes from issue #8092 . Not sure if there are ways to simplify these. For example, are there suitable functions there so that we don't need to check integer overflow at these places?

History
Date	User	Action	Args
2012-01-30 07:30:01	kennyluck	set	recipients: + kennyluck, lemburg, gvanrossum, loewis, vstinner, ezio.melotti, tchrist
2012-01-30 07:30:00	kennyluck	set	messageid: <1327908600.25.0.550432988702.issue12892@psf.upfronthosting.co.za>
2012-01-30 07:29:59	kennyluck	link	issue12892 messages
2012-01-30 07:29:58	kennyluck	create