Message 248980 - Python tracker


Author	serhiy.storchaka
Recipients	ezio.melotti, lemburg, loewis, serhiy.storchaka, vstinner
Date	2015-08-21.20:52:30
Message-id	<1440190360.57.0.213725071437.issue24848@psf.upfronthosting.co.za>
In-reply-to

Content

There is a reason for the behavior in case 2. The input is likely truncated data, and it is safer to raise an exception than to silently produce a lone surrogate. The current UTF-7 encoder always adds '-' after the end of a shift sequence. I don't think this is a bug.

However, there are three more bugs.

4. The decoder can emit a lone high surrogate before the replacement character when an error occurs.

>>> b'+2DTdI-'.decode('utf-7', 'replace')
'\ud834�'

A high surrogate ('\ud834' here) is part of an incomplete astral character and shouldn't be emitted when an error occurs inside an encoded astral character.
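
To make this case easier to check on a given build, here is a minimal sketch. The helper name `lone_surrogates` is hypothetical (it is not part of the patch); it relies only on the fact that any code point in the range U+D800..U+DFFF inside a Python str is an unpaired surrogate. The output of the last line depends on the Python version, so no expected result is shown.

```python
def lone_surrogates(text):
    """Return the unpaired surrogate code points found in text (hypothetical helper)."""
    # In a Python str a well-formed astral character is a single code point,
    # so anything in U+D800..U+DFFF is a lone surrogate.
    return [hex(ord(ch)) for ch in text if 0xD800 <= ord(ch) <= 0xDFFF]

# A complete astral character round-trips through UTF-7 without surrogates.
clef = '\U0001d11e'
assert clef.encode('utf-7').decode('utf-7') == clef
assert lone_surrogates(clef) == []

# The truncated input from this report; the result varies between versions.
decoded = b'+2DTdI-'.decode('utf-7', 'replace')
print(ascii(decoded), lone_surrogates(decoded))
```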

5. According to RFC 2152: "A "+" character followed immediately by any character other than members of set B or "-" is an ill-formed sequence." But the current decoder accepts this as an empty shift sequence, which decodes to an empty string.

>>> b'a+,b'.decode('utf-7')
'a,b'
>>> b'a+'.decode('utf-7')
'a'
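
The quoted rule can be sketched as a small scanner, independent of the decoder. The function name `ill_formed_plus_offsets` is hypothetical, and treating a '+' at the very end of the data as ill-formed is an assumption based on the second example above; the RFC sentence itself only talks about the following character.

```python
import string

# Set B from RFC 2152: the modified Base64 alphabet (as byte values).
SET_B = frozenset((string.ascii_letters + string.digits + '+/').encode('ascii'))

def ill_formed_plus_offsets(data):
    """Return offsets of '+' bytes not followed by a set B character or '-'."""
    offsets = []
    i, n = 0, len(data)
    while i < n:
        if data[i] != ord('+'):
            i += 1
            continue
        following = data[i + 1] if i + 1 < n else None
        if following == ord('-'):
            i += 2                      # "+-" is the escape for a literal '+'
        elif following is not None and following in SET_B:
            i += 1                      # skip the whole shift sequence
            while i < n and data[i] in SET_B:
                i += 1
            if i < n and data[i] == ord('-'):
                i += 1                  # explicit end of the shift sequence
        else:
            offsets.append(i)           # assumption: trailing '+' counts too
            i += 1
    return offsets

print(ill_formed_plus_offsets(b'a+,b'), ill_formed_plus_offsets(b'a+'))
```

Such a check could be run before decoding if strict RFC behavior is needed regardless of what the codec accepts.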

6. The replacement character '\ufffd' can be replaced with the character 'ý' ('\xfd'):

>>> b'\xff'.decode('utf-7', 'replace')
'�'
>>> b'a\xff'.decode('utf-7', 'replace')
'a�'
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'
>>> b'\xffb'.decode('utf-7', 'replace')
'ýb'

This bug is reproducible only in 3.4+.

The following patch fixes bugs 1 and 4 and adds more tests.
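
A small sketch for comparing the four inputs across versions. The helper `code_points` is hypothetical. The only thing it adds to the examples above is the observation that 0xFD is the low byte of 0xFFFD; that is arithmetic, not a diagnosis of where the value gets truncated. Output is not shown because it differs between versions.

```python
def code_points(text):
    """Format each character of text as U+XXXX (hypothetical helper)."""
    return ['U+%04X' % ord(ch) for ch in text]

for raw in (b'\xff', b'a\xff', b'a\xffb', b'\xffb'):
    decoded = raw.decode('utf-7', 'replace')
    # Report the code points and whether U+FFFD survived in the result.
    print(raw, code_points(decoded), '\ufffd' in decoded)

# 'ý' (U+00FD) is what remains of U+FFFD when only the low 8 bits are kept.
assert 0xFFFD & 0xFF == 0xFD
```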

Corner cases 2 and 3 are likely not bugs.

I have doubts about fixing bug 5: iconv accepts such ill-formed sequences. In any case, I think a fix for this bug can be applied only to the default branch.

I have no idea how to fix bug 6. I'm afraid it may be a bug in _PyUnicodeWriter and could therefore affect other decoders.
