classification
Title: Broken error handling in codecs.unicode_escape_decode()
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.4, Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: 16980 Superseder:
Assigned To: serhiy.storchaka Nosy List: ezio.melotti, python-dev, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-01-16 10:46 by serhiy.storchaka, last changed 2013-01-29 09:50 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded
unicode_escape_decode_error_handling-2.7.patch serhiy.storchaka, 2013-01-25 22:58
unicode_escape_decode_error_handling-3.2.patch serhiy.storchaka, 2013-01-25 22:58
unicode_escape_decode_error_handling-3.3.patch serhiy.storchaka, 2013-01-25 22:58
unicode_escape_decode_error_handling-3.4.patch serhiy.storchaka, 2013-01-25 22:58
Messages (7)
msg180077 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-16 10:46
The error handler in unicode_escape_decode() swallows one or more bytes after an illegal escape sequence.

>>> import codecs
>>> codecs.unicode_escape_decode(br'\u!@#', 'replace')
('�', 5)
>>> codecs.unicode_escape_decode(br'\u!@#$', 'replace')
('�@#$', 6)

raw_unicode_escape_decode() works right:

>>> codecs.raw_unicode_escape_decode(br'\u!@#', 'replace')
('�!@#', 5)
>>> codecs.raw_unicode_escape_decode(br'\u!@#$', 'replace')
('�!@#$', 6)

See also issue16975.
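
The expected behavior can be checked with a short script. This is only a sketch, assuming an interpreter that already includes the fix from this issue; the helper name trailing_bytes_survive is hypothetical and not part of any patch. It verifies that the bytes following the illegal escape are still present in the decoded text.

```python
import codecs

def trailing_bytes_survive(data, tail):
    """Return True if `tail` is kept after decoding `data` with 'replace'.

    Hypothetical helper for illustration; not taken from the patches.
    """
    decoded, consumed = codecs.unicode_escape_decode(data, 'replace')
    # The whole input must be consumed, and the characters after the
    # illegal escape must not be eaten by the error handler.
    return decoded.endswith(tail) and consumed == len(data)

# Inputs taken from the report above.
samples = [(br'\u!@#', '!@#'), (br'\u!@#$', '!@#$')]
results = [trailing_bytes_survive(data, tail) for data, tail in samples]
print(results)
```

On an unfixed interpreter the same script would be expected to report False for these inputs, since the output shown in the report lacks the characters after the escape.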
msg180091 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-16 14:50
Here is a patch for 3.4. Patches for other versions will differ considerably.
msg180634 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-25 22:58
Here is a set of patches for all versions (patch for 3.4 updated).
msg180857 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-28 14:20
Ezio, is this a good factorization?

    def check(self, coder):
        def checker(input, expect):
            self.assertEqual(coder(input), (expect, len(input)))
        return checker

    def test_escape_decode(self):
        decode = codecs.unicode_escape_decode
        check = self.check(decode)
        check(b"[\\\n]", "[]")
        check(br'[\"]', '["]')
        check(br"[\']", "[']")
        # other 20 checks ...

And the same for test_escape_encode and for the bytes escape decoder.
msg180890 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-01-29 00:33
LGTM.
If you want to push it even further you could make a list of (input, expected) and call the check() in a loop.  That way it will also be easier to refactor if/when we add subtests (#16997).
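
A minimal sketch of the loop-based variant suggested above, assuming the same check() helper. The class name EscapeDecodeTest and the cases list are illustrative, not taken from the committed patches; only the three checks quoted in this thread are included.

```python
import codecs
import unittest

class EscapeDecodeTest(unittest.TestCase):
    # Hypothetical test class; the name is not from the actual patch.

    def check(self, coder):
        def checker(input, expect):
            # The coder must return the decoded text and consume all input.
            self.assertEqual(coder(input), (expect, len(input)))
        return checker

    def test_escape_decode(self):
        check = self.check(codecs.unicode_escape_decode)
        # (input, expected) pairs; only the cases quoted in the thread.
        cases = [
            (b"[\\\n]", "[]"),
            (br'[\"]', '["]'),
            (br"[\']", "[']"),
        ]
        for data, expected in cases:
            check(data, expected)
```

The class can be run with python -m unittest. One trade-off of the loop is that a plain assertion failure stops at the first bad pair, and the traceback points at the loop line rather than at a specific case.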
msg180896 - Author: Roundup Robot (python-dev) Date: 2013-01-29 08:53
New changeset a242ac99161f by Serhiy Storchaka in branch '2.7':
Issue #16979: Fix error handling bugs in the unicode-escape-decode decoder.
http://hg.python.org/cpython/rev/a242ac99161f

New changeset 084bec5443d6 by Serhiy Storchaka in branch '3.2':
Issue #16979: Fix error handling bugs in the unicode-escape-decode decoder.
http://hg.python.org/cpython/rev/084bec5443d6

New changeset 086defaf16fe by Serhiy Storchaka in branch '3.3':
Issue #16979: Fix error handling bugs in the unicode-escape-decode decoder.
http://hg.python.org/cpython/rev/086defaf16fe

New changeset 218da678bb8b by Serhiy Storchaka in branch 'default':
Issue #16979: Fix error handling bugs in the unicode-escape-decode decoder.
http://hg.python.org/cpython/rev/218da678bb8b
msg180897 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-29 09:48
Until subtests are added, explicit calls look better to me. And when subtests are added, we will just add a subtest inside the helper function.
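
For illustration, this is roughly what adding a subtest inside the helper could look like. It is a sketch that assumes the TestCase.subTest() context manager available in unittest since Python 3.4; the class name UnicodeEscapeDecodeTest is hypothetical and this is not code from the committed changesets.

```python
import codecs
import unittest

class UnicodeEscapeDecodeTest(unittest.TestCase):
    # Hypothetical test class; the name is not from the actual patch.

    def check(self, coder):
        def checker(input, expect):
            # Each explicit call is reported as its own subtest, so one
            # failing check does not hide the checks that follow it.
            with self.subTest(input=input, expect=expect):
                self.assertEqual(coder(input), (expect, len(input)))
        return checker

    def test_escape_decode(self):
        check = self.check(codecs.unicode_escape_decode)
        # The explicit calls stay unchanged; only the helper differs.
        check(b"[\\\n]", "[]")
        check(br'[\"]', '["]')
        check(br"[\']", "[']")
```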
History
Date User Action Args
2013-01-29 09:50:51 serhiy.storchaka set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2013-01-29 09:48:12 serhiy.storchaka set messages: + msg180897
2013-01-29 08:53:08 python-dev set nosy: + python-dev; messages: + msg180896
2013-01-29 00:33:37 ezio.melotti set messages: + msg180890
2013-01-28 14:20:52 serhiy.storchaka set messages: + msg180857
2013-01-25 22:58:19 serhiy.storchaka set files: + unicode_escape_decode_error_handling-2.7.patch, unicode_escape_decode_error_handling-3.2.patch, unicode_escape_decode_error_handling-3.3.patch, unicode_escape_decode_error_handling-3.4.patch; messages: + msg180634
2013-01-25 22:55:26 serhiy.storchaka set files: - unicode_escape_decode_error_handling-3.4.patch
2013-01-16 14:50:05 serhiy.storchaka set files: + unicode_escape_decode_error_handling-3.4.patch; messages: + msg180091; dependencies: + SystemError in codecs.unicode_escape_decode(); keywords: + patch; stage: needs patch -> patch review
2013-01-16 10:46:45 serhiy.storchaka create