classification
Title: UTF-16 incremental decoder doesn't support partial surrogate pair
Type: behavior Stage: resolved
Components: Library (Lib), Unicode Versions: Python 3.4, Python 3.3, Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: amaury.forgeotdarc, ezio.melotti, haypo, ply, python-dev, serhiy.storchaka
Priority: normal Keywords: needs review, patch

Created on 2011-03-10 10:19 by ply, last changed 2013-01-08 21:49 by serhiy.storchaka. This issue is now closed.

Files
File name | Uploaded | Description
testutf16.py | ply, 2011-03-10 10:19 | Error reproducing script
partial_utf16.patch | amaury.forgeotdarc, 2011-03-10 12:19 |
partial_utf16-3.3.patch | serhiy.storchaka, 2012-09-27 13:29 | Patch for 3.3
Messages (4)
msg130498 - Author: Yuriy Pilgun (ply) Date: 2011-03-10 10:19
Reading a UTF-16 text file with the 'codecs' module fails if a surrogate pair straddles the boundary of a 72-byte read chunk.

The attached Python script fails with the message:
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 70-71: unexpected end of data

The cause is that readline() splits the input data into chunks, namely
  readsize = size or 72
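
A minimal sketch of the failure mode described above, not the attached testutf16.py. It assumes Python 3 and uses the 'utf-16-le' codec with a hypothetical sample string, so that the high surrogate lands in bytes 70-71 of the first 72-byte chunk. On an interpreter that includes the fix, the incremental decoder buffers the lone high surrogate instead of raising UnicodeDecodeError.

```python
import codecs

# U+1F600 is encoded in UTF-16-LE as the surrogate pair D83D DE00 (4 bytes).
text = "a" * 35 + "\U0001F600" + "b"
data = text.encode("utf-16-le")

# Cut after 72 bytes: 70 bytes of "a" plus the 2-byte high surrogate.
first, rest = data[:72], data[72:]

decoder = codecs.getincrementaldecoder("utf-16-le")()
# A fixed decoder returns the 35 "a" characters and keeps the
# incomplete surrogate pair buffered until more input arrives.
head = decoder.decode(first)
tail = decoder.decode(rest, final=True)

assert head + tail == text
```

Passing final=True on the last call matters: it tells the decoder that no more data will follow, so a truly truncated pair is still reported as an error.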
msg130504 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2011-03-10 12:19
The utf16 incremental codec does not like incomplete surrogate pairs.
Patch attached.
I also plan to refactor all the test_partial() functions of test_codecs, to give them a common implementation.
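
A rough sketch of the kind of check a shared partial-input test could perform, assuming Python 3. The helper name decode_bytewise and the sample string are hypothetical, and this is not the actual test_codecs implementation. It feeds the encoded input one byte at a time, so every multi-byte sequence, including a surrogate pair, gets split across calls.

```python
import codecs

def decode_bytewise(encoding, text):
    """Encode text, then decode it again one byte at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for byte in text.encode(encoding):
        # Each call gets a single-byte chunk; the decoder must buffer
        # incomplete sequences rather than raise.
        pieces.append(decoder.decode(bytes([byte])))
    # Flush: final=True signals that no more input will follow.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)

# Sample mixing BMP characters with one non-BMP character (a surrogate pair in UTF-16).
sample = "\x00\xff\u0100\uffff\U00010000"
for name in ("utf-16-le", "utf-16-be", "utf-16", "utf-8"):
    assert decode_bytewise(name, sample) == sample
```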
msg171373 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-09-27 13:29
The UTF-16 decoder was significantly reworked in issue14624. Here is the patch adapted for 3.3.
msg179375 - Author: Roundup Robot (python-dev) Date: 2013-01-08 21:47
New changeset f2353e74b335 by Serhiy Storchaka in branch '2.7':
Issue #11461: Fix the incremental UTF-16 decoder. Original patch by
http://hg.python.org/cpython/rev/f2353e74b335

New changeset 4677c5f6fcf7 by Serhiy Storchaka in branch '3.2':
Issue #11461: Fix the incremental UTF-16 decoder. Original patch by
http://hg.python.org/cpython/rev/4677c5f6fcf7

New changeset eed1883b1974 by Serhiy Storchaka in branch '3.3':
Issue #11461: Fix the incremental UTF-16 decoder. Original patch by
http://hg.python.org/cpython/rev/eed1883b1974

New changeset 5e84d020d001 by Serhiy Storchaka in branch 'default':
Issue #11461: Fix the incremental UTF-16 decoder. Original patch by
http://hg.python.org/cpython/rev/5e84d020d001
History
Date | User | Action | Args
2013-01-08 21:49:54 | serhiy.storchaka | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved
2013-01-08 21:47:35 | python-dev | set | nosy: + python-dev; messages: + msg179375
2013-01-07 17:55:37 | serhiy.storchaka | set | assignee: serhiy.storchaka
2013-01-07 17:54:49 | serhiy.storchaka | link | issue15278 superseder
2012-09-27 13:29:41 | serhiy.storchaka | set | files: + partial_utf16-3.3.patch; nosy: + serhiy.storchaka; messages: + msg171373; keywords: + needs review
2012-09-26 20:07:36 | haypo | set | title: Reading UTF-16 with codecs.readline() breaks on surrogate pairs -> UTF-16 incremental decoder doesn't support partial surrogate pair
2012-09-26 20:06:08 | haypo | set | versions: + Python 3.2, Python 3.3, Python 3.4
2012-09-26 17:27:59 | ezio.melotti | set | stage: test needed -> patch review
2011-03-11 00:22:25 | pitrou | set | nosy: + haypo
2011-03-10 12:19:30 | amaury.forgeotdarc | set | files: + partial_utf16.patch; nosy: + amaury.forgeotdarc; messages: + msg130504; keywords: + patch
2011-03-10 10:23:42 | ezio.melotti | set | nosy: + ezio.melotti; stage: test needed
2011-03-10 10:19:57 | ply | create |