classification
Title: Reading UTF-16 with codecs.readline() breaks on surrogate pairs
Type: behavior Stage: test needed
Components: Library (Lib), Unicode Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, ezio.melotti, haypo, ply
Priority: normal Keywords: patch

Created on 2011-03-10 10:19 by ply, last changed 2011-03-11 00:22 by pitrou.

Files
File name            Uploaded                              Description
testutf16.py         ply, 2011-03-10 10:19                 Script reproducing the error
partial_utf16.patch  amaury.forgeotdarc, 2011-03-10 12:19
Messages (2)
msg130498 - Author: Yuriy Pilgun (ply) Date: 2011-03-10 10:19
Reading a UTF-16 text file with the 'codecs' module fails if a surrogate pair straddles the 72-byte read boundary.

The attached Python script fails with the message:
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 70-71: unexpected end of data

The cause is that readline() splits the input data into chunks, namely
  readsize = size or 72
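
A minimal sketch of the split, not the attached testutf16.py. It is written for Python 3 and assumes the 72 in readsize counts bytes, which matches positions 70-71 in the error message. The test string and the names READSIZE, first_chunk and decode_first_chunk_strictly are made up for illustration.

```python
import codecs

# Hypothetical data: BOM (2 bytes) + 34 ASCII chars (68 bytes) puts the
# high surrogate of U+10000 at bytes 70-71 and the low surrogate at 72-73.
text = "a" * 34 + "\U00010000" + "b\n"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

READSIZE = 72  # default chunk size quoted in the report
first_chunk = data[:READSIZE]  # ends right after the high surrogate


def decode_first_chunk_strictly():
    """Decode the first chunk as if it were the complete input."""
    try:
        first_chunk.decode("utf-16")
    except UnicodeDecodeError as exc:
        return exc
    return None


error = decode_first_chunk_strictly()
print(error)
```

The first chunk on its own is not valid UTF-16, so any reader that decodes it as complete input hits the error above.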
msg130504 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2011-03-10 12:19
The utf16 incremental codec does not handle an incomplete surrogate pair at the end of a chunk.
Patch attached.
I also plan to refactor all the test_partial() functions in test_codecs to give them a common implementation.
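
For illustration only (this is not the attached patch), a sketch of what an incremental decoder is expected to do with the split input: hold back the lone high surrogate until the next chunk arrives. It is written for Python 3 and assumes an interpreter whose utf-16 incremental decoder buffers incomplete pairs; on the affected 2.7 the first call would raise instead. The test string is hypothetical.

```python
import codecs

# Hypothetical data with the surrogate pair of U+10000 split at byte 72.
text = "a" * 34 + "\U00010000" + "b\n"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

decoder = codecs.getincrementaldecoder("utf-16")()

# final=False tells the decoder that more input may follow, so trailing
# bytes that do not yet form a complete character should be kept back.
head = decoder.decode(data[:72], final=False)

# The second chunk supplies the low surrogate and the rest of the line.
tail = decoder.decode(data[72:], final=True)

print(len(head), repr(tail))
```

With buffering in place, the two pieces concatenate back to the original text regardless of where the chunk boundary falls.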
History
Date                 User                Action  Args
2011-03-11 00:22:25  pitrou              set     nosy: + haypo
2011-03-10 12:19:30  amaury.forgeotdarc  set     files: + partial_utf16.patch; nosy: + amaury.forgeotdarc; messages: + msg130504; keywords: + patch
2011-03-10 10:23:42  ezio.melotti        set     nosy: + ezio.melotti; stage: test needed
2011-03-10 10:19:57  ply                 create