This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Title: reading UTF16-encoded text file crashes if \r on 64-char boundary
Type: crash Stage:
Components: Interpreter Core Versions: Python 3.0
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: pitrou Nosy List: gregory.p.smith, pitrou, sjmachin, vstinner
Priority: normal Keywords: patch

Created on 2008-12-07 11:00 by sjmachin, last changed 2022-04-11 14:56 by admin. This issue is now closed.

File name Uploaded Description Edit sjmachin, 2008-12-07 11:00 This stand-alone script illustrates the problems. vstinner, 2008-12-09 00:25
test_io.patch vstinner, 2008-12-09 01:53
incremental_newline_decoder-2.patch vstinner, 2008-12-09 02:08
utf16_newlines.patch pitrou, 2008-12-13 18:14
utf16_newlines2.patch pitrou, 2008-12-13 18:30
Messages (11)
msg77219 - (view) Author: John Machin (sjmachin) Date: 2008-12-07 11:00
Problem in the newline handling in, class
IncrementalNewlineDecoder, method decode. It reads text files in 128-
byte chunks. Converting CR LF to \n requires special case handling
when '\r' is detected at the end of the decoded chunk in case
there's an LF at the start of the next chunk. It prepends b'\r' (only 1
byte) to the next chunk's raw bytes and decodes that. But \r in UTF-16
takes 2 bytes; we are now 1 byte out of kilter and various failures are
possible (including silently producing garbage output from a truncated
file with an odd number of bytes).

The attached script illustrates the problems.
msg77377 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-12-09 00:20
The bug is in IncrementalNewlineDecoder, not in the codec nor 
msg77378 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-12-09 00:24
Smaller example to demonstrate the problem.
msg77382 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-12-09 01:53
Here is a patch for check the problem by adding new 
encodings to TextIOWrapperTest.testNewlines().
msg77384 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-12-09 02:08
Ugly patch to fix this issue:
 - add more regression tests for charsets UTF-16*, UTF-32*
 - add mandatory argument "encoding" to io.IncrementalNewlineDecoder 
constructor => BREAK THE API
 - use the encoding the encode "\r"
 - most ulgy hack: strip the BOM for codecs "UTF-16" and "UTF-32" (when 
encoding "\r" to bytes) => I don't know how to encode "\r" without the 
msg77755 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-13 17:03
A couple of suggestions:

- if IncrementalNewlineDecoder gets an encoding argument, it can also
instantiate the decoder itself; that way the API is a bit simpler

- to encode '\r' without the BOM, you can e.g. use an incremental
encoder and encode it twice:
>>> enc = codecs.getincrementalencoder('utf16')('strict')
>>> enc.encode('\r')
>>> enc.encode('\r')

I think breaking the API can be ok since the original API is broken
(witness this bug). There can even be a compatibility mode where we
check whether `encoding` has an encode() method, and if it has, take it
as the decoder object rather than the encoding name.
msg77757 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-13 18:14
Here is a simpler patch with a different approach and a lot of tests.
The advantage is that it doesn't break the API.
msg77758 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-13 18:30
This new variant also removes the dangerous hack in getstate / setstate.
msg77760 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-12-13 18:51
utf16_newlines2.patch looks good to me.

This is a data corruption issue.  If it is deferred for 3.0.1 it must be 
fixed in 3.0.2.

+1 on putting this in 3.0.1.
msg77808 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-14 17:10
Committed to py3k and release30-maint in r67760 and r67759. Needs
backporting to 2.x.
msg77813 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-14 18:11
Backported to trunk and 2.6.2 in r67762 and r67764.
Date User Action Args
2022-04-11 14:56:42adminsetgithub: 48824
2008-12-14 18:11:50pitrousetstatus: open -> closed
resolution: fixed
messages: + msg77813
2008-12-14 17:10:51pitrousetpriority: release blocker -> normal
messages: + msg77808
2008-12-13 18:51:45gregory.p.smithsetpriority: release blocker
assignee: pitrou
messages: + msg77760
nosy: + gregory.p.smith
2008-12-13 18:30:32pitrousetfiles: + utf16_newlines2.patch
messages: + msg77758
2008-12-13 18:14:06pitrousetfiles: + utf16_newlines.patch
messages: + msg77757
2008-12-13 17:03:06pitrousetmessages: + msg77755
2008-12-09 02:08:26vstinnersetfiles: + incremental_newline_decoder-2.patch
messages: + msg77384
2008-12-09 01:53:27vstinnersetfiles: + test_io.patch
keywords: + patch
messages: + msg77382
2008-12-09 00:25:11vstinnersetfiles: +
2008-12-09 00:24:44vstinnersetfiles: -
2008-12-09 00:24:05vstinnersetfiles: +
messages: + msg77378
2008-12-09 00:20:21vstinnersetnosy: + vstinner
messages: + msg77377
2008-12-07 11:43:09pitrousetnosy: + pitrou
2008-12-07 11:00:34sjmachincreate