classification
Title: bytes and unicode splitlines() methods differ on what is a line break
Versions: Python 3.6, Python 3.4, Python 3.5, Python 2.7

process
Status: closed
Resolution: duplicate
Superseder: str.splitlines splitting on non-\r\n characters (issue 22232)
Nosy List: gregory.p.smith, martin.panter, steven.daprano
Priority: normal

Created on 2015-07-10 02:18 by gregory.p.smith, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (4)
msg246538 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-07-10 02:18
for bytes, \v (0x0b) is not considered a line break.  for unicode, it is.

this traces back to the Objects/stringlib/ code where unicode defers to the decision made by Objects/unicodeobject.c's ascii_linebreak table which contains 7 line breaks in the 0..127 character range:

static unsigned char ascii_linebreak[] = {
    0, 0, 0, 0, 0, 0, 0, 0,
/*         0x000A, * LINE FEED */
/*         0x000B, * LINE TABULATION */
/*         0x000C, * FORM FEED */
/*         0x000D, * CARRIAGE RETURN */
    0, 0, 1, 1, 1, 1, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
/*         0x001C, * FILE SEPARATOR */
/*         0x001D, * GROUP SEPARATOR */
/*         0x001E, * RECORD SEPARATOR */
    0, 0, 0, 0, 1, 1, 1, 0,


Whereas Objects/stringlib/stringdefs.h, used by bytes, only considers \r and \n.
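
A quick way to see the difference from the interpreter is to probe every ASCII code point with both methods. This is only a sketch, assuming Python 3; the helper names are made up for illustration.

```python
# Probe which ASCII code points each splitlines() treats as a line break.
# Assumes Python 3; splits_str and splits_bytes are illustrative names.

def splits_str(code):
    # True if str.splitlines() breaks at this code point.
    return len(("a" + chr(code) + "b").splitlines()) == 2

def splits_bytes(code):
    # True if bytes.splitlines() breaks at this byte value.
    return len((b"a" + bytes([code]) + b"b").splitlines()) == 2

str_breaks = [c for c in range(128) if splits_str(c)]
bytes_breaks = [c for c in range(128) if splits_bytes(c)]

print("str:  ", [hex(c) for c in str_breaks])
print("bytes:", [hex(c) for c in bytes_breaks])
```

The str list should match the seven entries set in the ascii_linebreak table above, while the bytes list should contain only \n and \r.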

I think these should be consistent.  But making this change likely breaks existing code in weird ways.

This comes up when porting from 2 to 3: a str ('') value containing one of those other characters was not split by splitlines in 2.x, but is split by splitlines in 3.x.
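
For ported code that relied on the 2.x str behaviour, one possible workaround is to split on \r\n, \r and \n only. The helper below is hypothetical (splitlines_2x is not a stdlib function) and does not support the keepends argument; it is a sketch rather than a drop-in replacement.

```python
import re

# Only the line endings that bytes.splitlines() recognises.
_ASCII_NEWLINES = re.compile(r"\r\n|\r|\n")

def splitlines_2x(text):
    """Hypothetical helper: split a Python 3 str only on \\r\\n, \\r and \\n."""
    parts = _ASCII_NEWLINES.split(text)
    # splitlines() does not yield a trailing empty item after a final newline,
    # and returns [] for the empty string.
    if parts and parts[-1] == "":
        parts.pop()
    return parts

sample = "one\x0btwo\nthree\r\nfour\n"
print(splitlines_2x(sample))   # \x0b stays inside the first line
print(sample.splitlines())     # str.splitlines() also breaks at \x0b
```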
msg246539 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2015-07-10 03:02
On Fri, Jul 10, 2015 at 02:18:33AM +0000, Gregory P. Smith wrote:

> for bytes, \v (0x0b) is not considered a line break.  for unicode, it is.
[...]
> I think these should be consistent.

I'm not sure that they should. Unicode includes other line breaks which 
bytes should not consider line breaks, such as NEL (Next Line), U+0085. 
Why should bytes be consistent with only the subset of line breaks that 
are in ASCII?
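
To illustrate the NEL case, here is a small sketch assuming Python 3. The variable names are only for illustration.

```python
# U+0085 (NEL) is a line break for str, but its encoded forms are not
# line breaks for bytes, whichever encoding is used.
nel_text = "first\u0085second"

str_lines = nel_text.splitlines()

latin1_lines = nel_text.encode("latin-1").splitlines()  # NEL is the single byte 0x85
utf8_lines = nel_text.encode("utf-8").splitlines()      # NEL is the bytes 0xC2 0x85

print(str_lines)
print(latin1_lines)
print(utf8_lines)
```

Since bytes carries no encoding, bytes.splitlines() cannot know whether 0x85 is NEL or part of some other character, which is one reason the two methods cannot agree on the full Unicode set.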
msg246549 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-07-10 08:19
* Issue 7643: Originally a complaint about the difference, but was closed after adding more differences!
* Issue 22232: Documentation bug, but with some discussion on changing the API. Maybe a duplicate?
* Issue 22233: Email and HTTP message parsing bug related to incorrectly using splitlines()
* Issue 18291: codecs.StreamReader uses splitlines(), but io.TextIOWrapper uses universal newlines
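
The universal newlines versus splitlines() difference mentioned in the last item can be sketched as follows, assuming Python 3 and the default newline=None for io.TextIOWrapper. The sample data is made up.

```python
import io

data = b"one\x0btwo\r\nthree\n"

# Universal newlines: only \n, \r and \r\n end a line, and they are
# translated to \n on reading.
wrapper = io.TextIOWrapper(io.BytesIO(data), encoding="ascii")
file_lines = wrapper.readlines()

# str.splitlines() additionally breaks at \x0b (and the other characters
# discussed in this issue).
split_lines = data.decode("ascii").splitlines()

print(file_lines)
print(split_lines)
```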
msg246568 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-07-10 16:52
hah, i should've searched the tracker first.  looks like the other open issues cover this.
History
Date                 User             Action  Args
2022-04-11 14:58:18  admin            set     github: 68789
2015-07-10 16:52:40  gregory.p.smith  set     status: open -> closed
                                              versions: + Python 2.7, Python 3.4, Python 3.5, Python 3.6
                                              superseder: str.splitlines splitting on non-\r\n characters
                                              messages: + msg246568
                                              resolution: duplicate
2015-07-10 08:19:59  martin.panter    set     nosy: + martin.panter
                                              messages: + msg246549
2015-07-10 03:02:41  steven.daprano   set     nosy: + steven.daprano
                                              messages: + msg246539
2015-07-10 02:18:33  gregory.p.smith  create