Issue 24601: bytes and unicode splitlines() methods differ on what is a line break

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68789

classification

Title:	bytes and unicode splitlines() methods differ on what is a line break
Type:		Stage:
Components:		Versions:	Python 3.6, Python 3.4, Python 3.5, Python 2.7

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	str.splitlines splitting on non-\r\n characters View: 22232
Assigned To:		Nosy List:	gregory.p.smith, martin.panter, steven.daprano
Priority:	normal	Keywords:

Created on 2015-07-10 02:18 by gregory.p.smith, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (4)
msg246538 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2015-07-10 02:18
for bytes, \v (0x0b) is not considered a line break. for unicode, it is.

this traces back to the Objects/stringlib/ code where unicode defers to the decision made by Objects/unicodeobject.c's ascii_linebreak table, which contains 7 line breaks in the 0..127 character range:

static unsigned char ascii_linebreak[] = {
    0, 0, 0, 0, 0, 0, 0, 0,
/*         0x000A, * LINE FEED */
/*         0x000B, * LINE TABULATION */
/*         0x000C, * FORM FEED */
/*         0x000D, * CARRIAGE RETURN */
    0, 0, 1, 1, 1, 1, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
/*         0x001C, * FILE SEPARATOR */
/*         0x001D, * GROUP SEPARATOR */
/*         0x001E, * RECORD SEPARATOR */
    0, 0, 0, 0, 1, 1, 1, 0,

Whereas Objects/stringlib/stringdefs.h, used by bytes, only considers \r and \n.

I think these should be consistent. But making this change likely breaks existing code in weird ways. This does come up when porting from 2 to 3, as a str '' type with one of those other characters in it was not broken by splitlines in 2.x but is broken by splitlines in 3.x.
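The difference described above can be checked from Python 3 without reading the C source. The following is a minimal sketch, not part of the original report; the variable names `str_breaks` and `bytes_breaks` are made up for illustration. It probes every code point in the 0..127 range and records which ones splitlines() treats as a line boundary for str and for bytes.

```python
# Probe each ASCII code point: if "a<c>b" splits into two pieces,
# then <c> is treated as a line break by that type's splitlines().
str_breaks = [
    c for c in range(128)
    if len(("a" + chr(c) + "b").splitlines()) == 2
]
bytes_breaks = [
    c for c in range(128)
    if len((b"a" + bytes([c]) + b"b").splitlines()) == 2
]

print([hex(c) for c in str_breaks])    # the 7 entries set to 1 in ascii_linebreak
print([hex(c) for c in bytes_breaks])  # only \n and \r

# The porting hazard: the same content splits differently by type.
text = "a\x0bb\nc"
print(text.splitlines())                  # ['a', 'b', 'c']
print(text.encode("ascii").splitlines())  # [b'a\x0bb', b'c']
```

On Python 3 the str list should hold the seven code points named in the table (0x0A to 0x0D and 0x1C to 0x1E), while the bytes list holds only 0x0A and 0x0D.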
msg246539 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2015-07-10 03:02
On Fri, Jul 10, 2015 at 02:18:33AM +0000, Gregory P. Smith wrote:
> for bytes, \v (0x0b) is not considered a line break. for unicode, it is.
[...]
> I think these should be consistent.

I'm not sure that they should. Unicode includes other line breaks which bytes should not consider line breaks, such as NEL (Next Line), U+0085. Why should bytes be consistent with only the subset of line breaks that are in ASCII?
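To illustrate the NEL point, here is a small sketch (an added example, with hypothetical variable names) showing that str.splitlines() splits on U+0085, while the encoded bytes are left alone whether the text is encoded as Latin-1 or UTF-8. In a bytes object, 0x85 has no fixed meaning, which is the core of the objection.

```python
text = "one\u0085two"           # NEL (Next Line) between the two words

latin1 = text.encode("latin-1")  # b'one\x85two'
utf8 = text.encode("utf-8")      # b'one\xc2\x85two'

print(text.splitlines())    # ['one', 'two']
print(latin1.splitlines())  # [b'one\x85two']
print(utf8.splitlines())    # [b'one\xc2\x85two']
```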
msg246549 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-07-10 08:19
* Issue 7643: Originally a complaint about the difference, but was closed after adding more differences!
* Issue 22232: Documentation bug, but with some discussion on changing the API. Maybe a duplicate?
* Issue 22233: Email and HTTP message parsing bug related to incorrectly using splitlines()
* Issue 18291: codecs.StreamReader uses splitlines(), but io.TextIOWrapper uses universal newlines
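The last item in the list (Issue 18291) can be sketched as follows. This is an added illustration, assuming a CPython where codecs.StreamReader still relies on str.splitlines(); the sample data and variable names are made up. Both readers decode the same bytes, which contain a vertical tab inside a single "\n"-terminated line.

```python
import codecs
import io

raw = b"a\x0bb\n"  # one "\n"-terminated line containing a vertical tab

# codecs.StreamReader splits decoded text with str.splitlines()
codec_reader = codecs.getreader("utf-8")(io.BytesIO(raw))
codec_lines = codec_reader.readlines()

# io.TextIOWrapper uses universal newlines: only \n, \r and \r\n end a line
text_reader = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
text_lines = text_reader.readlines()

print(len(codec_lines), codec_lines)
print(len(text_lines), text_lines)
```

Under that assumption the codecs reader reports two lines and the TextIOWrapper reports one, so code that swaps one reader for the other can see different line counts for the same input.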
msg246568 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2015-07-10 16:52
hah, i should've searched the tracker first. looks like the other open issues cover this.

History
Date	User	Action	Args
2022-04-11 14:58:18	admin	set	github: 68789
2015-07-10 16:52:40	gregory.p.smith	set	status: open -> closed versions: + Python 2.7, Python 3.4, Python 3.5, Python 3.6 superseder: str.splitlines splitting on non-\r\n characters messages: + msg246568 resolution: duplicate
2015-07-10 08:19:59	martin.panter	set	nosy: + martin.panter messages: + msg246549
2015-07-10 03:02:41	steven.daprano	set	nosy: + steven.daprano messages: + msg246539
2015-07-10 02:18:33	gregory.p.smith	create