Message 246538 - Python tracker

Author	gregory.p.smith
Recipients	gregory.p.smith
Date	2015-07-10.02:18:32
Message-id	<1436494713.33.0.435726120833.issue24601@psf.upfronthosting.co.za>
In-reply-to

Content
For bytes, \v (0x0b) is not considered a line break by splitlines().  For unicode (str), it is.

This traces back to the Objects/stringlib/ code, where unicode defers to the decision made by Objects/unicodeobject.c's ascii_linebreak table, which contains 7 line breaks in the 0..127 character range:

static unsigned char ascii_linebreak[] = {
    0, 0, 0, 0, 0, 0, 0, 0,
/*         0x000A, * LINE FEED */
/*         0x000B, * LINE TABULATION */
/*         0x000C, * FORM FEED */
/*         0x000D, * CARRIAGE RETURN */
    0, 0, 1, 1, 1, 1, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0,
/*         0x001C, * FILE SEPARATOR */
/*         0x001D, * GROUP SEPARATOR */
/*         0x001E, * RECORD SEPARATOR */
    0, 0, 0, 0, 1, 1, 1, 0,


Whereas Objects/stringlib/stringdefs.h, used by bytes, only considers \r and \n.
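
A quick way to see the difference from Python itself is to probe every code point in 0..127 and record which ones splitlines() treats as a boundary. This is only a sketch, assuming Python 3; the helper name ascii_breaks is made up for illustration and is not part of CPython.

```python
def ascii_breaks(make):
    """Return the code points in 0..127 that splitlines() treats as a line break.

    `make` builds the sample "a<cp>b" as either str or bytes.
    """
    found = []
    for cp in range(128):
        sample = make(cp)
        # The sample splits into two pieces only when <cp> is a line break.
        if len(sample.splitlines()) == 2:
            found.append(cp)
    return found


str_breaks = ascii_breaks(lambda cp: "a" + chr(cp) + "b")
bytes_breaks = ascii_breaks(lambda cp: b"a" + bytes([cp]) + b"b")

print("str:  ", [hex(cp) for cp in str_breaks])
print("bytes:", [hex(cp) for cp in bytes_breaks])
```

On 3.x the str list should match the seven entries marked in ascii_linebreak, while the bytes list should contain only \n and \r.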

I think these should be consistent.  But making this change likely breaks existing code in weird ways.

This does come up when porting from 2 to 3: a '' literal (a byte str in 2.x) containing one of those other characters was not split at that character by splitlines in 2.x, but as a unicode str in 3.x it is.
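
For code being ported that relied on the 2.x behavior, one possible workaround is to split on \r\n, \r and \n explicitly instead of calling str.splitlines(). The following is a minimal sketch, not something proposed in this issue; splitlines_2x is a hypothetical helper name.

```python
import re

# Only the line endings that 2.x str.splitlines() (and 3.x bytes.splitlines())
# recognize: \r\n, \r and \n.
_NEWLINE = re.compile(r"\r\n|\r|\n")


def splitlines_2x(text, keepends=False):
    """Split a 3.x str the way 2.x str.splitlines() did (hypothetical helper)."""
    lines = []
    start = 0
    for match in _NEWLINE.finditer(text):
        # Keep or drop the line ending itself, as splitlines(keepends) does.
        end = match.end() if keepends else match.start()
        lines.append(text[start:end])
        start = match.end()
    # A final line without a terminator is still a line; an empty tail is not.
    if start < len(text):
        lines.append(text[start:])
    return lines


sample = "a\x0bb\nc"
print(splitlines_2x(sample))   # \v stays inside the first line
print(sample.splitlines())     # 3.x str.splitlines() also splits at \v
```

The trailing-line handling is written to mirror splitlines(), so "a\nb\n" yields two lines rather than three.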

History
Date	User	Action	Args
2015-07-10 02:18:33	gregory.p.smith	set	recipients: + gregory.p.smith
2015-07-10 02:18:33	gregory.p.smith	set	messageid: <1436494713.33.0.435726120833.issue24601@psf.upfronthosting.co.za>
2015-07-10 02:18:33	gregory.p.smith	link	issue24601 messages
2015-07-10 02:18:32	gregory.p.smith	create