Issue 6664: readlines should understand Line Separator and Paragraph Separator characters

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50913

classification

Title:	readlines should understand Line Separator and Paragraph Separator characters
Type:	behavior	Stage:	needs patch
Components:	IO	Versions:	Python 3.2

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	benjamin.peterson, jfinkels, lemburg, nyamatongwe, pitrou, vstinner
Priority:	normal	Keywords:	patch

Created on 2009-08-07 09:14 by nyamatongwe, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
lineends.py	nyamatongwe, 2009-08-07 09:14	Program demonstrating file containing Paragraph Separator
issue6664.testcase.patch	jfinkels, 2010-10-01 16:20	Failing test case.

Messages (3)
msg91397	Author: Neil Hodgson (nyamatongwe)	Date: 2009-08-07 09:14
Unicode includes Line Separator U+2028 and Paragraph Separator U+2029 line ending characters. The readlines method of the file object returned by the built-in open does not treat these characters as line ends although the object returned by codecs.open(..., encoding='utf-8') does. The attached program creates a UTF-8 file containing three lines with the second line ended with a Paragraph Separator. The program then reads this file back in as a text file. Only two lines are seen when reading the file back in. The desired behaviour is for the file to be read in as three lines.
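The behaviour described in this report can be reproduced with a short script. The following is only a minimal sketch, not the attached lineends.py: the helper name read_with_builtin_open and the sample text are assumptions made for illustration.

```python
import os
import tempfile

# Three logical lines; the second one ends with U+2029 (Paragraph Separator).
# The sample text is hypothetical, chosen only to mirror the report.
TEXT = "first\nsecond\u2029third\n"


def read_with_builtin_open(text):
    """Write text as UTF-8 to a temporary file and return open().readlines()."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        # Write raw bytes so no newline translation happens on the way out.
        with os.fdopen(fd, "wb") as f:
            f.write(text.encode("utf-8"))
        # Read back as a text file with the built-in open().
        with open(path, encoding="utf-8") as f:
            return f.readlines()
    finally:
        os.remove(path)


lines = read_with_builtin_open(TEXT)
print(len(lines))                       # 2, not the 3 the reporter wanted
print([ascii(line) for line in lines])  # U+2029 stays inside the second item
```

The Paragraph Separator is kept in the data but is not treated as a line end, so the second and third logical lines come back as a single item.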
msg117812	Author: Jeffrey Finkelstein (jfinkels)	Date: 2010-10-01 16:20
This seems to be because the codecs.StreamReader.readlines() function does this:

    def readlines(self, sizehint=None, keepends=True):
        data = self.read()
        return data.splitlines(keepends)

But the io readlines() functions make multiple calls to readline() instead. Here is the test case which passes on the codecs readlines() but fails on the io readlines().
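The difference between the two readlines() implementations can be seen side by side on an in-memory stream. This is a rough sketch and not the attached issue6664.testcase.patch; the names DATA, codecs_lines and io_lines are assumptions made for illustration.

```python
import codecs
import io

# Hypothetical sample: the second line ends with U+2029 (Paragraph Separator).
DATA = "first\nsecond\u2029third\n".encode("utf-8")

# codecs.StreamReader.readlines() reads everything, then calls str.splitlines().
codecs_lines = codecs.getreader("utf-8")(io.BytesIO(DATA)).readlines()

# io.TextIOWrapper.readlines() goes through readline(), which only
# recognizes \n, \r and \r\n as line ends.
io_lines = io.TextIOWrapper(io.BytesIO(DATA), encoding="utf-8").readlines()

print(len(codecs_lines))  # 3
print(len(io_lines))      # 2
```

codecs.getreader("utf-8") is used here instead of codecs.open() so that no file on disk is needed; both return a StreamReader-based object whose readlines() behaves as quoted above.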
msg125230	Author: Antoine Pitrou (pitrou)	Date: 2011-01-03 20:22
By design, readlines() only recognizes those characters which are official line separators on various OSes (\n, \r, \r\n). This is important for proper parsing of log files, internet protocols, etc. If you want to split on all line separators recognized by the Unicode specification, use str.splitlines().
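The suggested workaround could look something like the sketch below. The helper read_unicode_lines is a hypothetical name, not part of the io module, and it assumes the whole stream fits in memory.

```python
import io


def read_unicode_lines(stream, keepends=True):
    """Read all text from stream and split it with str.splitlines().

    Hypothetical helper: unlike readlines(), it loads the whole stream
    into memory before splitting.
    """
    return stream.read().splitlines(keepends)


# Hypothetical sample containing both U+2028 and U+2029.
TEXT = "first\u2028second\u2029third\n"

readlines_result = io.StringIO(TEXT).readlines()
splitlines_result = read_unicode_lines(io.StringIO(TEXT))

print(len(readlines_result))   # 1: only \n ends a line
print(len(splitlines_result))  # 3: U+2028 and U+2029 also end a line
```

Note that str.splitlines() also splits on characters such as \v, \f, \x1c, \x1d, \x1e and \x85, so it may break lines in places a log file or protocol parser would not expect.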

History
Date	User	Action	Args
2022-04-11 14:56:51	admin	set	github: 50913
2011-01-03 20:22:01	pitrou	set	status: open -> closed messages: + msg125230 resolution: rejected nosy: lemburg, nyamatongwe, pitrou, vstinner, benjamin.peterson, jfinkels
2010-10-01 16:20:57	jfinkels	set	files: + issue6664.testcase.patch nosy: + jfinkels messages: + msg117812 keywords: + patch
2010-08-01 10:53:06	pitrou	set	nosy: + lemburg, vstinner versions: - Python 3.1, Python 2.7
2010-08-01 10:40:45	BreamoreBoy	set	nosy: + pitrou, benjamin.peterson stage: needs patch type: behavior versions: + Python 2.7, Python 3.2
2009-08-07 09:14:13	nyamatongwe	create