Issue 9593: utf8 codec readlines error after "\x85 "

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53802

classification

Title:	utf8 codec readlines error after "\x85 "
Type:	behavior	Stage:
Components:	Interpreter Core, IO, Unicode	Versions:	Python 2.7

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, jcope, pitrou
Priority:	normal	Keywords:

Created on 2010-08-13 19:27 by jcope, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
ErrorProof-utf8-x85.py	jcope, 2010-08-13 19:27

Messages (5)
msg113818 - (view)	Author: Joseph Copenhaver (jcope)	Date: 2010-08-13 19:27
The IO readlines() facility incorrectly processes utf8 files for some unknown reason. Specifically, the call generates too many entries in the resulting lines array after a character sequence "\x85 blah", which gets cut as ("\x85 ","blah") according to the resultant array. My workaround for this issue is not elegant, especially since I need the newline characters:

    #BEGIN: WTF
    a_str_whole = fs_in.read()
    fs_in.close()
    a_str_lines = a_str_whole.split("\n")
    for idx in range(0,len(a_str_lines)-1):
        a_str_lines[idx]+="\n"
    #END: WTF

Attached is an example script that defines the problem clearly.
msg113821 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-08-13 19:37
U+0085 corresponds to a line terminator (*), and codecs.open() observes this convention. Do note that the new io.open() (or the built-in open() in 3.x) only recognizes '\r' and '\n' as line separators. In any case, changing this behaviour would break compatibility, therefore I'm rejecting the issue.

(*) http://en.wikipedia.org/wiki/Newline#Unicode
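
For readers who want to see the difference side by side, here is a minimal sketch (not the reporter's attached script). It feeds the same UTF-8 bytes through a codecs stream reader, which is the kind of object codecs.open() wraps around a file, and through the io text layer. The sample string and the variable names are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import codecs
import io

# Hypothetical sample data: U+0085 (NEL) followed by a space, then normal text.
data = u"one\u0085 two\nthree\n".encode("utf-8")

# codecs stream reader: readlines() decodes everything, then splits with
# unicode splitlines(), which treats U+0085 as a line boundary.
codecs_lines = codecs.getreader("utf-8")(io.BytesIO(data)).readlines()

# io text layer (io.open() / built-in open() in 3.x): with the default
# newline handling, only '\n', '\r' and '\r\n' end a line.
io_lines = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8").readlines()

print(codecs_lines)  # three entries: split after U+0085
print(io_lines)      # two entries: U+0085 stays inside the first line
```

With this input the codecs reader should return three entries (the first ending in U+0085), while the io layer returns two.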
msg113823 - (view)	Author: Joseph Copenhaver (jcope)	Date: 2010-08-13 20:13
I now recognize the issue was in regard to format problems and not python, but the area where this code will be used requires the use of the codecs module. Is there any way to get the efficiency of codecs I/O readlines() chunking behavior and specify a list of characters to use? Can the file delimiter be changed in python as in perl? Thanks for the quick response.
msg113827 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-08-13 20:29
> Is there any way to get the efficiency of codecs I/O readlines()
> chunking behavior and specify a list of characters to use? Can the
> file delimiter be changed in python as in perl?

No, but you can use readlines() from the standard open() function (which will give you 8-bit strings), and then decode individual lines yourself.
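
The suggestion above could look roughly like the following sketch. The helper name read_utf8_lines is hypothetical, and an in-memory byte stream stands in for a file opened in binary mode (for example open(path, "rb")). Splitting at the byte level is safe for UTF-8 because the byte 0x0A never occurs inside a multi-byte sequence.

```python
import io

def read_utf8_lines(stream):
    # Byte-level readlines() splits only on the newline byte (0x0A);
    # each raw line is then decoded to text on its own.
    return [raw.decode("utf-8") for raw in stream.readlines()]

# Hypothetical sample: the encoded form of U+0085 is the byte pair C2 85,
# which byte-level line splitting does not treat specially.
sample = io.BytesIO(u"one\u0085 two\nthree\n".encode("utf-8"))
lines = read_utf8_lines(sample)

print(lines)  # two entries, each keeping its trailing "\n"
```

Note that this approach leaves any "\r\n" sequences untouched, since no newline translation happens in binary mode.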
msg113829 - (view)	Author: Joseph Copenhaver (jcope)	Date: 2010-08-13 20:54
It is better, thanks.

History
Date	User	Action	Args
2022-04-11 14:57:05	admin	set	github: 53802
2010-08-13 20:54:53	jcope	set	messages: + msg113829
2010-08-13 20:29:21	pitrou	set	messages: + msg113827
2010-08-13 20:13:07	jcope	set	messages: + msg113823
2010-08-13 19:37:49	pitrou	set	status: open -> closed nosy: + pitrou messages: + msg113821 resolution: rejected
2010-08-13 19:29:33	ezio.melotti	set	nosy: + ezio.melotti components: - Regular Expressions
2010-08-13 19:27:02	jcope	create