Issue 18291: codecs.open interprets FS, RS, GS as line ends

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62491

classification

Title:	codecs.open interprets FS, RS, GS as line ends
Type:	behavior	Stage:	patch review
Components:	IO, Unicode	Versions:	Python 3.8, Python 3.7, Python 3.6, Python 2.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	belopolsky, doerwalter, ezio.melotti, lemburg, nascheme, r.david.murray, serhiy.storchaka, vstinner, wpk, xtreak
Priority:	normal	Keywords:

Created on 2013-06-24 13:11 by wpk, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description
codecs-io-example.py	wpk, 2013-06-24 15:17	Compare UTF-8 file reading time between io.open and codecs.open.
codecs_splitlines.txt	nascheme, 2018-10-05 00:20
str_splitlines.txt	nascheme, 2018-10-05 04:44	change str.splitlines to use only \r and \n

Pull Requests
URL	Status	Linked
PR 9711	open	serhiy.storchaka, 2018-10-05 11:22

Messages (21)
msg191758 - (view)	Author: Paul (wpk)	Date: 2013-06-24 13:11
I hope I am writing in the right place.

When using codecs.open with UTF-8 encoding, it seems characters \x12, \x13, and \x14 are interpreted as end-of-line.

Example code:

>>> with open('unicodetest.txt', 'w') as f:
...     f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')
>>> with open('unicodetest.txt', 'r') as f:
...     for i,l in enumerate(f):
...         print i, l
0 a\x12b\x13c\x14d\x15e

The point here is that it reads it as one line, as I would expect. But using codecs.open with UTF-8 encoding it reads it as many lines:

>>> import codecs
>>> with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
...     for i,l in enumerate(f):
...         print i, l
0 a\x12
1 b\x13
2 c\x14
3 d\x15e

The characters \x12 through \x15 are described as "Information Separator Four" through "One" (in that order). As far as I can see they never mark line ends. Also interestingly, \x15 isn't interpreted as such.

As a sidenote, I tested and verified that io.open is correct (but when reading loads of data it appears to be 5 times slower than codecs):

>>> import io
>>> with io.open('unicodetest.txt', encoding='UTF-8') as f:
...     for i,l in enumerate(f):
...         print i, l
0 a\x12b\x13c\x14d\x15e
msg191769 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-06-24 14:32
Could you please provide an example that exposes the slowness of io.open() compared with codecs.open()?
msg191778 - (view)	Author: Paul (wpk)	Date: 2013-06-24 15:17
Sorry for bringing that up, as I suppose it is unrelated to the bug I am reporting, but you can find an example file attached with timings.
msg191784 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-06-24 16:33
Is the "slower" test on 2.6? io would definitely be slower there, since it is pure python. 2.7 has the C accelerated version.
msg191848 - (view)	Author: Paul (wpk)	Date: 2013-06-25 11:11
You're absolutely right. I tested it on another machine now, with Python 2.7.3 installed and it is actually twice as fast as codecs. Thanks. So I guess there is little interest in fixing codecs because io is the preferred package for reading unicode files.
msg191849 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-06-25 11:33
I guess Victor has an interest. ;)
msg191853 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-06-25 12:20
>> So I guess there is little interest in fixing codecs because io is the
>> preferred package for reading unicode files.
> I guess Victor has an interest. ;)

Ah ah, good joke. I wrote PEP 400: http://www.python.org/dev/peps/pep-0400/

And yes, for best performance, you have to choose between the codecs and io modules depending on the Python version. The PEP suggests always using the io module because it has more features, like universal newlines, and fewer bugs.
msg192118 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-07-01 09:41
Contrary to the documentation, str.splitlines() splits lines not only on '\n', '\r\n' and '\r'.

>>> 'a'.join(chr(i) for i in range(32)).splitlines(True)
['\x00a\x01a\x02a\x03a\x04a\x05a\x06a\x07a\x08a\ta\n', 'a\x0b', 'a\x0c', 'a\r', 'a\x0ea\x0fa\x10a\x11a\x12a\x13a\x14a\x15a\x16a\x17a\x18a\x19a\x1aa\x1ba\x1c', 'a\x1d', 'a\x1e', 'a\x1f']
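To see the full set of characters involved rather than only the first 32, one can probe str.splitlines() one code point at a time. The following is a minimal sketch for Python 3; the helper name line_boundary_code_points and the 0x3000 cut-off are illustrative choices, not part of the report.

```python
def line_boundary_code_points(limit):
    """Return the code points below `limit` that str.splitlines() splits on."""
    found = []
    for cp in range(limit):
        # "x" + ch + "y" yields two pieces only when ch is a line boundary.
        if len(("x" + chr(cp) + "y").splitlines()) == 2:
            found.append(cp)
    return found


# 0x3000 is an arbitrary upper bound chosen to include U+2028 and U+2029.
boundaries = line_boundary_code_points(0x3000)
print([hex(cp) for cp in boundaries])
```

Note that \x1f (Information Separator One) is not in the resulting list, which matches the observation in the original report that the last separator does not end a line.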
msg192123 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-07-01 12:05
There are two issues that I could find related to these characters, one of them still open: #18236 and #7643. The latter contains a fairly complete discussion of the underlying issue, but on a quick read through it is not clear to me whether the linebreak issue was actually completely addressed.

It would be good if someone knowledgeable about Unicode would read through that issue and see if the current state of affairs is in fact correct. If so (as seems likely, given that there were Unicode experts weighing in on that issue), we need to improve the splitlines docs at least (as was suggested in that issue but not done). How tightly related that issue is to this one depends on how codecs and IO implement their linebreak algorithms.

Perhaps we should retitle this issue "make Python's treatment of 'information separator' and other line break characters consistent". Since backward compatibility is an issue, if there are changes to be made there may be changes that can only be made in 3.4.
msg192280 - (view)	Author: Paul (wpk)	Date: 2013-07-04 09:18
Right, #7643 indeed seems to be exactly about the issue I described here (for as much as I know unicode which isn't all that much). So maybe they should be merged. The issue was closed March 2010, is that after 2.7.3 was released? By the way, where I wrote \x12, \x13, \x14, and \x15, I should have written \x1c, \x1d, \x1e, \x1f (the hex representation of characters 28 to 31). Lost in translation, I guess.
msg327095 - (view)	Author: Neil Schemenauer (nascheme) *	Date: 2018-10-04 22:53
I think one bug here is that codecs readers use str.splitlines() internally. The splitlines method treats a bunch of different characters as line separators, unlike io.<file>.readlines(). So, you end up with different behavior between doing iter(codecs.getreader(...)) and iter(io.open(...)). We can argue if str.splitlines() is doing the correct thing, see the table here: https://docs.python.org/3.8/library/stdtypes.html#str.splitlines However, it seems clearer to me that readlines() on a codecs reader and on a file object should really be splitting lines on the same characters.
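The difference described above can be reproduced without touching the file system. This is a minimal Python 3 sketch that assumes in-memory streams (io.BytesIO) behave like the files in the original report; the variable names are illustrative only.

```python
import codecs
import io

# FS, GS, RS and US between letters; the text contains no \n or \r.
text = "a\x1cb\x1dc\x1ed\x1fe"
data = text.encode("utf-8")

# codecs stream reader: per the message above, line iteration relies on
# str.splitlines() internally.
codecs_reader = codecs.getreader("utf-8")(io.BytesIO(data))
codecs_lines = list(codecs_reader)

# io text layer: only \n, \r and \r\n end a line.
io_reader = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
io_lines = list(io_reader)

print(codecs_lines)
print(io_lines)
```

On interpreters affected by this issue, the codecs reader yields four pieces while the io wrapper yields the text as a single line.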
msg327096 - (view)	Author: Neil Schemenauer (nascheme) *	Date: 2018-10-05 00:20
Attached is a rough patch that tries to fix this problem. I changed the behavior in that unicode char 0x2028 is no longer treated as a line separator. It would be trivial to change the regex to support that too, if we want to preserve backwards compatibility. Personally, I think readlines() on a codecs reader should do the same line splitting as an 'io' file.

If we want to use the patch, the following must yet be done: write tests that check the splitting on FS, RS, and GS characters. Write a news entry. I didn't do any profiling to see what the performance effect of my change is so that should be checked too.
msg327098 - (view)	Author: Neil Schemenauer (nascheme) *	Date: 2018-10-05 03:23
Some further progress on this. My patch slows down reading files with the codecs module very significantly. So, I think it could never be merged as is. Maybe we would need to implement an alternative str.splitlines that behaves as we want, implemented in C.

Looking at the uses of str.splitlines in the stdlib, I can't help but think that this (IMHO bad) behaviour of splitting on all these extra control characters means splitlines should not be used in most of those places. Or, we should change splitlines to work the same as the file readlines splitting. For example, RobotFileParser uses str.splitlines(). I suspect it should only be splitting on \n characters.
msg327100 - (view)	Author: Neil Schemenauer (nascheme) *	Date: 2018-10-05 04:44
New patch that changes str.splitlines to work like Python 2 str.splitlines and like Python 3 bytes.splitlines. Surprisingly, only a few cases in the unit test suite fail. I've fixed them in my patch.
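For readers unfamiliar with the difference the patch targets, here is a small Python 3 comparison of str.splitlines() and bytes.splitlines() on the same content. It illustrates current behaviour only and is not taken from the attached patch; the sample string is made up for the example.

```python
# Sample containing FS, GS, RS, vertical tab, and ordinary newlines.
text = "a\x1cb\x1dc\x1ed\x0be\nf\r\ng"

# str.splitlines() follows the wider Unicode-oriented set of boundaries.
str_lines = text.splitlines()

# bytes.splitlines() only recognises \n, \r and \r\n.
bytes_lines = text.encode("ascii").splitlines()

print(str_lines)
print(bytes_lines)
```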
msg327101 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-10-05 05:11
There is an open issue for changing str.splitlines(): issue22232. It would help to fix this issue. The only problem is that we don't have agreement about the new parameter name (and changing the behavior unconditionally is not an option).
msg327104 - (view)	Author: Neil Schemenauer (nascheme) *	Date: 2018-10-05 06:09
I just found bug #22232 myself but thanks for pointing it out.

> changing the behavior unconditionally is not an option

At this point, I disagree. If I do a search on the web, lots of pages referring to str.splitlines() seem to imply that it splits only on \r and \n. For Python 2 that was correct. I think most people would be surprised by the Python 3 behaviour.

I looked through the Python stdlib and marked any place str.splitlines() was used. I have more research to do yet but I think nearly all of these cases will work better (or perhaps correctly) if str.splitlines is changed.
msg327112 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2018-10-05 08:07
The Unicode .splitlines() splits strings on what Unicode defines as linebreak characters (all code points with character properties Zl or bidirectional property B).

This is different from what typical CSV file parsers or other parsers built for ASCII text files treat as newline. They usually only break on CR, CRLF, LF, so the use of .splitlines() in this context is wrong, not the method itself.

It may make sense to extend .splitlines() to accept a list of linebreak characters to break on, but that would make it a lot slower and the same can already be had by using re.split() on Unicode strings.

Closing this as won't fix.
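The re.split() alternative mentioned above might look roughly like this. It is a sketch only: the helper name split_ascii_lines is hypothetical, and dropping the trailing empty string is an assumption made here so the result mirrors what str.splitlines() returns for text ending in a newline.

```python
import re

# CRLF is listed first so that "\r\n" counts as one break, not two.
_ASCII_NEWLINE = re.compile(r"\r\n|\r|\n")


def split_ascii_lines(text):
    """Split text on CR, LF and CRLF only (hypothetical helper)."""
    parts = _ASCII_NEWLINE.split(text)
    # re.split() leaves a trailing "" when the text ends with a newline
    # (or is empty); str.splitlines() does not, so drop it.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


print(split_ascii_lines("a\x1cb\nc\r\nd\re\n"))
```

Unlike str.splitlines(), this leaves FS, GS, RS, U+2028 and similar characters inside the line they occur in.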
msg327125 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-10-05 11:26
PR 9711 splits lines using regular expressions. This fixes this issue without changing str.splitlines(). After adding a new option in str.splitlines() the code in master can be simplified.
msg327127 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2018-10-05 11:39
Sorry, I probably wasn't clear: the codecs interface is a direct interface to the Unicode codecs and thus has to work according to what Unicode defines.

Your PR changes this to be non-compliant and does this for all codecs. That's a major backwards-incompatible and Unicode-incompatible change and I'm -1 on such a change for the stated reasons.

If people want to have ASCII only line break handling, they should use the io module, which only uses the codecs and can apply different logic (as it does).

Please note that many file formats were not defined for Unicode, and it's only natural that using Unicode codecs on them will result in some differences compared to the ASCII world. Line breaks are one of those differences, but there are plenty of others as well, e.g. potentially breaking combining characters or bidi sections, different ideas about upper and lower case handling, different interpretations of control characters, etc.

The approach to this has to be left with the applications dealing with these formats. The stdlib has to stick to standards and clear documentation.
msg327129 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-10-05 12:06
Then this particularity of codecs streams should be explicitly documented. codecs.open() was advertised as a way of writing portable code for Python 2 and 3, and it can still be used in many old programs.
msg327134 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2018-10-05 12:28
On 05.10.2018 14:06, Serhiy Storchaka wrote:
> Then this particularity of codecs streams should be explicitly documented.

Yes, probably. Such extensions of scope for different character types in Unicode vs. ASCII are a common gotcha when moving from Python 2 to 3. The same applies to e.g. upper/lower case conversion, conversion to numeric values, the various .is*() methods, etc.

> codecs.open() was advertised as a way of writing portable code for Python 2 and 3, and it can still be used in many old programs.

AFAIR, we changed this to recommend io.open() instead, after the io module was rewritten in C. Before that we did indeed advertise codecs.open() as a way to write code which produces Unicode in a similar way as io does in Python 3 (they were never fully identical, though).
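A few concrete instances of the wider Unicode scope mentioned in this reply, as a minimal Python 3 sketch. The specific characters are chosen here for illustration and do not come from the thread.

```python
# Case conversion can change the length of a string.
sharp_s_upper = "\u00df".upper()       # LATIN SMALL LETTER SHARP S

# int() accepts decimal digits from scripts other than ASCII.
arabic_three = int("\u0663")           # ARABIC-INDIC DIGIT THREE

# The .is*() methods cover more than the ASCII ranges and differ
# from each other on some characters.
superscript_two = "\u00b2"             # SUPERSCRIPT TWO
is_digit = superscript_two.isdigit()
is_decimal = superscript_two.isdecimal()

print(repr(sharp_s_upper), arabic_three, is_digit, is_decimal)
```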

History
Date	User	Action	Args
2022-04-11 14:57:47	admin	set	github: 62491
2018-10-05 12:28:09	lemburg	set	messages: + msg327134
2018-10-05 12:06:49	serhiy.storchaka	set	messages: + msg327129
2018-10-05 11:39:30	lemburg	set	messages: + msg327127
2018-10-05 11:26:00	serhiy.storchaka	set	status: closed -> open versions: + Python 2.7, Python 3.6, Python 3.7, Python 3.8 messages: + msg327125 resolution: wont fix -> stage: resolved -> patch review
2018-10-05 11:22:28	serhiy.storchaka	set	pull_requests: + pull_request9094
2018-10-05 08:07:02	lemburg	set	status: open -> closed resolution: wont fix messages: + msg327112 stage: resolved
2018-10-05 06:09:20	nascheme	set	messages: + msg327104
2018-10-05 05:11:50	serhiy.storchaka	set	messages: + msg327101
2018-10-05 04:44:44	nascheme	set	files: + str_splitlines.txt messages: + msg327100
2018-10-05 03:28:41	xtreak	set	nosy: + xtreak
2018-10-05 03:23:44	nascheme	set	messages: + msg327098
2018-10-05 00:20:23	nascheme	set	files: + codecs_splitlines.txt messages: + msg327096
2018-10-04 22:53:00	nascheme	set	nosy: + nascheme messages: + msg327095 versions: - Python 2.7, Python 3.3, Python 3.4
2018-10-04 21:01:39	serhiy.storchaka	link	issue34801 superseder
2015-07-10 08:05:19	martin.panter	set	title: codecs.open interprets space as line ends -> codecs.open interprets FS, RS, GS as line ends
2013-07-04 09:18:42	wpk	set	messages: + msg192280
2013-07-01 12:05:21	r.david.murray	set	messages: + msg192123
2013-07-01 09:41:45	serhiy.storchaka	set	messages: + msg192118
2013-07-01 09:36:06	serhiy.storchaka	link	issue18337 superseder
2013-06-25 12:20:54	vstinner	set	messages: + msg191853
2013-06-25 11:33:37	serhiy.storchaka	set	messages: + msg191849 versions: + Python 3.3, Python 3.4
2013-06-25 11:11:18	wpk	set	messages: + msg191848
2013-06-24 16:33:39	r.david.murray	set	nosy: + r.david.murray messages: + msg191784
2013-06-24 15:17:19	wpk	set	files: + codecs-io-example.py messages: + msg191778
2013-06-24 14:32:39	serhiy.storchaka	set	nosy: + lemburg, doerwalter, belopolsky, vstinner, serhiy.storchaka messages: + msg191769 versions: - Python 2.6
2013-06-24 13:11:12	wpk	create