classification
Title: codecs.open interprets FS, RS, GS as line ends
Type: behavior Stage: patch review
Components: IO, Unicode Versions: Python 3.8, Python 3.7, Python 3.6, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: belopolsky, doerwalter, ezio.melotti, lemburg, nascheme, r.david.murray, serhiy.storchaka, vstinner, wpk, xtreak
Priority: normal Keywords:

Created on 2013-06-24 13:11 by wpk, last changed 2018-10-05 12:28 by lemburg.

Files
File name Uploaded Description Edit
codecs-io-example.py wpk, 2013-06-24 15:17 Compare UTF-8 file reading time between io.open and codecs.open.
codecs_splitlines.txt nascheme, 2018-10-05 00:20
str_splitlines.txt nascheme, 2018-10-05 04:44 change str.splitlines to use only \r and \n
Pull Requests
URL Status Linked Edit
PR 9711 open serhiy.storchaka, 2018-10-05 11:22
Messages (21)
msg191758 - (view) Author: Paul (wpk) Date: 2013-06-24 13:11
I hope I am writing in the right place.

When using codecs.open with UTF-8 encoding, it seems characters \x12, \x13, and \x14 are interpreted as end-of-line.

Example code:

>>> with open('unicodetest.txt', 'w') as f:
>>>   f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')
>>> with open('unicodetest.txt', 'r') as f:
>>>   for i,l in enumerate(f):
>>>     print i, l
0 a\x12b\x13c\x14d\x15e

The point here is that it reads it as one line, as I would expect. But using codecs.open with UTF-8 encoding it reads it as many lines:

>>> import codecs
>>> with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
>>>   for i,l in enumerate(f):
>>>     print i, l
0 a\x12
1 b\x13
2 c\x14
3 d\x15e

The characters \x12 through \x15 are described as "Information Separator Four" through "One" (in that order). As far as I can see they never mark line ends. Also interestingly, \x15 isn't interpreted as such.

As a sidenote, I tested and verified that io.open is correct (but when reading loads of data it appears to be 5 times slower than codecs):

>>> import io
>>> with io.open('unicodetest.txt', encoding='UTF-8') as f:
>>>   for i,l in enumerate(f):
>>>     print i, l
0 a\x12b\x13c\x14d\x15e
msg191769 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-24 14:32
Could you please provide an example which exposes slowness of io.open() by comparison with codecs.open().
msg191778 - (view) Author: Paul (wpk) Date: 2013-06-24 15:17
Sorry for bringing that up, as I suppose it is unrelated to the bug I am reporting, but you can find an example file attached with timings.
msg191784 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-06-24 16:33
Is the "slower" test on 2.6?  io would definitely be slower there, since it is pure python.  2.7 has the C accelerated version.
msg191848 - (view) Author: Paul (wpk) Date: 2013-06-25 11:11
You're absolutely right. I tested it on another machine now, with Python 2.7.3 installed and it is actually twice as fast as codecs. Thanks.

So I guess there is little interest in fixing codecs because io is the preferred package for reading unicode files.
msg191849 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-25 11:33
I guess Victor has an interest. ;)
msg191853 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-06-25 12:20
>> So I guess there is little interest in fixing codecs because io is the
>> preferred package for reading unicode files.

> I guess Victor has an interest. ;)

Ah ah, good joke. I wrote PEP 400:
http://www.python.org/dev/peps/pep-0400/

And yes, for best performance, you have to choose between the codecs
and io modules depending on the Python version. The PEP suggests always
using the io module because it has more features, like universal
newlines, and fewer bugs.
msg192118 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-07-01 09:41
Contrary to the documentation, str.splitlines() splits lines not only on '\n', '\r\n' and '\r'.

>>> 'a'.join(chr(i) for i in range(32)).splitlines(True)
['\x00a\x01a\x02a\x03a\x04a\x05a\x06a\x07a\x08a\ta\n', 'a\x0b', 'a\x0c', 'a\r', 'a\x0ea\x0fa\x10a\x11a\x12a\x13a\x14a\x15a\x16a\x17a\x18a\x19a\x1aa\x1ba\x1c', 'a\x1d', 'a\x1e', 'a\x1f']
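The same observation can be made one character at a time. The following is a minimal sketch for Python 3 that probes each of the 32 C0 control characters and records which ones str.splitlines() treats as a line boundary; the variable name `separators` is only illustrative.

```python
# Probe which C0 control characters (code points 0-31) act as line
# boundaries for str.splitlines() on Python 3.
separators = [
    i for i in range(32)
    # A two-element result means chr(i) split "a" from "b".
    if len(("a" + chr(i) + "b").splitlines()) == 2
]

print([hex(i) for i in separators])
```

On Python 3 this should list LF, VT, FF, CR, FS, GS and RS, while US (0x1f) is absent, which matches the output shown above.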
msg192123 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-07-01 12:05
There are two issues that I could find related to these characters, one of them still open: #18236 and #7643. The latter contains a fairly complete discussion of the underlying issue, but on a quick read-through it is not clear to me whether the linebreak issue was actually completely addressed. It would be good if someone knowledgeable about Unicode would read through that issue and see if the current state of affairs is in fact correct. If so (as seems likely, given that there were Unicode experts weighing in on that issue), we need to at least improve the splitlines docs (as was suggested in that issue but not done). How tightly related that issue is to this one depends on how codecs and io implement their linebreak algorithms.

Perhaps we should retitle this issue "make Python's treatment of 'information separator' and other line break characters consistent".

Since backward compatibility is an issue, if there are changes to be made there may be changes that can only be made in 3.4.
msg192280 - (view) Author: Paul (wpk) Date: 2013-07-04 09:18
Right, #7643 indeed seems to be exactly about the issue I described here (as far as I understand Unicode, which isn't all that much). So maybe they should be merged. The issue was closed in March 2010; is that after 2.7.3 was released?

By the way, where I wrote \x12, \x13, \x14, and \x15, I should have written \x1c, \x1d, \x1e, \x1f (the hex representation of characters 28 to 31). Lost in translation, I guess.
msg327095 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2018-10-04 22:53
I think one bug here is that codecs readers use str.splitlines() internally.  The splitlines method treats a bunch of different characters as line separators, unlike io.<file>.readlines().  So, you end up with different behavior between doing iter(codecs.getreader(...)) and iter(io.open(...)).

We can argue if str.splitlines() is doing the correct thing, see the table here:
https://docs.python.org/3.8/library/stdtypes.html#str.splitlines

However, it seems clearer to me that readlines() on a codecs reader and on a file object should really be splitting lines on the same characters.
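The difference can be reproduced in memory, without a file on disk. The sketch below is written for Python 3 and assumes the codecs stream reader behaves as described in this thread; the sample data and variable names are illustrative only, and the codecs result may differ on versions where the reader has been changed.

```python
import codecs
import io

# One logical line containing FS, GS, RS and US, terminated by "\n".
data = "a\x1cb\x1dc\x1ed\x1fe\n".encode("utf-8")

# io text layer: only "\n", "\r" and "\r\n" end a line.
io_lines = list(io.TextIOWrapper(io.BytesIO(data), encoding="utf-8"))

# codecs stream reader: per the discussion above, line splitting goes
# through str.splitlines(), so more characters may end a line.
codecs_lines = list(codecs.getreader("utf-8")(io.BytesIO(data)))

print(len(io_lines), io_lines)
print(len(codecs_lines), codecs_lines)
```

Both readers return the same characters overall; only the places where the text is cut into lines differ.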
msg327096 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2018-10-05 00:20
Attached is a rough patch that tries to fix this problem.  I changed the behavior in that unicode char 0x2028 is no longer treated as a line separator.  It would be trivial to change the regex to support that too, if we want to preserve backwards compatibility.  Personally, I think readlines() on a codecs reader should do the same line splitting as an 'io' file.

If we want to use the patch, the following must yet be done: write tests that check the splitting on FS, RS, and GS characters.  Write a news entry.  I didn't do any profiling to see what the performance effect of my change is so that should be checked too.
msg327098 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2018-10-05 03:23
Some further progress on this.  My patch slows down reading files with the codecs module very significantly.  So, I think it could never be merged as is.  Maybe we would need to implement an alternative str.splitlines that behaves as we want, implemented in C.

Looking at the uses of str.splitlines in the stdlib, I can't help but think there are many places where this (IMHO bad) behaviour of splitting on all these extra control characters has made it so that splitlines should not be used in most cases.  Or, we should change splitlines to work the same as the file readlines splitting.

For example, RobotFileParser uses str.splitlines().  I suspect it should only be splitting on \n characters.
msg327100 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2018-10-05 04:44
New patch that changes str.splitlines to work like Python 2 str.splitlines and like Python 3 bytes.splitlines.  Surprisingly, only a few cases in the unit test suite fail.  I've fixed them in my patch.
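For comparison, here is a small sketch of how str.splitlines and bytes.splitlines differ on Python 3 today, independent of the attached patch. The round trip through UTF-8 bytes is only an illustration; it relies on UTF-8 being ASCII-compatible and would not be safe for encodings such as UTF-16.

```python
text = "a\x1cb\x1dc\x1ed\x1fe\nsecond line\r\nthird"

# str.splitlines() also breaks on FS (0x1c), GS (0x1d) and RS (0x1e).
str_lines = text.splitlines()

# bytes.splitlines() only breaks on b"\n", b"\r" and b"\r\n".
bytes_lines = [raw.decode("utf-8") for raw in text.encode("utf-8").splitlines()]

print(str_lines)
print(bytes_lines)
```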
msg327101 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-10-05 05:11
There is an open issue for changing str.splitlines(): issue22232. It would help to fix this issue. The only problem is that we don't have agreement about the new parameter name (and changing the behavior unconditionally is not an option).
msg327104 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2018-10-05 06:09
I just found bug #22232 myself but thanks for pointing it out.

> changing the behavior unconditionally is not an option

At this point, I disagree.  If I do a search on the web, lots of pages referring to str.splitlines() seem to imply that it splits only on \r and \n.  For Python 2 that was correct.  I think most people would be surprised by the Python 3 behaviour.

I looked through the Python stdlib and marked any place str.splitlines() was used.  I have more research to do yet but I think nearly all of these cases will work better (or perhaps correctly) if str.splitlines is changed.
msg327112 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2018-10-05 08:07
The Unicode .splitlines() splits strings on what Unicode defines as linebreak characters (all code points with character properties Zl or bidirectional property B).
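The properties mentioned here can be inspected with the standard unicodedata module. This is only an exploratory sketch: it prints the general category and bidirectional class of a few characters next to whether str.splitlines() splits on them, and the `props` dictionary is an illustrative name, not part of any API.

```python
import unicodedata

# Characters discussed in this issue, plus the Unicode line and
# paragraph separators.
chars = "\n\r\x1c\x1d\x1e\x1f\u2028\u2029"

props = {}
for ch in chars:
    props[ch] = (
        unicodedata.category(ch),        # general category, e.g. "Cc" or "Zl"
        unicodedata.bidirectional(ch),   # bidirectional class, e.g. "B" or "S"
        len(f"a{ch}b".splitlines()) == 2,  # does splitlines() break here?
    )
    print(f"U+{ord(ch):04X}", *props[ch])
```

FS, GS and RS carry bidirectional class B (paragraph separator), whereas US carries class S (segment separator), which is consistent with US not being treated as a line end.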

This differs from what typical CSV file parsers or other parsers built for ASCII text files treat as a newline. They usually only break on CR, CRLF, LF, so the use of .splitlines() in this context is wrong, not the method itself.

It may make sense to extend .splitlines() to accept a list of linebreak characters to break on, but that would make it a lot slower and the same can already be had by using re.split() on Unicode strings.
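A regular-expression approach along those lines might look like the sketch below. The helper name `split_ascii_lines` and its `keepends` parameter are hypothetical, chosen here to mirror str.splitlines(); plain re.split() would also work, but it leaves a trailing empty string when the text ends with a newline, so this version uses finditer instead.

```python
import re

# Only the ASCII line endings; "\r\n" is listed first so it is matched
# as a single terminator.
_ASCII_NEWLINE = re.compile(r"\r\n|\r|\n")


def split_ascii_lines(text, keepends=False):
    """Split text on CR, LF and CRLF only (hypothetical helper)."""
    lines = []
    start = 0
    for match in _ASCII_NEWLINE.finditer(text):
        end = match.end() if keepends else match.start()
        lines.append(text[start:end])
        start = match.end()
    # Keep a final line that has no terminator, as str.splitlines() does.
    if start < len(text):
        lines.append(text[start:])
    return lines


print(split_ascii_lines("a\x1cb\r\nc\nd\re"))
```

As noted above, this is expected to be slower than the built-in method, so it is better suited to parsers of ASCII-era formats than to general use.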

Closing this as won't fix.
msg327125 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-10-05 11:26
PR 9711 splits lines using regular expressions. This fixes this issue without changing str.splitlines(). After adding a new option in str.splitlines() the code in master can be simplified.
msg327127 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2018-10-05 11:39
Sorry, I probably wasn't clear: the codecs interface is a direct 
interface to the Unicode codecs and thus has to work according to 
what Unicode defines.

Your PR changes this to be non-compliant and does this for all codecs.
That's a major backwards and Unicode incompatible change and I'm -1
on such a change for the stated reasons.

If people want to have ASCII only line break handling, they should
use the io module, which only uses the codecs and can apply different
logic (as it does).
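As a rough illustration of that io behaviour, the sketch below reads the same bytes through io.TextIOWrapper with two settings of its documented `newline` parameter. The helper `read_lines` and the sample data are illustrative assumptions.

```python
import io

data = "a\x1cb\r\nc\rd\n".encode("utf-8")


def read_lines(newline):
    """Decode `data` as UTF-8 and return its lines (hypothetical helper)."""
    stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline=newline)
    return list(stream)


# newline=None: universal newlines, every ending is translated to "\n".
translated = read_lines(None)

# newline="": universal newlines, endings are returned untranslated.
untranslated = read_lines("")

print(translated)
print(untranslated)
```

In both modes the FS character stays inside the first line.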

Please note that many file formats were not defined for Unicode,
and it's only natural that using Unicode codecs on them will
result in some differences compared to the ASCII world. Line breaks
are one of those differences, but there are plenty others as well,
e.g. potentially breaking combining characters or bidi sections,
different ideas about upper and lower case handling, different
interpretations of control characters, etc.

The approach to this has to be left with the applications dealing
with these formats. The stdlib has to stick to standards and
clear documentation.
msg327129 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-10-05 12:06
Then this particularity of codecs streams should be explicitly documented.

codecs.open() was advertised as a way of writing portable code for Python 2 and 3, and it can still be used in many old programs.
msg327134 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2018-10-05 12:28
On 05.10.2018 14:06, Serhiy Storchaka wrote:
> 
> Then this particularity of codecs streams should be explicitly documented.

Yes, probably. Such extensions of scope for different character
types in Unicode vs. ASCII are a common gotcha when moving from
Python 2 to 3. The same applies to e.g. upper/lower
case conversion, conversion to numeric values, the various .is*()
methods, etc.

> codecs.open() was advertised as a way of writing portable code for Python 2 and 3, and it can still be used in many old programs.

AFAIR, we changed this to recommend io.open() instead,
after the io module was rewritten in C.

Before that we did indeed advertise codecs.open() as a way to
write code which produces Unicode in a similar way as io does
in Python 3 (they were never fully identical, though).
History
Date User Action Args
2018-10-05 12:28:09 lemburg set messages: + msg327134
2018-10-05 12:06:49 serhiy.storchaka set messages: + msg327129
2018-10-05 11:39:30 lemburg set messages: + msg327127
2018-10-05 11:26:00 serhiy.storchaka set status: closed -> open
    versions: + Python 2.7, Python 3.6, Python 3.7, Python 3.8
    messages: + msg327125
    resolution: wont fix ->
    stage: resolved -> patch review
2018-10-05 11:22:28 serhiy.storchaka set pull_requests: + pull_request9094
2018-10-05 08:07:02 lemburg set status: open -> closed
    resolution: wont fix
    messages: + msg327112
    stage: resolved
2018-10-05 06:09:20 nascheme set messages: + msg327104
2018-10-05 05:11:50 serhiy.storchaka set messages: + msg327101
2018-10-05 04:44:44 nascheme set files: + str_splitlines.txt
    messages: + msg327100
2018-10-05 03:28:41 xtreak set nosy: + xtreak
2018-10-05 03:23:44 nascheme set messages: + msg327098
2018-10-05 00:20:23 nascheme set files: + codecs_splitlines.txt
    messages: + msg327096
2018-10-04 22:53:00 nascheme set nosy: + nascheme
    messages: + msg327095
    versions: - Python 2.7, Python 3.3, Python 3.4
2018-10-04 21:01:39 serhiy.storchaka link issue34801 superseder
2015-07-10 08:05:19 martin.panter set title: codecs.open interprets space as line ends -> codecs.open interprets FS, RS, GS as line ends
2013-07-04 09:18:42 wpk set messages: + msg192280
2013-07-01 12:05:21 r.david.murray set messages: + msg192123
2013-07-01 09:41:45 serhiy.storchaka set messages: + msg192118
2013-07-01 09:36:06 serhiy.storchaka link issue18337 superseder
2013-06-25 12:20:54 vstinner set messages: + msg191853
2013-06-25 11:33:37 serhiy.storchaka set messages: + msg191849
    versions: + Python 3.3, Python 3.4
2013-06-25 11:11:18 wpk set messages: + msg191848
2013-06-24 16:33:39 r.david.murray set nosy: + r.david.murray
    messages: + msg191784
2013-06-24 15:17:19 wpk set files: + codecs-io-example.py
    messages: + msg191778
2013-06-24 14:32:39 serhiy.storchaka set nosy: + lemburg, doerwalter, belopolsky, vstinner, serhiy.storchaka
    messages: + msg191769
    versions: - Python 2.6
2013-06-24 13:11:12 wpk create