
codecs.open interprets FS, RS, GS as line ends #62491

Open
wpk mannequin opened this issue Jun 24, 2013 · 21 comments
Labels
3.11 only security fixes 3.12 bugs and security fixes 3.13 bugs and security fixes topic-IO topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@wpk
Mannequin

wpk mannequin commented Jun 24, 2013

BPO 18291
Nosy @malemburg, @doerwalter, @nascheme, @abalkin, @vstinner, @ezio-melotti, @bitdancer, @serhiy-storchaka, @tirkarthi
PRs
  • gh-62491: codecs text streams now split lines only with \r, \n and \r\n. #9711
Files
  • codecs-io-example.py: Compare UTF-8 file reading time between io.open and codecs.open.
  • codecs_splitlines.txt
  • str_splitlines.txt: change str.splitlines to use only \r and \n
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2013-06-24.13:11:12.690>
    labels = ['3.8', 'type-bug', '3.7', 'expert-unicode', 'expert-IO']
    title = 'codecs.open interprets FS, RS, GS as line ends'
    updated_at = <Date 2018-10-05.12:28:09.232>
    user = 'https://bugs.python.org/wpk'

    bugs.python.org fields:

    activity = <Date 2018-10-05.12:28:09.232>
    actor = 'lemburg'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Unicode', 'IO']
    creation = <Date 2013-06-24.13:11:12.690>
    creator = 'wpk'
    dependencies = []
    files = ['30688', '47851', '47852']
    hgrepos = []
    issue_num = 18291
    keywords = []
    message_count = 21.0
    messages = ['191758', '191769', '191778', '191784', '191848', '191849', '191853', '192118', '192123', '192280', '327095', '327096', '327098', '327100', '327101', '327104', '327112', '327125', '327127', '327129', '327134']
    nosy_count = 10.0
    nosy_names = ['lemburg', 'doerwalter', 'nascheme', 'belopolsky', 'vstinner', 'ezio.melotti', 'r.david.murray', 'serhiy.storchaka', 'wpk', 'xtreak']
    pr_nums = ['9711']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue18291'
    versions = ['Python 2.7', 'Python 3.6', 'Python 3.7', 'Python 3.8']

    @wpk
    Mannequin Author

    wpk mannequin commented Jun 24, 2013

    I hope I am writing in the right place.

    When using codecs.open with UTF-8 encoding, it seems characters \x12, \x13, and \x14 are interpreted as end-of-line.

    Example code:

    >>> with open('unicodetest.txt', 'w') as f:
    ...   f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')
    >>> with open('unicodetest.txt', 'r') as f:
    ...   for i,l in enumerate(f):
    ...     print i, l
    0 a\x12b\x13c\x14d\x15e

    The point here is that it reads it as one line, as I would expect. But using codecs.open with UTF-8 encoding it reads it as many lines:

    >>> import codecs
    >>> with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
    ...   for i,l in enumerate(f):
    ...     print i, l
    0 a\x12
    1 b\x13
    2 c\x14
    3 d\x15e

    The characters \x12 through \x15 are described as "Information Separator Four" through "One" (in that order). As far as I can see they never mark line ends. Also interestingly, \x15 isn't interpreted as such.

    As a sidenote, I tested and verified that io.open is correct (but when reading loads of data it appears to be 5 times slower than codecs):

    >>> import io
    >>> with io.open('unicodetest.txt', encoding='UTF-8') as f:
    ...   for i,l in enumerate(f):
    ...     print i, l
    0 a\x12b\x13c\x14d\x15e
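For readers on Python 3, the discrepancy can be reproduced with the following self-contained sketch (it writes to a temporary file instead of the reporter's unicodetest.txt; chr(28) through chr(31) are the separators FS, GS, RS, US, i.e. \x1c through \x1f):

```python
import codecs
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'unicodetest.txt')
    # chr(28)..chr(31) are FS, GS, RS, US -- \x1c..\x1f.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('a\x1cb\x1dc\x1ed\x1fe')

    with io.open(path, encoding='utf-8') as f:
        io_lines = list(f)        # io splits only on \n, \r, \r\n

    with codecs.open(path, 'r', 'UTF-8') as f:
        codecs_lines = list(f)    # codecs splits via str.splitlines()

print(io_lines)      # one line: ['a\x1cb\x1dc\x1ed\x1fe']
print(codecs_lines)  # four "lines": ['a\x1c', 'b\x1d', 'c\x1e', 'd\x1fe']
```

Note that US (\x1f) is not treated as a line boundary, matching the reporter's observation that the last separator is not split on.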

    @wpk wpk mannequin added topic-unicode topic-IO type-bug An unexpected behavior, bug, or error labels Jun 24, 2013
    @serhiy-storchaka
    Member

    Could you please provide an example which exposes the slowness of io.open() in comparison with codecs.open()?

    @wpk
    Mannequin Author

    wpk mannequin commented Jun 24, 2013

    Sorry for bringing that up, as I suppose it is unrelated to the bug I am reporting, but you can find an example file attached with timings.

    @bitdancer
    Member

    Is the "slower" test on 2.6? io would definitely be slower there, since it is pure Python. 2.7 has the C-accelerated version.

    @wpk
    Mannequin Author

    wpk mannequin commented Jun 25, 2013

    You're absolutely right. I tested it on another machine now, with Python 2.7.3 installed, and io is actually twice as fast as codecs. Thanks.

    So I guess there is little interest in fixing codecs because io is the preferred package for reading unicode files.

    @serhiy-storchaka
    Member

    I guess Victor has an interest. ;)

    @vstinner
    Member

    > So I guess there is little interest in fixing codecs because io is the
    > preferred package for reading unicode files.

    I guess Victor has an interest. ;)

    Ah ah, good joke. I wrote PEP 400:
    http://www.python.org/dev/peps/pep-0400/

    And yes, for best performance, you have to choose between the codecs
    and io modules depending on the Python version. The PEP suggests
    always using the io module because it has more features, like
    universal newlines, and fewer bugs.

    @serhiy-storchaka
    Member

    Contrary to the documentation, str.splitlines() splits lines not only on '\n', '\r\n' and '\r'.

    >>> 'a'.join(chr(i) for i in range(32)).splitlines(True)
    ['\x00a\x01a\x02a\x03a\x04a\x05a\x06a\x07a\x08a\ta\n', 'a\x0b', 'a\x0c', 'a\r', 'a\x0ea\x0fa\x10a\x11a\x12a\x13a\x14a\x15a\x16a\x17a\x18a\x19a\x1aa\x1ba\x1c', 'a\x1d', 'a\x1e', 'a\x1f']

    @bitdancer
    Member

    There are two issues that I could find related to these characters, one of them still open: bpo-18236 and bpo-7643. The latter contains a fairly complete discussion of the underlying issue, but on a quick read-through it is not clear to me whether the linebreak issue was actually completely addressed.

    It would be good if someone unicode-knowledgeable would read through that issue and see if the current state of affairs is in fact correct. If so (as seems likely, given that there were unicode experts weighing in on that issue), we need to improve the splitlines docs at least (as was suggested in that issue but not done). How tightly related that issue is to this one depends on how codecs and IO implement their linebreak algorithms.

    Perhaps we should retitle this issue "make Python's treatment of 'information separator' and other line break characters consistent".

    Since backward compatibility is an issue, if there are changes to be made there may be changes that can only be made in 3.4.

    @wpk
    Mannequin Author

    wpk mannequin commented Jul 4, 2013

    Right, bpo-7643 indeed seems to be exactly about the issue I described here (for as much as I know unicode, which isn't all that much), so maybe they should be merged. The issue was closed in March 2010; is that after 2.7.3 was released?

    By the way, where I wrote \x12, \x13, \x14, and \x15, I should have written \x1c, \x1d, \x1e, \x1f (the hex representation of characters 28 to 31). Lost in translation, I guess.

    @vadmium vadmium changed the title codecs.open interprets space as line ends codecs.open interprets FS, RS, GS as line ends Jul 10, 2015
    @nascheme
    Member

    nascheme commented Oct 4, 2018

    I think one bug here is that codecs readers use str.splitlines() internally. The splitlines method treats a bunch of different characters as line separators, unlike io.<file>.readlines(). So, you end up with different behavior between doing iter(codecs.getreader(...)) and iter(io.open(...)).

    We can argue if str.splitlines() is doing the correct thing, see the table here:
    https://docs.python.org/3.8/library/stdtypes.html#str.splitlines

    However, it seems clearer to me that readlines() on a codecs reader and on a file object should really be splitting lines on the same characters.
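A small illustration of the difference described above (the sample characters are drawn from the str.splitlines table linked in the comment):

```python
# str.splitlines() recognizes the full Unicode set of line boundaries,
# while io files split only on \n, \r and \r\n (universal newlines).
s = 'a\x1cb\x85c\u2028d\ne'   # FS, NEL, LINE SEPARATOR, LF

print(s.splitlines())  # five parts: every one of these is a boundary
print(s.split('\n'))   # two parts: only the LF counts
```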

    @nascheme
    Member

    nascheme commented Oct 5, 2018

    Attached is a rough patch that tries to fix this problem. I changed the behavior in that unicode char 0x2028 is no longer treated as a line separator. It would be trivial to change the regex to support that too, if we want to preserve backwards compatibility. Personally, I think readlines() on a codecs reader should do the same line splitting as an 'io' file.

    If we want to use the patch, the following must still be done: write tests that check the splitting on FS, RS, and GS characters, and write a news entry. I also didn't do any profiling to see what the performance effect of my change is, so that should be checked too.

    @nascheme
    Member

    nascheme commented Oct 5, 2018

    Some further progress on this. My patch slows down reading files with the codecs module very significantly. So, I think it could never be merged as is. Maybe we would need to implement an alternative str.splitlines that behaves as we want, implemented in C.

    Looking at the uses of str.splitlines in the stdlib, I can't help but think that this (IMHO bad) behaviour of splitting on all these extra control characters means splitlines should not be used in most of those places. Or, we should change splitlines to work the same as the line splitting done by file readlines.

    For example, RobotFileParser uses str.splitlines(). I suspect it should only be splitting on \n characters.
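A rough pure-Python sketch of what such an alternative splitter could look like (the name ascii_splitlines is made up here; the actual patch would implement this in C):

```python
import re

# Hypothetical helper: split only on \r\n, \r and \n, matching Python 2
# str.splitlines / Python 3 bytes.splitlines semantics.
_LINE = re.compile(r'[^\r\n]*(?:\r\n|\r|\n|$)')

def ascii_splitlines(s, keepends=False):
    lines = []
    for m in _LINE.finditer(s):
        chunk = m.group()
        if not chunk:          # empty match at end of string: done
            break
        lines.append(chunk if keepends else chunk.rstrip('\r\n'))
    return lines

print(ascii_splitlines('a\x1cb\nc'))  # ['a\x1cb', 'c'] -- FS is not a break
print('a\x1cb\nc'.splitlines())       # ['a', 'b', 'c'] -- but it is for str
```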

    @nascheme
    Member

    nascheme commented Oct 5, 2018

    New patch that changes str.splitlines to work like Python 2 str.splitlines and like Python 3 bytes.splitlines. Surprisingly, only a few cases in the unit test suite fail. I've fixed them in my patch.

    @serhiy-storchaka
    Member

    There is an open issue for changing str.splitlines(): bpo-22232. It would help to fix this issue. The only problem is that we don't have agreement about the new parameter name (and changing the behavior unconditionally is not an option).

    @nascheme
    Member

    nascheme commented Oct 5, 2018

    I just found bug bpo-22232 myself but thanks for pointing it out.

    changing the behavior unconditionally is not an option

    At this point, I disagree. If I do a search on the web, lots of pages referring to str.splitlines() seem to imply that it splits only on \r and \n. For Python 2 that was correct. I think most people would be surprised by the Python 3 behaviour.

    I looked through the Python stdlib and marked any place str.splitlines() was used. I have more research to do yet but I think nearly all of these cases will work better (or perhaps correctly) if str.splitlines is changed.

    @malemburg
    Member

    The Unicode .splitlines() splits strings on what Unicode defines as linebreak characters (all code points with character property Zl or bidirectional property B).

    This is different from what typical CSV file parsers, or other parsers built for ASCII text files, treat as newlines. They usually break only on CR, CRLF, LF, so the use of .splitlines() in this context is wrong, not the method itself.

    It may make sense to extend .splitlines() to accept a list of linebreak characters to break on, but that would make it a lot slower, and the same can already be had by using re.split() on Unicode strings.
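For instance, a sketch of the re.split() approach the comment mentions, restricted to the ASCII newline conventions:

```python
import re

text = 'a\x1cb\nc\r\nd'               # FS, LF, CRLF
print(text.splitlines())              # ['a', 'b', 'c', 'd'] -- Unicode rules
print(re.split(r'\r\n|\r|\n', text))  # ['a\x1cb', 'c', 'd'] -- ASCII rules
```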

    Closing this as won't fix.

    @serhiy-storchaka
    Member

    PR 9711 splits lines using regular expressions. This fixes this issue without changing str.splitlines(). Once a new option is added to str.splitlines(), the code in master can be simplified.

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life 3.8 only security fixes labels Oct 5, 2018
    @malemburg
    Member

    Sorry, I probably wasn't clear: the codecs interface is a direct
    interface to the Unicode codecs and thus has to work according to
    what Unicode defines.

    Your PR changes this to be non-compliant, and does so for all codecs.
    That's a major backwards-incompatible and Unicode-incompatible change,
    and I'm -1 on such a change for the stated reasons.

    If people want ASCII-only line break handling, they should use
    the io module, which only uses the codecs and can apply different
    logic (as it does).

    Please note that many file formats were not defined for Unicode,
    and it's only natural that using Unicode codecs on them will
    result in some differences compared to the ASCII world. Line breaks
    are one of those differences, but there are plenty of others as well,
    e.g. potentially breaking combining characters or bidi sections,
    different ideas about upper- and lower-case handling, different
    interpretations of control characters, etc.

    The approach to this has to be left with the applications dealing
    with these formats. The stdlib has to stick to standards and
    clear documentation.

    @serhiy-storchaka
    Member

    Then this particularity of codecs streams should be explicitly documented.

    codecs.open() was advertised as a way of writing portable code for Python 2 and 3, and it can still be used in many old programs.

    @malemburg
    Member

    On 05.10.2018 14:06, Serhiy Storchaka wrote:

    Then this particularity of codecs streams should be explicitly documented.

    Yes, probably. Such extensions of scope for different character
    types in Unicode vs. ASCII are a common gotcha when moving from
    Python 2 to 3. The same applies to e.g. upper/lower
    case conversion, conversion to numeric values, the various .is*()
    methods, etc.

    codecs.open() was advertised as a way of writing portable code for Python 2 and 3, and it can still be used in many old programs.

    AFAIR, we changed this to recommend io.open() instead,
    after the io module was rewritten in C.

    Before that we did indeed advertise codecs.open() as a way to
    write code which produces Unicode in a similar way as io does
    in Python 3 (they were never fully identical, though).

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added 3.11 only security fixes 3.12 bugs and security fixes 3.13 bugs and security fixes and removed 3.8 only security fixes 3.7 (EOL) end of life labels Oct 28, 2023