
codecs.open interprets FS, RS, GS as line ends #62491

Open
wpk mannequin opened this issue Jun 24, 2013 · 21 comments
Labels
3.11 only security fixes 3.12 bugs and security fixes 3.13 bugs and security fixes topic-IO topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@wpk
Mannequin

wpk mannequin commented Jun 24, 2013

BPO 18291
Nosy @malemburg, @doerwalter, @nascheme, @abalkin, @vstinner, @ezio-melotti, @bitdancer, @serhiy-storchaka, @tirkarthi
PRs
  • gh-62491: codecs text streams now split lines only with \r, \n and \r\n. #9711
Files
  • codecs-io-example.py: Compare UTF-8 file reading time between io.open and codecs.open.
  • codecs_splitlines.txt
  • str_splitlines.txt: change str.splitlines to use only \r and \n
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2013-06-24.13:11:12.690>
    labels = ['3.8', 'type-bug', '3.7', 'expert-unicode', 'expert-IO']
    title = 'codecs.open interprets FS, RS, GS as line ends'
    updated_at = <Date 2018-10-05.12:28:09.232>
    user = 'https://bugs.python.org/wpk'

    bugs.python.org fields:

    activity = <Date 2018-10-05.12:28:09.232>
    actor = 'lemburg'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Unicode', 'IO']
    creation = <Date 2013-06-24.13:11:12.690>
    creator = 'wpk'
    dependencies = []
    files = ['30688', '47851', '47852']
    hgrepos = []
    issue_num = 18291
    keywords = []
    message_count = 21.0
    messages = ['191758', '191769', '191778', '191784', '191848', '191849', '191853', '192118', '192123', '192280', '327095', '327096', '327098', '327100', '327101', '327104', '327112', '327125', '327127', '327129', '327134']
    nosy_count = 10.0
    nosy_names = ['lemburg', 'doerwalter', 'nascheme', 'belopolsky', 'vstinner', 'ezio.melotti', 'r.david.murray', 'serhiy.storchaka', 'wpk', 'xtreak']
    pr_nums = ['9711']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue18291'
    versions = ['Python 2.7', 'Python 3.6', 'Python 3.7', 'Python 3.8']

    @wpk
    Mannequin Author

    wpk mannequin commented Jun 24, 2013

    I hope I am writing in the right place.

    When using codecs.open with UTF-8 encoding, it seems characters \x12, \x13, and \x14 are interpreted as end-of-line.

    Example code:

    >>> with open('unicodetest.txt', 'w') as f:
    ...   f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')
    >>> with open('unicodetest.txt', 'r') as f:
    ...   for i,l in enumerate(f):
    ...     print i, l
    0 a\x12b\x13c\x14d\x15e

    The point here is that it reads it as one line, as I would expect. But using codecs.open with UTF-8 encoding it reads it as many lines:

    >>> import codecs
    >>> with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
    ...   for i,l in enumerate(f):
    ...     print i, l
    0 a\x12
    1 b\x13
    2 c\x14
    3 d\x15e

    The characters \x12 through \x15 are described as "Information Separator Four" through "One" (in that order). As far as I can see they never mark line ends. Also interestingly, \x15 isn't interpreted as such.

    As a sidenote, I tested and verified that io.open is correct (but when reading loads of data it appears to be 5 times slower than codecs):

    >>> import io
    >>> with io.open('unicodetest.txt', encoding='UTF-8') as f:
    ...   for i,l in enumerate(f):
    ...     print i, l
    0 a\x12b\x13c\x14d\x15e
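For readers on Python 3, the discrepancy can be reproduced with the following self-contained sketch (it writes to a temporary file instead of the reporter's unicodetest.txt; chr(28) through chr(31) are the separators FS, GS, RS, US, i.e. \x1c through \x1f):

```python
import codecs
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'unicodetest.txt')
    # chr(28)..chr(31) are FS, GS, RS, US -- \x1c..\x1f.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('a\x1cb\x1dc\x1ed\x1fe')

    with io.open(path, encoding='utf-8') as f:
        io_lines = list(f)        # io splits only on \n, \r, \r\n

    with codecs.open(path, 'r', 'UTF-8') as f:
        codecs_lines = list(f)    # codecs splits via str.splitlines()

print(io_lines)      # one line: ['a\x1cb\x1dc\x1ed\x1fe']
print(codecs_lines)  # four "lines": ['a\x1c', 'b\x1d', 'c\x1e', 'd\x1fe']
```

Note that US (\x1f) is not treated as a line boundary, matching the reporter's observation that the last separator is not split on.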

    @wpk wpk mannequin added topic-unicode topic-IO type-bug An unexpected behavior, bug, or error labels Jun 24, 2013
    @serhiy-storchaka
    Member

    Could you please provide an example which exposes the slowness of io.open() in comparison with codecs.open()?

    @wpk
    Mannequin Author

    wpk mannequin commented Jun 24, 2013

    Sorry for bringing that up, as I suppose it is unrelated to the bug I am reporting, but you can find an example file attached with timings.

    @bitdancer
    Member

    Is the "slower" test on 2.6? io would definitely be slower there, since it is pure Python. 2.7 has the C-accelerated version.

    @wpk
    Mannequin Author

    wpk mannequin commented Jun 25, 2013

    You're absolutely right. I tested it on another machine now, with Python 2.7.3 installed, and io is actually twice as fast as codecs. Thanks.

    So I guess there is little interest in fixing codecs because io is the preferred package for reading unicode files.

    @serhiy-storchaka
    Member

    I guess Victor has an interest. ;)

    @vstinner
    Member

    > So I guess there is little interest in fixing codecs because io is the
    > preferred package for reading unicode files.

    I guess Victor has an interest. ;)

    Ah ah, good joke. I wrote PEP 400:
    http://www.python.org/dev/peps/pep-0400/

    And yes, for best performance, you have to choose between the codecs
    and io modules depending on the Python version. The PEP suggests
    always using the io module because it has more features, like
    universal newlines, and fewer bugs.

    @serhiy-storchaka
    Member

    Contrary to the documentation, str.splitlines() splits lines not only on '\n', '\r\n' and '\r'.

    >>> 'a'.join(chr(i) for i in range(32)).splitlines(True)
    ['\x00a\x01a\x02a\x03a\x04a\x05a\x06a\x07a\x08a\ta\n', 'a\x0b', 'a\x0c', 'a\r', 'a\x0ea\x0fa\x10a\x11a\x12a\x13a\x14a\x15a\x16a\x17a\x18a\x19a\x1aa\x1ba\x1c', 'a\x1d', 'a\x1e', 'a\x1f']

    @bitdancer
    Member

    There are two issues that I could find related to these characters, one of them still open: bpo-18236 and bpo-7643. The latter contains a fairly complete discussion of the underlying issue, but on a quick read-through it is not clear to me whether the linebreak issue was actually completely addressed.

    It would be good if someone unicode-knowledgeable would read through that issue and see if the current state of affairs is in fact correct. If so (as seems likely, given that there were unicode experts weighing in on that issue), we need to improve the splitlines docs at least (as was suggested in that issue but not done). How tightly related that issue is to this one depends on how codecs and IO implement their linebreak algorithms.

    Perhaps we should retitle this issue "make Python's treatment of 'information separator' and other line break characters consistent".

    Since backward compatibility is an issue, if there are changes to be made there may be changes that can only be made in 3.4.

    @wpk
    Mannequin Author

    wpk mannequin commented Jul 4, 2013

    Right, bpo-7643 indeed seems to be exactly about the issue I described here (for as much as I know unicode, which isn't all that much), so maybe they should be merged. The issue was closed in March 2010; is that after 2.7.3 was released?

    By the way, where I wrote \x12, \x13, \x14, and \x15, I should have written \x1c, \x1d, \x1e, \x1f (the hex representation of characters 28 to 31). Lost in translation, I guess.

    @vadmium vadmium changed the title codecs.open interprets space as line ends codecs.open interprets FS, RS, GS as line ends Jul 10, 2015
    @nascheme
    Member

    nascheme commented Oct 4, 2018

    I think one bug here is that codecs readers use str.splitlines() internally. The splitlines method treats a bunch of different characters as line separators, unlike io.<file>.readlines(). So, you end up with different behavior between doing iter(codecs.getreader(...)) and iter(io.open(...)).

    We can argue if str.splitlines() is doing the correct thing, see the table here:
    https://docs.python.org/3.8/library/stdtypes.html#str.splitlines

    However, it seems clearer to me that readlines() on a codecs reader and on a file object should really be splitting lines on the same characters.
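A small illustration of the difference described above (the sample characters are drawn from the str.splitlines table linked in the comment):

```python
# str.splitlines() recognizes the full Unicode set of line boundaries,
# while io files split only on \n, \r and \r\n (universal newlines).
s = 'a\x1cb\x85c\u2028d\ne'   # FS, NEL, LINE SEPARATOR, LF

print(s.splitlines())  # five parts: every one of these is a boundary
print(s.split('\n'))   # two parts: only the LF counts
```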

    @nascheme
    Member

    nascheme commented Oct 5, 2018

    Attached is a rough patch that tries to fix this problem. I changed the behavior in that unicode char 0x2028 is no longer treated as a line separator. It would be trivial to change the regex to support that too, if we want to preserve backwards compatibility. Personally, I think readlines() on a codecs reader should do the same line splitting as an 'io' file.

    If we want to use the patch, the following must still be done: write tests that check the splitting on FS, RS, and GS characters, and write a news entry. I also didn't do any profiling to see what the performance effect of my change is, so that should be checked too.

    @nascheme
    Member

    nascheme commented Oct 5, 2018

    Some further progress on this. My patch slows down reading files with the codecs module very significantly. So, I think it could never be merged as is. Maybe we would need to implement an alternative str.splitlines that behaves as we want, implemented in C.

    Looking at the uses of str.splitlines in the stdlib, I can't help but think that this (IMHO bad) behaviour of splitting on all these extra control characters means splitlines should not be used in most of those places. Or, we should change splitlines to work the same as the line splitting done by file readlines.

    For example, RobotFileParser uses str.splitlines(). I suspect it should only be splitting on \n characters.
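A rough pure-Python sketch of what such an alternative splitter could look like (the name ascii_splitlines is made up here; the actual patch would implement this in C):

```python
import re

# Hypothetical helper: split only on \r\n, \r and \n, matching Python 2
# str.splitlines / Python 3 bytes.splitlines semantics.
_LINE = re.compile(r'[^\r\n]*(?:\r\n|\r|\n|$)')

def ascii_splitlines(s, keepends=False):
    lines = []
    for m in _LINE.finditer(s):
        chunk = m.group()
        if not chunk:          # empty match at end of string: done
            break
        lines.append(chunk if keepends else chunk.rstrip('\r\n'))
    return lines

print(ascii_splitlines('a\x1cb\nc'))  # ['a\x1cb', 'c'] -- FS is not a break
print('a\x1cb\nc'.splitlines())       # ['a', 'b', 'c'] -- but it is for str
```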

    @nascheme
    Member

    nascheme commented Oct 5, 2018

    New patch that changes str.splitlines to work like Python 2 str.splitlines and like Python 3 bytes.splitlines. Surprisingly, only a few cases in the unit test suite fail. I've fixed them in my patch.

    @serhiy-storchaka
    Member

    There is an open issue for changing str.splitlines(): bpo-22232. It would help to fix this issue. The only problem is that we don't have agreement about the new parameter name (and changing the behavior unconditionally is not an option).

    @nascheme
    Member

    nascheme commented Oct 5, 2018

    I just found bug bpo-22232 myself but thanks for pointing it out.

    changing the behavior unconditionally is not an option

    At this point, I disagree. If I do a search on the web, lots of pages referring to str.splitlines() seem to imply that it splits only on \r and \n. For Python 2 that was correct. I think most people would be surprised by the Python 3 behaviour.

    I looked through the Python stdlib and marked any place str.splitlines() was used. I have more research to do yet but I think nearly all of these cases will work better (or perhaps correctly) if str.splitlines is changed.

    @malemburg
    Member

    The Unicode .splitlines() splits strings on what Unicode defines as linebreak characters (all code points with character property Zl or bidirectional property B).

    This is different from what typical CSV file parsers, or other parsers built for ASCII text files, treat as newlines. They usually break only on CR, CRLF, LF, so the use of .splitlines() in this context is wrong, not the method itself.

    It may make sense to extend .splitlines() to accept a list of linebreak characters to break on, but that would make it a lot slower, and the same can already be had by using re.split() on Unicode strings.
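For instance, a sketch of the re.split() approach the comment mentions, restricted to the ASCII newline conventions:

```python
import re

text = 'a\x1cb\nc\r\nd'               # FS, LF, CRLF
print(text.splitlines())              # ['a', 'b', 'c', 'd'] -- Unicode rules
print(re.split(r'\r\n|\r|\n', text))  # ['a\x1cb', 'c', 'd'] -- ASCII rules
```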

    Closing this as won't fix.

    @serhiy-storchaka
    Member

    PR 9711 splits lines using regular expressions. This fixes this issue without changing str.splitlines(). Once a new option is added to str.splitlines(), the code in master can be simplified.

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life 3.8 only security fixes labels Oct 5, 2018
    @malemburg
    Member

    Sorry, I probably wasn't clear: the codecs interface is a direct
    interface to the Unicode codecs and thus has to work according to
    what Unicode defines.

    Your PR changes this to be non-compliant, and does so for all codecs.
    That's a major backwards-incompatible and Unicode-incompatible change,
    and I'm -1 on such a change for the stated reasons.

    If people want ASCII-only line break handling, they should use
    the io module, which only uses the codecs and can apply different
    logic (as it does).

    Please note that many file formats were not defined for Unicode,
    and it's only natural that using Unicode codecs on them will
    result in some differences compared to the ASCII world. Line breaks
    are one of those differences, but there are plenty of others as well,
    e.g. potentially breaking combining characters or bidi sections,
    different ideas about upper- and lower-case handling, different
    interpretations of control characters, etc.

    The approach to this has to be left with the applications dealing
    with these formats. The stdlib has to stick to standards and
    clear documentation.

    @serhiy-storchaka
    Member

    Then this particularity of codecs streams should be explicitly documented.

    codecs.open() was advertised as a way of writing portable code for Python 2 and 3, and it can still be used in many old programs.

    @malemburg
    Member

    On 05.10.2018 14:06, Serhiy Storchaka wrote:

    Then this particularity of codecs streams should be explicitly documented.

    Yes, probably. Such extensions of scope for different character
    types in Unicode vs. ASCII are a common gotcha when moving from
    Python 2 to 3. The same applies to e.g. upper/lower
    case conversion, conversion to numeric values, the various .is*()
    methods, etc.

    codecs.open() was advertised as a way of writing portable code for Python 2 and 3, and it can still be used in many old programs.

    AFAIR, we changed this to recommend io.open() instead,
    after the io module was rewritten in C.

    Before that we did indeed advertise codecs.open() as a way to
    write code which produces Unicode in a similar way as io does
    in Python 3 (they were never fully identical, though).

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added 3.11 only security fixes 3.12 bugs and security fixes 3.13 bugs and security fixes and removed 3.8 only security fixes 3.7 (EOL) end of life labels Oct 28, 2023