classification
Title: Line count mismatch between open() vs sys.stdin api calls
Type: behavior Stage: resolved
Components: IO, Library (Lib), Unicode Versions:
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, serhiy.storchaka, steven.daprano, terry.reedy, thammegowda, vstinner
Priority: normal Keywords:

Created on 2019-11-08 06:01 by thammegowda, last changed 2019-11-09 13:08 by eric.smith. This issue is now closed.

Files
File name Uploaded Description
line_break_err.txt thammegowda, 2019-11-08 06:01
Messages (5)
msg356224 - Author: Thamme Gowda (thammegowda) Date: 2019-11-08 06:01
I ran into a line count mismatch and narrowed it down to a 9-line file in which line-break handling causes the discrepancy. The attached file, line_break_err.txt, reproduces the behavior shown below.


$ md5sum line_break_err.txt
5dea501b8e299a0ece94d85977728545  line_break_err.txt

# wc says there are 9 lines
$ wc -l line_break_err.txt
9 line_break_err.txt

# if I read from sys.stdin, I get 9 lines
$ python -c 'import sys; print(sum(1 for x in sys.stdin))' < line_break_err.txt

# but... if I use an open() call, I get 18
$ python -c 'import sys; print("Linecount=", sum(1 for x in open(sys.argv[1])))' line_break_err.txt
Linecount= 18

# changing encoding or error handling has no effect
$ python -c 'import sys; print("Linecount=", sum(1 for x in open(sys.argv[1], "r", encoding="utf-8", errors="replace")))' line_break_err.txt
Linecount= 18

$ python -c 'import sys; print("Linecount=", sum(1 for x in open(sys.argv[1], "r", encoding="utf-8", errors="ignore")))' line_break_err.txt
Linecount= 18

# and it is not just wc: even awk says there are only 9 lines
$ awk 'END {print "Linecount=", NR}' line_break_err.txt
Linecount= 9

# let's see python 2 using io
$ python2 -c 'import sys,io; print("Linecount=", sum(1 for x in io.open(sys.argv[1], encoding="ascii", errors="ignore")))' line_break_err.txt
('Linecount=', 18)

# But this one, which we no longer use, somehow gets it right
$ python2 -c 'import sys; print("Linecount=", sum(1 for x in open(sys.argv[1])))' line_break_err.txt
('Linecount=', 9)


Tested on:
1. Linux 
Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 21:52:21)
[GCC 7.3.0] :: Anaconda, Inc. on linux


2. OSX
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin

3. python 2 on OSX
Python 2.7.16 (default, Jun 19 2019, 07:40:37)
[GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.46.4)] on darwin

---- 

P.S.
This is the first issue I have created. If it is a duplicate, I am happy to close it.
msg356231 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-08 08:30
Try adding newline="\n" to the open() call.
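
If the goal is only to match what wc -l reports, another option is to read in binary mode and count LF bytes, which sidesteps newline translation altogether. A minimal sketch, where the helper name `count_newline_bytes` and the chunk size are assumptions made for illustration:

```python
import io


def count_newline_bytes(binary_stream, chunk_size=1 << 16):
    """Count LF (0x0A) bytes in a binary stream, the quantity `wc -l` reports."""
    total = 0
    while True:
        chunk = binary_stream.read(chunk_size)
        if not chunk:  # empty bytes object means end of stream
            break
        total += chunk.count(b"\n")
    return total


if __name__ == "__main__":
    # Hypothetical data: a bare "\r" inside the first "\n"-terminated line.
    print(count_newline_bytes(io.BytesIO(b"a\rb\nc\n")))
```

For a real file this would be called as `count_newline_bytes(open(path, "rb"))`, ideally inside a `with` block.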
msg356232 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-11-08 09:04
This seems to be the difference between using Universal Newlines mode and not using it. In Python 2, you have to enable it explicitly with a U in the open mode:

    $ python2.7 -c 'import sys; print("Linecount=", sum(1 for x in open(sys.argv[1], "Ur")))' line_break_err.txt
    ('Linecount=', 18)

In Python 3, Universal Newlines is the default for text files, but you can control it with the ``newline`` parameter:

    $ python3.5 -c 'import sys; print("Linecount=", sum(1 for x in open(sys.argv[1], newline="\n")))' line_break_err.txt
    Linecount= 9


    $ python3.5 -c 'import sys; print("Linecount=", sum(1 for x in open(sys.argv[1], newline="\r")))' line_break_err.txt
    Linecount= 15


I think this explains the difference you are seeing. Do you agree?
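
The same effect can be reproduced without the attachment. The following is only a sketch: the sample bytes are made up (they are not the contents of line_break_err.txt), and the names `SAMPLE`, `count_lines` and `demo` are hypothetical.

```python
import os
import tempfile

# Hypothetical sample, not the attached file: three "\n"-terminated lines,
# the first of which also contains a bare "\r".
SAMPLE = b"alpha\rbeta\ngamma\ndelta\n"


def count_lines(path, newline):
    """Count the lines seen when `path` is opened with this newline mode."""
    with open(path, encoding="utf-8", newline=newline) as f:
        return sum(1 for _ in f)


def demo():
    """Write SAMPLE to a temporary file and count its lines in three modes."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(SAMPLE)
        return {
            "universal": count_lines(path, None),  # "\r", "\n", "\r\n" end a line
            "lf_only": count_lines(path, "\n"),    # only "\n" ends a line
            "cr_only": count_lines(path, "\r"),    # only "\r" ends a line
        }
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())  # {'universal': 4, 'lf_only': 3, 'cr_only': 2}
```

The bare "\r" accounts for the extra line in universal mode, which is the same kind of gap as 18 versus 9 above.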
msg356250 - Author: Thamme Gowda (thammegowda) Date: 2019-11-08 17:01
Thanks for the quick response.
Yes, ``newline="\n"`` fixed it.
So it is known behavior. (I was tempted to consider it a bug since the behavior differed from sys.stdin.)

Thank you.
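
For anyone who wants a stream read and an open() call to agree, one option is to wrap the binary stream explicitly and choose the newline mode yourself. This is a sketch under assumptions: in-memory bytes stand in for stdin, and `count_stream_lines` and `DATA` are hypothetical names. The thread does not spell out how sys.stdin itself was configured, so this only shows how the ``newline`` choice changes the count.

```python
import io


def count_stream_lines(binary_stream, newline):
    """Count text lines read from a binary stream with the given newline mode."""
    text = io.TextIOWrapper(binary_stream, encoding="utf-8", newline=newline)
    try:
        return sum(1 for _ in text)
    finally:
        # Detach so the wrapper does not close the caller's stream.
        text.detach()


DATA = b"one\rtwo\nthree\n"  # hypothetical input containing a bare "\r"

if __name__ == "__main__":
    print(count_stream_lines(io.BytesIO(DATA), None))  # universal newlines: 3
    print(count_stream_lines(io.BytesIO(DATA), "\n"))  # split on "\n" only: 2
```

With real input, `sys.stdin.buffer` would take the place of `io.BytesIO(DATA)`.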
msg356278 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2019-11-09 01:11
Should this be closed as 'not a bug'?
History
Date User Action Args
2019-11-09 13:08:09 eric.smith set resolution: not a bug
2019-11-09 02:07:39 thammegowda set status: open -> closed; stage: resolved
2019-11-09 01:11:30 terry.reedy set nosy: + terry.reedy; messages: + msg356278; title: Line count mis match between open() vs sys.stdin api calls -> Line count mismatch between open() vs sys.stdin api calls
2019-11-08 17:01:41 thammegowda set messages: + msg356250
2019-11-08 09:04:49 steven.daprano set nosy: + steven.daprano; messages: + msg356232
2019-11-08 08:30:43 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg356231
2019-11-08 06:01:54 thammegowda set files: + line_break_err.txt
2019-11-08 06:01:02 thammegowda create