Author Richard.Christen@unice.fr
Recipients Richard.Christen@unice.fr, gvanrossum, pythonmeister
Date 2007-09-11.06:18:17
Message-id <46E632C1.80503@unice.fr>
In-reply-to <1189461329.47.0.127999464531.issue1142@psf.upfronthosting.co.za>
Content
Hi Guido

It is not the end of the file that fails to be read (see also below).

I first ran into this about a year ago, when I was parsing very large 
files produced by running "blast" on the human genome.
My parser choked after 4 GB, well before the end of the file: one line 
was missing and my acc=li[x:y] ended in an error, because acc was 
never filled...
This was rather strange, because it had never happened before on my 
Linux box.

I opened the file (which I had created myself) in an editor that can 
show hex codes: the expected line was there and intact.
If I remember correctly, I modified my code to see better what was going 
on: the missing line had in fact been concatenated onto the previous 
line, even though the end-of-line was present (the hex code was OK). 
See also below.
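
One way to repeat this check without a hex editor is to count newline 
bytes in binary mode, where no end-of-line translation takes place, and 
compare that with the number of lines text-mode iteration returns. A 
minimal sketch in Python 3 syntax; the function names and the 1 MiB 
chunk size are illustrative and not part of the original parser:

```python
def count_newlines_binary(path, chunk_size=1 << 20):
    # Count newline (0x0A) bytes by reading raw chunks; binary mode
    # performs no end-of-line translation.
    total = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b'\n')
    return total


def count_lines_text(path):
    # Count lines the way a plain "for line in file" loop sees them.
    with open(path) as f:
        return sum(1 for _ in f)


def long_lines(path, max_len):
    # Yield (index, length) for every text-mode line longer than max_len
    # characters; two lines merged into one would show up here.
    with open(path) as f:
        for i, line in enumerate(f):
            if len(line) > max_len:
                yield i, len(line)
```

If the two counts differ for the same file, the difference is the number 
of line endings that the text-mode reader did not treat as line breaks.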

I forgot about it because nobody replied to my mails, and I thought it 
might be related to 32-bit Windows. I recently moved to 64-bit Windows 
(Windows has the best drivers for SQL databases) and forgot about the 
bug until I ran into it again. I then decided to try Python 3k: it reads 
files >4 GB with no trouble, but it is very slow at both reading and 
writing files.
The following code produces a file of either <4 GB or >4 GB, depending 
on which fichout.write line is commented out.
Both files have the same number of lines, but the >4 GB one is not read 
completely under Windows (32 or 64 bit).
I have no such problem on Linux or BSD (Mac).

Python 3k on Windows reads both files OK, but is very, very slow (change 
xrange to range; I guess it is presumptuous of me to advise you about 
that :-).

best
Richard

import sys
import time

print(sys.version_info)
print(time.strftime('%Y-%m-%d %H:%M:%S'))

# Write phase: 85014961 numbered lines to test.txt
start = time.time()
fichout = open('test.txt', 'w')
for i in xrange(85014961):  # use range() on Python 3k
    if i % 5000000 == 0 and i > 0:
        print(i, time.time() - start)
    fichout.write(str(i) + ' ' * 59 + '\n')  # big file (>4 GB)
    #fichout.write(str(i) + '\n')            # small file (<4 GB), same number of lines
    fichout.flush()
fichout.close()
print('total lines written ', i)  # i is the last index, so the count is i + 1
print(i, time.time() - start)
print('*' * 50)

# Read phase: iterate over the file line by line
fichin = open('test.txt')
start3 = time.time()
for i, li in enumerate(fichin):
    if i % 5000000 == 0 and i > 0:
        print(i, time.time() - start3)
fichin.close()
print('total lines read ', i)  # again the last index, not the count
print(time.time() - start)

> Richard, can you somehow view the end of the file to see what its last
> lines actually are?  It should end like this:
>
> 85014951
> 85014952
> 85014953
> 85014954
> 85014955
> 85014956
> 85014957
> 85014958
> 85014959
> 85014960
>
>   

Viewed in a text editor, the file ends with:
85014944                                                          
85014945                                                          
85014946                                                          
85014947                                                          
85014948                                                          
85014949                                                          
85014950                                                          
85014951                                                          
85014952                                                          
85014953                                                          
85014954                                                          
85014955                                                          
85014956                                                          
85014957                                                          
85014958                                                          
85014959                                                          
85014960                                                          

On Windows with Python 2.5, adding this to the read loop:

    if i>85014940:
        print i, li.strip()

prints:
(2, 5, 0, 'final', 0)
2007-09-11 07:58:47
(5000000, 2.6720001697540283)
(10000000, 5.375)
(15000000, 8.0320000648498535)
(20000000, 10.703000068664551)
(25000000, 13.375)
(30000000, 16.047000169754028)
(35000000, 18.703000068664551)
(40000000, 21.360000133514404)
(45000000, 24.032000064849854)
(50000000, 26.687999963760376)
(55000000, 29.360000133514404)
(60000000, 32.032000064849854)
(65000000, 34.703000068664551)
(70000000, 37.407000064849854)
(75000000, 40.094000101089478)
(80000000, 42.797000169754028)
(85000000, 45.485000133514404)
85014941 85014951                                                          
85014942 85014952                                                          
85014943 85014953                                                          
85014944 85014954                                                          
85014945 85014955                                                          
85014946 85014956                                                          
85014947 85014957                                                          
85014948 85014958                                                          
85014949 85014959                                                          
85014950 85014960  

==> the line read at index 85014950 contains 85014960, so the 10 missing 
lines are from within the file, not from its end

Now add this to the read loop: if len(li)>80: print li.strip()

(2, 5, 0, 'final', 0)
2007-09-11 08:08:16
(5000000, 3.1559998989105225)
(10000000, 6.3280000686645508)
(15000000, 9.4839999675750732)
(20000000, 12.655999898910522)
(25000000, 15.843999862670898)
(30000000, 19.016000032424927)
(35000000, 22.187999963760376)
(40000000, 25.358999967575073)
(45000000, 28.530999898910522)
(50000000, 31.703000068664551)
(55000000, 34.858999967575073)
(60000000, 38.030999898910522)
* 62410138                                                           
62410139 *
* 62414887                                                           
62414888 *
* 62415540                                                           
62415541 *
* 62420289                                                           
62420290 *
* 62420942                                                           
62420943 *
* 62421595                                                           
62421596 *
* 62422248                                                           
62422249 *
* 62422901                                                           
62422902 *
* 62427650                                                           
62427651 *
* 62428303                                                           
62428304 *
(65000000, 41.233999967575073)
(70000000, 44.437999963760376)
(75000000, 47.625)
(80000000, 50.828000068664551)
(85000000, 54.016000032424927)
('total lines read ', 85014950)
54.0309998989

==> the end-of-line was not recognized for 10 lines in the middle of the 
file! (NTFS file system)
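
As a rough cross-check of where these merged lines sit, the byte offset 
of each line in the big file can be worked out from the write loop: line 
i takes len(str(i)) digits, 59 spaces and one end-of-line. The sketch 
below assumes the end-of-line occupies two bytes (CR LF), since the file 
was written in text mode on Windows; the helper names are illustrative. 
Under that assumption the 2**32-byte (4 GiB) mark falls inside line 
62406933, about 3200 lines before the first merged pair shown above 
(62410138). This is only an estimate of position, not a diagnosis.

```python
def total_digits(n):
    """Total number of decimal digits in str(0), str(1), ..., str(n - 1)."""
    total = 0
    width = 1
    while True:
        # Integers with `width` digits run from `start` up to 10**width - 1.
        start = 0 if width == 1 else 10 ** (width - 1)
        if n <= start:
            return total
        total += (min(n, 10 ** width) - start) * width
        width += 1


def line_start(n, pad=59, eol_len=2):
    """Byte offset at which line n begins (equally, the size of n lines)."""
    return total_digits(n) + n * (pad + eol_len)


def line_at_offset(offset, pad=59, eol_len=2):
    """Index of the line that contains the given byte offset."""
    lo, hi = 0, 1
    while line_start(hi, pad, eol_len) <= offset:
        hi *= 2
    # Invariant: line_start(lo) <= offset < line_start(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if line_start(mid, pad, eol_len) <= offset:
            lo = mid
        else:
            hi = mid
    return lo


N_LINES = 85014961  # same count as xrange(85014961) in the script
print(line_start(N_LINES))         # size of the big file, assuming CR LF
print(line_start(N_LINES, pad=0))  # size of the small file, assuming CR LF
print(line_at_offset(2 ** 32))     # line containing the 4 GiB mark
```

Passing eol_len=1 gives the corresponding figures for a file written 
with bare LF line endings, as on Linux or BSD.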

best
Richard