classification
Title: reading large files
Type: behavior Stage:
Components: Tests Versions: Python 3.0
process
Status: closed Resolution: duplicate
Dependencies: Superseder:
Assigned To: Nosy List: Richard.Christen@unice.fr, jafo, loewis, music, pythonmeister
Priority: normal Keywords:

Created on 2007-09-10 12:45 by Richard.Christen@unice.fr, last changed 2007-09-18 04:36 by jafo. This issue is now closed.

Files
File name Uploaded Description
christen.vcf Richard.Christen@unice.fr, 2007-09-10 14:20
christen.vcf Richard.Christen@unice.fr, 2007-09-10 15:46
Messages (10)
msg55777 - (view) Author: christen (Richard.Christen@unice.fr) Date: 2007-09-10 12:45
September 11, 2007: I downloaded Py3k.

The good news:
Under Windows, Python 3k properly reads files larger than 4 GB (in
contrast to Python 2.5, which skips some lines; see below).

The bad news: Py3k is very slow compared to Python 2.5; see the results below.
The code, shown next, reads a 4.9 GB file of 81,017,719 lines (a GenBank
entry of bacterial sequences).

#######################
import time 
print (time.localtime())
fichin=open(r'D:\pythons\16s\total_gb_161_16S.gb')
t0= time.localtime()
print (t0)
i=0

for li in fichin:
	i+=1
	if i%1000000==0: 
		print (i,time.localtime())
	
fichin.close()
print ()
print (i)
print (time.localtime())
#########################


I got the following results on the same machine (Windows XP 64), using
either Py3k or Python 2.5.
As soon as my BSD and Linux machines are done with calculations, I will
try this on them.
Best
Richard Christen


python 3k

(2007, 9, 10, 13, 53, 36, 0, 253, 1)
(2007, 9, 10, 13, 53, 36, 0, 253, 1)
1000000 (2007, 9, 10, 13, 53, 49, 0, 253, 1)
2000000 (2007, 9, 10, 13, 54, 3, 0, 253, 1)
3000000 (2007, 9, 10, 13, 54, 18, 0, 253, 1)
4000000 (2007, 9, 10, 13, 54, 32, 0, 253, 1)
5000000 (2007, 9, 10, 13, 54, 47, 0, 253, 1)
....
77000000 (2007, 9, 10, 14, 14, 55, 0, 253, 1)
78000000 (2007, 9, 10, 14, 15, 9, 0, 253, 1)
79000000 (2007, 9, 10, 14, 15, 22, 0, 253, 1)
80000000 (2007, 9, 10, 14, 15, 36, 0, 253, 1)
81000000 (2007, 9, 10, 14, 15, 49, 0, 253, 1)

81017719    #this is the proper number of lines 
(2007, 9, 10, 14, 15, 50, 0, 253, 1)


Python 2.5

(2007, 9, 10, 14, 18, 33, 0, 253, 1)
(2007, 9, 10, 14, 18, 33, 0, 253, 1)
(1000000, (2007, 9, 10, 14, 18, 34, 0, 253, 1))
(2000000, (2007, 9, 10, 14, 18, 34, 0, 253, 1))
(3000000, (2007, 9, 10, 14, 18, 35, 0, 253, 1))
(4000000, (2007, 9, 10, 14, 18, 35, 0, 253, 1))
(5000000, (2007, 9, 10, 14, 18, 36, 0, 253, 1))
...
(77000000, (2007, 9, 10, 14, 19, 10, 0, 253, 1))
(78000000, (2007, 9, 10, 14, 19, 11, 0, 253, 1))
(79000000, (2007, 9, 10, 14, 19, 11, 0, 253, 1))
(80000000, (2007, 9, 10, 14, 19, 12, 0, 253, 1))
(81000000, (2007, 9, 10, 14, 19, 12, 0, 253, 1))
()
81014962      #python 2.5 missed some lines !!!!
(2007, 9, 10, 14, 19, 12, 0, 253, 1)
msg55778 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-09-10 14:04
If you would like to help resolve the issue with the missing lines,
please submit a separate report for that. It is very difficult to track
unrelated bugs in a single tracker issue. It would help if you could
determine which lines are missing, e.g. by writing out all lines and
then comparing the two files.
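
The comparison suggested here could be done along these lines. This is only a minimal sketch: the function name `first_difference` and the idea of dumping the lines seen by each interpreter into two separate files are assumptions for illustration, not something taken from the report.

```python
from itertools import zip_longest


def first_difference(path_a, path_b):
    """Return (line_number, line_a, line_b) for the first differing line, or None."""
    # Binary mode: no decoding and no newline translation can hide a difference.
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        pairs = zip_longest(file_a, file_b)
        for number, (line_a, line_b) in enumerate(pairs, start=1):
            if line_a != line_b:
                # line_a or line_b is None when one file ended early.
                return number, line_a, line_b
    return None
```

Because the files are iterated lazily, memory use stays small even for dumps of several gigabytes.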

If you want to compute runtimes, it is better to not convert them to
local time. Instead, use the pattern

start = time.time()
...
  print time.time()-start # seconds since the program started
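
Applied to the line-counting loop from the original report, the pattern might look like the sketch below. It is written for Python 3; the function name `timed_line_count` and the `report_every` parameter are hypothetical. On Python 3.3 and later, `time.perf_counter()` could be used in place of `time.time()`.

```python
import time


def timed_line_count(path, report_every=1000000):
    """Count the lines in `path`, printing elapsed seconds at regular intervals."""
    start = time.time()
    count = 0
    with open(path) as handle:
        for count, _line in enumerate(handle, start=1):
            if count % report_every == 0:
                # Seconds since the loop started, not a wall-clock tuple.
                print(count, time.time() - start)
    elapsed = time.time() - start
    return count, elapsed
```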
msg55779 - (view) Author: christen (Richard.Christen@unice.fr) Date: 2007-09-10 14:20
Hi Martin

I could certainly do that, but how would you get my huge files? 5 GB of data
is quite big...

> If you want to compute runtimes, it is better to not convert them to
> local time. Instead, use the pattern
>
> start = time.time()
> ...
>   print time.time()-start # seconds since the program started
>   

OK I'll do that next time

Richard
msg55781 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-09-10 14:28
> I could certainly do that, but how would you get my huge files? 5 GB of data
> is quite big...

[not sure what "that" is] I did not mean to suggest that you attach such
a large file. Instead, just report that as a separate bug report, and be
prepared to answer follow-up questions.

Regards,
Martin
msg55782 - (view) Author: Stefan Sonnenberg-Carstens (pythonmeister) Date: 2007-09-10 14:29
Perhaps this is an issue with line separators?
Could you provide the output of wc -l on a *NIX box?
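
Where no *NIX box is at hand, a similar count can be obtained from Python itself by reading the file in binary mode and counting LF bytes. The sketch below is an illustration only; the function name `count_newlines` and the chunk size are assumptions.

```python
def count_newlines(path, chunk_size=1 << 20):
    """Count LF bytes in `path`, which is what `wc -l` reports."""
    total = 0
    # Binary mode bypasses any text-mode newline handling.
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

If this number differs from the count obtained by iterating over the file in text mode, line-separator handling is a likely suspect.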
And, could you try with this code:

import sys
print(sys.version_info)
import time 
print (time.localtime())
fichin=open(r'D:\pythons\16s\total_gb_161_16S.gb')
start = time.time()
for i,li in enumerate(fichin):
    if i%1000000==0 and i>0: 
        print (i,start-time.time())
fichin.close()
print(i)
print(start-time.time())

Thx
msg55783 - (view) Author: Stefan Sonnenberg-Carstens (pythonmeister) Date: 2007-09-10 14:32
Sorry, this way:

import sys
print(sys.version_info)
import time 
print (time.strftime('%Y-%m-%d %H:%M:%S'))
fichin=open(r'D:\pythons\16s\total_gb_161_16S.gb')
start = time.time()
for i,li in enumerate(fichin):
    if i%1000000==0 and i>0: 
        print (i,time.time()-start)
fichin.close()
print(i)
print(time.time()-start)
msg55784 - (view) Author: christen (Richard.Christen@unice.fr) Date: 2007-09-10 15:46
Hi Stefan

Calculations are underway.
Both reading and writing work poorly with Py3k.

You can try the code below on your own machine, using one of these two lines:
    fichout.write(str(i)+' '*59+'\n')  #generates a big file
    fichout.write(str(i)+'\n')   #generates a file < 4 GB

The big file is not read properly with Python 2.5 (the small one is).
The big file is slow to write and to read with Py3k.

I will send you the results as soon as the run is done under Py3k (very, very slow indeed).

best
r

import sys
print(sys.version_info)
import time
print (time.strftime('%Y-%m-%d %H:%M:%S'))
liste=[]
start = time.time()
fichout=open('test.txt','w')
for i in xrange(85014961):
    if i%5000000==0 and i>0:
        print (i,time.time()-start)
    fichout.write(str(i)+' '*59+'\n')
fichout.close()
print ('total lines written ',i)
print (i,time.time()-start)
print ('*'*50)
fichin=open('test.txt')
start3 = time.time()
for i,li in enumerate(fichin):
    if i%5000000==0 and i>0:
        print (i,time.time()-start3)
fichin.close()
print ('total lines read ',i)
print(time.time()-start)
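
The two `fichout.write` variants can be checked against the 4 GB boundary without writing anything to disk. The sketch below computes the expected file size arithmetically; the function name `total_bytes` is hypothetical, and `newline_len=1` assumes a bare LF per line (on Windows in text mode each line ending is written as CR LF, so `newline_len=2` would apply there).

```python
def total_bytes(n_lines, padding, newline_len=1):
    """Size of a file whose line i is str(i), `padding` spaces, and a newline."""
    total = 0
    digits = 1
    low = 0  # first integer written with `digits` digits (0 counts as one digit)
    while low < n_lines:
        high = min(10 ** digits, n_lines)  # one past the last such integer
        total += (high - low) * (digits + padding + newline_len)
        low = high
        digits += 1
    return total


big = total_bytes(85014961, padding=59)   # str(i)+' '*59+'\n'
small = total_bytes(85014961, padding=0)  # str(i)+'\n'
print(big > 2 ** 32, small < 2 ** 32)
```

Note that the script above reports `i` after the loop, which is the last index (85014960), while the number of lines written is 85014961.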
msg55831 - (view) Author: Ben Beasley (music) Date: 2007-09-11 18:33
I ran Richard Christen's script from msg55784 on Ubuntu Feisty Fawn
(64-bit) with both Python 2.5.1 and Python 3.0a1 (for the latter, I had
to change xrange to range).

(2, 5, 1, 'final', 0)
2007-09-11 11:39:08
(5000000, 7.3925600051879883)
(10000000, 15.068881034851074)
(15000000, 22.870260953903198)
(20000000, 30.588511943817139)
(25000000, 37.977153062820435)
(30000000, 45.393024921417236)
(35000000, 57.039968013763428)
(40000000, 71.122976064682007)
(45000000, 85.065402984619141)
(50000000, 97.03105092048645)
(55000000, 108.22125887870789)
(60000000, 122.95617389678955)
(65000000, 130.45936799049377)
(70000000, 141.0406129360199)
(75000000, 150.52000093460083)
(80000000, 158.0419979095459)
(85000000, 168.46517896652222)
('total lines written ', 85014960)
(85014960, 168.48725986480713)
**************************************************
(5000000, 11.699964046478271)
(10000000, 18.510161876678467)
(15000000, 27.110308885574341)
(20000000, 35.410284996032715)
(25000000, 41.88045597076416)
(30000000, 48.734965085983276)
(35000000, 56.416620016098022)
(40000000, 65.14509105682373)
(45000000, 73.711935043334961)
(50000000, 82.278150081634521)
(55000000, 90.984658002853394)
(60000000, 99.987648963928223)
(65000000, 104.64127588272095)
(70000000, 109.73277306556702)
(75000000, 114.78491401672363)
(80000000, 120.38562488555908)
(85000000, 126.08317303657532)
('total lines read ', 85014960)
294.583214998




(3, 0, 0, 'alpha', 1)
2007-09-11 12:20:53
5000000 117.375117064
10000000 238.183109045
15000000 357.397506952
20000000 476.816791058
25000000 597.198447943
30000000 717.393661976
35000000 837.278333902
40000000 956.919227839
45000000 1077.25333095
50000000 1196.60731292
55000000 1316.08601999
60000000 1434.81360602
65000000 1554.1584239
70000000 1673.04580498
75000000 1792.35387397
80000000 1912.65659904
85000000 2032.99598598
total lines written  85014960
85014960 2033.35042787
**************************************************
5000000 89.7920100689
10000000 180.910079002
15000000 272.628970146
20000000 364.904497147
25000000 457.229861021
30000000 549.14190793
35000000 641.054435968
40000000 733.30577898
45000000 826.058191061
50000000 917.997677088
55000000 1010.20616603
60000000 1102.142905
65000000 1194.16728902
70000000 1286.54789495
75000000 1378.50006604
80000000 1470.37746692
85000000 1562.25738001
total lines read  85014960
3595.88338494
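
To put the two runs side by side, the elapsed times at the 85,000,000-line mark can be turned into slowdown factors. This is a small sketch using only figures copied from the output above; the helper name `slowdown` is hypothetical.

```python
# Elapsed seconds at the 85,000,000-line mark, copied from the output above.
write_seconds = {"2.5.1": 168.46517896652222, "3.0a1": 2032.99598598}
read_seconds = {"2.5.1": 126.08317303657532, "3.0a1": 1562.25738001}


def slowdown(timings):
    """How many times longer Python 3.0a1 took than Python 2.5.1."""
    return timings["3.0a1"] / timings["2.5.1"]


for label, timings in (("write", write_seconds), ("read", read_seconds)):
    print(label, "slowdown: %.1fx" % slowdown(timings))
```

Both phases come out roughly 12 times slower under 3.0a1 on this machine.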
msg55832 - (view) Author: Ben Beasley (music) Date: 2007-09-11 18:39
See the BDFL's comment in msg55828: "I know Py3k text I/O is very slow;
it's written in Python and uses UTF-8 as the default encoding.  We've got
a summer of code student working on accelerating this.  (And if he doesn't
finish we have another year to work on it before 3.0final is released.)"
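
One way to see how much of the cost comes from the text layer (decoding and newline handling) rather than from the disk is to iterate over the same file once in binary mode and once in text mode. The sketch below is an assumption-laden illustration, not the benchmark used in this thread: the helper names, the temporary file, and the default of 200,000 lines are all hypothetical. On current CPython releases, where the `io` module is implemented in C, the gap should be far smaller than the 3.0a1 figures above suggest.

```python
import os
import tempfile
import time


def time_iteration(path, mode, **kwargs):
    """Return (line_count, elapsed_seconds) for one pass over the file."""
    start = time.time()
    count = 0
    with open(path, mode, **kwargs) as handle:
        for _line in handle:
            count += 1
    return count, time.time() - start


def compare_modes(n_lines=200000):
    """Write a padded test file, then read it in binary and in text mode."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as out:
            for i in range(n_lines):
                out.write(str(i).encode("ascii") + b" " * 59 + b"\n")
        binary = time_iteration(path, "rb")
        text = time_iteration(path, "r", encoding="utf-8", newline="")
        return binary, text
    finally:
        os.remove(path)
```

Calling `compare_modes()` returns two `(count, seconds)` pairs; the counts should match, and the difference in seconds is the overhead of the text layer on that interpreter.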
msg55988 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-09-18 04:36
I'm closing this because the slow I/O issue is known and expected to be
resolved as part of the Python 3.0 development.  The Windows problems
with missing lines should be opened as a separate issue.
History
Date User Action Args
2007-09-18 04:36:00 jafo set status: open -> closed
nosy: + jafo
resolution: duplicate
messages: + msg55988
2007-09-11 18:39:49 music set messages: + msg55832
2007-09-11 18:33:46 music set nosy: + music
messages: + msg55831
2007-09-10 15:46:27 Richard.Christen@unice.fr set files: + christen.vcf
messages: + msg55784
2007-09-10 14:32:24 pythonmeister set messages: + msg55783
2007-09-10 14:29:24 pythonmeister set nosy: + pythonmeister
messages: + msg55782
2007-09-10 14:28:44 loewis set messages: + msg55781
2007-09-10 14:20:50 Richard.Christen@unice.fr set files: + christen.vcf
messages: + msg55779
2007-09-10 14:04:50 loewis set nosy: + loewis
messages: + msg55778
2007-09-10 12:45:03 Richard.Christen@unice.fr create