
Author richardchristen
Date 2006-03-16.17:21:35
Content
I work on the human genome.
I extracted words from chromosomes using a suffix tree
(C compiled for 64-bit, run on a Sun machine with 300 GB of
RAM, since my suffix tree requires 150 GB of RAM for
chromosome 1, the largest one).

This produced some files larger than 5 GB, for example one
with 163763326 lines for chr 4, the one presently being analyzed.

Using Python 2.4.2 on a 32-bit Windows computer (1.5 GB
RAM), reading this file line by line with either

for li in file:
    pass  # do something with li

or

li=file.readline()
while li!='':
    li=file.readline()

I got problems, seemingly around the 4 GB boundary. From
the first problematic line onward, for some lines (not
all), li contained the correct content plus the first
word of the next line (see below).

As a result a simple
file1=open('1')
file2=open('2','w')
li=file1.readline()
while li!='':
    file2.write(li) 
    li=file1.readline()

produced a second file of only
163754385 lines (8941 fewer than the original).
The problem lines were "seemingly random", i.e. not
consecutive, and the last line was OK.
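
To check whether the two files really differ in line count independently of text-mode reading, the count can be repeated in binary mode. This is only a minimal sketch, not something I ran on the files above: `count_lines` is a hypothetical helper, the chunk size is arbitrary, and the `with` statement assumes a newer Python than 2.4. Binary mode ('rb') bypasses the newline translation that text mode performs on Windows; whether that changes the result here is untested.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, reading fixed-size binary chunks."""
    total = 0
    # 'rb' avoids any text-mode newline translation on Windows.
    handle = open(path, 'rb')
    try:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # empty result means end of file
                break
            total += chunk.count(b'\n')
    finally:
        handle.close()
    return total
```

Comparing count_lines('1') with count_lines('2') would then show whether lines are lost in the copy or only miscounted by the reader.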


The same code on the same file, but on my 64-bit dual-core
OS X machine, ran fine, despite using the default
Python 2.2.3 and despite "file Python" showing that it is
a Mach-O executable ppc, i.e. a 32-bit app.

Everything was run from the command line.


The first file looks like this:
...
TCAGCCACAGCAGAAAGTGA:\t33240 551212 751185
TCAGCCACAGCAGAAAGTGC:\t131324047
TCAGCCACAGCACTGTGTTA:\t61641912
....

The second file contains lines like this one:
TCAGCCACAGCAGAAAGTGC:\t131324047TCAGCCACAGCAGAAGAAGA:  

which is 'first line' + 'first word of next line'
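
The merged lines can be located automatically. The sketch below assumes, based only on the sample lines shown above, that every well-formed line contains exactly one ':' (after the word), so a line with more than one ':' is a merged one. `find_merged_lines` is a hypothetical helper name, not something from the original scripts.

```python
def find_merged_lines(lines):
    """Yield (line_number, line) for lines holding more than one 'WORD:' key.

    Assumption: a well-formed line has exactly one ':' character.
    """
    number = 0
    for line in lines:
        number += 1  # 1-based line numbers
        if line.count(':') > 1:
            yield number, line
```

It accepts any iterable of lines, so it can be fed an open file object directly, for example find_merged_lines(open('2')).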

PS1: There is no problem reading the big file with UEdit on
the Windows machine, so the OS itself is not the problem.
(Also, I transferred the big file from the Windows machine
to the Mac; if the file itself had been corrupted, it would
also have been corrupted on the Mac.)
PS2: I tried Python 2.3.5 on Windows, with the same
problem.
PS3: If needed, I can run the same test on a similar
file for chromosome 8, which is slightly below the 4
GB limit (3.99 GB).
PS4: I think I remember having done a similar parsing
on a single-CPU Linux Athlon 64 machine a month ago, with no trouble.
History
Date                 User   Action  Args
2007-08-23 14:38:32  admin  link    issue1451466 messages
2007-08-23 14:38:32  admin  create