Issue21638
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2014-06-02 13:39 by yanlinlin82, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Messages (6) | |||
---|---|---|---|
msg219586 - (view) | Author: Linlin Yan (yanlinlin82) | Date: 2014-06-02 13:39 | |
I noticed this problem when I run a Python2 program (MACS: http://liulab.dfci.harvard.edu/MACS/) very inefficiently on a large storage on a high performace server (64-bit Linux). It was much slower (more than two days) than running it on a normal PC (less than two hours). After ruling out many optimizing conditions, I finally located the problem on the seek() function of Python2. Now I can reproduce the problem in a very simple example: #!/usr/bin/python2 f = open("Input.sort.bam", "rb") f.seek(0, 2) f.close() Here, the size of file 'Input.sort.bam' is 4,110,535,920 bytes. When I run the program with 'strace' to see the system calls on Linux: $ strace python2 foo.py ... open("Input.sort.bam", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=4110535920, ...}) = 0 fstat(3, {st_mode=S_IFREG|0644, st_size=4110535920, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f23d4492000 fstat(3, {st_mode=S_IFREG|0644, st_size=4110535920, ...}) = 0 lseek(3, 4110532608, SEEK_SET) = 4110532608 read(3, "f\203\337<\334\350\313\315\345&T\227\211\fC\212a\260\204P\235\366\326\353\230\327>\373\361\221\357\373"..., 3312) = 3312 close(3) = 0 ... It seems that python2 just move file cursor to a specific position (4110532608 in this case) and read ahead the rest bytes, rather than seek to the file end directly. I tried to run the exact the same program on the large storage, the position changed to 1073741824, left 889310448 bytes to read to reach the file end, which reduced the performance a lot! |
|||
msg219637 - (view) | Author: STINNER Victor (vstinner) * | Date: 2014-06-02 20:35 | |
I don't think that Python calls directly read(). Python 2 uses fopen / fread. Python 3 doesn't use buffered files, but call open / read directly. |
|||
msg219654 - (view) | Author: Linlin Yan (yanlinlin82) | Date: 2014-06-03 01:35 | |
I agree that Python 2 should use fopen / fread rather than directly read(). But you may misunderstand this. The 'strace' tool reports Linux system calls, including read() rather than fread(), and I guess that read() should be finally called in fread() implementation. What I mean is that Python 2's seek(0, 2) does not use fseek(0, SEEK_END), but fseek(somewhere, SEEK_SET) and fread(rest-bytes) instead, which is too inefficient in some kind of storage. By the way, Python 3 does not behavior like this. On Tue, Jun 3, 2014 at 4:35 AM, STINNER Victor <report@bugs.python.org> wrote: > > STINNER Victor added the comment: > > I don't think that Python calls directly read(). Python 2 uses fopen / > fread. > > Python 3 doesn't use buffered files, but call open / read directly. > > ---------- > nosy: +haypo, neologix > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue21638> > _______________________________________ > |
|||
msg219709 - (view) | Author: Charles-François Natali (neologix) * | Date: 2014-06-03 18:35 | |
> I agree that Python 2 should use fopen / fread rather than directly read(). > But you may misunderstand this. The 'strace' tool reports Linux system > calls, including read() rather than fread(), and I guess that read() should > be finally called in fread() implementation. > > What I mean is that Python 2's seek(0, 2) does not use fseek(0, SEEK_END), > but fseek(somewhere, SEEK_SET) and fread(rest-bytes) instead, which is too > inefficient in some kind of storage. Actually, Python does use fopen(), and fseek(): the culprit is the libc: $ cat /tmp/test.c; gcc -o /tmp/test /tmp/test.c -Wall; strace /tmp/test open("/etc/fstab", O_RDONLY) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=809, ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77ae000 fstat64(3, {st_mode=S_IFREG|0644, st_size=809, ...}) = 0 _llseek(3, 0, [0], SEEK_SET) = 0 read(3, "# /etc/fstab: static file system"..., 809) = 809 close(3) = 0 > By the way, Python 3 does not behavior like this. That's because in Python 3, the IO stack is implemented directly on top of open()/read()/lseek(). It's not the first time we stumble upon glibc stdio bugs. I'd suggest closing this. |
|||
msg219715 - (view) | Author: Linlin Yan (yanlinlin82) | Date: 2014-06-03 18:54 | |
Thanks! I agree with that. |
|||
msg219716 - (view) | Author: Linlin Yan (yanlinlin82) | Date: 2014-06-03 18:59 | |
I ensured that the problem is in libc. I will try to figure out it by updating libc or optimizing some related parameters. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:58:04 | admin | set | github: 65837 |
2014-06-03 18:59:26 | yanlinlin82 | set | status: open -> closed resolution: third party messages: + msg219716 |
2014-06-03 18:54:34 | yanlinlin82 | set | messages: + msg219715 |
2014-06-03 18:35:01 | neologix | set | nosy:
+ pitrou messages: + msg219709 |
2014-06-03 01:35:27 | yanlinlin82 | set | messages: + msg219654 |
2014-06-02 20:35:17 | vstinner | set | nosy:
+ vstinner, neologix messages: + msg219637 |
2014-06-02 13:39:16 | yanlinlin82 | create |