Issue17440
Created on 2013-03-16 16:33 by gsingh, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Messages (14) | |||
---|---|---|---|
msg184330 - (view) | Author: Gurmeet Singh (gsingh) | Date: 2013-03-16 16:33 | |
1. The read mode is not the default mode as mentioned in the docs.python.org. In particular see the first Traceback below - "b" does not work (as it does in C though) and you are forced to use "rb" instead.
2. io.BufferedReader does not implement read1 (the last lines of trace below)
3. io.FileIO does not implements single OS system call on read() - instead reads a file until EOF i.e. ignores the arguments supplied to read() - larger arguments are slower to execute (see the read calls in the trace below).
_________________
>>> import io
>>> fl = io.FileIO('c:/temp9/Capability/Analyzing Data.mp4', 'r')
>>> byt = fl.read()
>>> len(byt)
70934549
>>> fl.close()
>>> fl = io.FileIO('c:/temp9/Capability/Analyzing Data.mp4', 'r')
>>> byt = fl.read(256)
>>> len(byt)
256
>>> byt = fl.read(512)
>>> len(byt)
512
>>> byt = fl.read(1024)
>>> len(byt)
1024
>>> byt = fl.read(4096)
>>> len(byt)
4096
>>> byt = fl.read(10240)
>>> len(byt)
10240
>>> len(fl.read(40960))
40960
>>> len(fl.read(102400))
102400
>>> len(fl.read(1048576))
1048576
>>> fl.close()
>>> fl = io.FileIO('c:/temp9/Capability/Analyzing Data.mp4', 'r')
>>> len(fl.read(70934549))
70934549
>>> len(fl.read(70934549))
0
>>> fl.close()
>>> fl = io.FileIO('c:/temp9/Capability/Analyzing Data.mp4', 'r')
>>> b = bytearray(70934549)
>>> fl.readinto(b)
70934549
>>> fl.close()
>>> fl = open ('c:/temp9/Capability/Analyzing Data.mp4', 'b', buffering = 0)
Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    fl = open ('c:/temp9/Capability/Analyzing Data.mp4', 'b', buffering = 0)
ValueError: Must have exactly one of create/read/write/append mode and at most one plus
>>> fl = open ('c:/temp9/Capability/Analyzing Data.mp4', 'rb', buffering = 0)
>>> type(fl)
<class '_io.FileIO'>
>>> cfl = io.FileIO('c:/temp9/Capability/Analyzing Data.mp4', 'r')
>>> type(cfl)
<class '_io.FileIO'>
>>> cfl.close()
>>> cfl = open ('c:/temp9/Capability/Analyzing Data.mp4', 'rb', buffering = -1)
>>> type(cfl)
<class '_io.BufferedReader'>
>>> io.DEFAULT_BUFFER_SIZE
8192
>>> len(fl.read(70934549))
70934549
>>> cfl.close()
>>> cfl = open ('c:/temp9/Capability/Analyzing Data.mp4', 'rb', buffering = -1)
>>> len(fl.read1(70934549))
Traceback (most recent call last):
  File "<pyshell#44>", line 1, in <module>
    len(fl.read1(70934549))
AttributeError: '_io.FileIO' object has no attribute 'read1'
>>> |
|||
msg184332 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2013-03-16 16:51 | |
> 1. The read mode is not the default mode as mentioned in the
> docs.python.org.
It is. If you don't mention a mode, the mode is "r" by default. But if you mention a mode, then you are required to specify one of "r", "w", "a".
> io.BufferedReader does not implement read1 (the last lines of trace
> below)
It does. You made a mistake in your experiment (you called read1() on a FileIO object, not a BufferedReader object).
> io.FileIO does not implements single OS system call on read() - instead
> reads a file until EOF i.e. ignores the arguments supplied to read()
Your experiments show otherwise, the argument supplied to read() is observed: if you call read(1024), at most 1024 bytes are returned, etc. It's only if you call read() without an argument that the file is being read until EOF. |
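The three points in this reply can be checked without the original 70 MB file. The following is only a rough sketch: the helper name `describe_open_modes` and the throwaway 64-byte temporary file are made up for illustration and are not part of the thread.

```python
import os
import tempfile


def describe_open_modes(path):
    """Collect a few facts about how open() treats its mode argument."""
    facts = {}

    # No mode given: open() falls back to "r" (text mode).
    with open(path) as f:
        facts["default_mode"] = f.mode

    # "b" alone is rejected: it has to be combined with r, w, a (or x).
    try:
        open(path, "b").close()
        facts["b_alone"] = "accepted"
    except ValueError:
        facts["b_alone"] = "ValueError"

    # buffering=0 in binary mode hands back the raw FileIO object.
    with open(path, "rb", buffering=0) as raw:
        facts["raw_type"] = type(raw).__name__
        facts["raw_has_read1"] = hasattr(raw, "read1")

    # Default buffering wraps the raw file in a BufferedReader, which has read1().
    with open(path, "rb") as buffered:
        facts["buffered_type"] = type(buffered).__name__
        facts["read1_len"] = len(buffered.read1(16))

    return facts


# Hypothetical sample file, so the sketch does not depend on any real path.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"x" * 64)

facts = describe_open_modes(sample_path)
os.remove(sample_path)
print(facts)
```

Keeping the raw and buffered objects in clearly named variables also avoids the `fl` / `cfl` mix-up from the original trace.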
|||
msg184368 - (view) | Author: Gurmeet Singh (gsingh) | Date: 2013-03-17 13:50 | |
Please consider following before making a decision:
__________
> io.BufferedReader does not implement read1 (the last lines of trace
> below)
>It does. You made a mistake in your experiment (you called read1() on a FileIO object, not a BufferedReader object).
Please see the following lines:
>>> cfl = open ('c:/temp9/Capability/Analyzing Data.mp4', 'rb', buffering = -1)
>>> type(cfl)
<class '_io.BufferedReader'>
According to me it is a _io.BufferedReader only and not just _io.FileIO (the base class). Please tell me if I am wrong here. |
|||
msg184369 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2013-03-17 14:01 | |
You called read1() on fl (a FileIO object) and not cfl (a BufferedReader object). Your fault for choosing confusing variable names :-)
>>> len(fl.read1(70934549))
Traceback (most recent call last):
  File "<pyshell#44>", line 1, in <module>
    len(fl.read1(70934549))
AttributeError: '_io.FileIO' object has no attribute 'read1'
Please try to call cfl.read1() and see whether it works (it should). |
|||
msg184370 - (view) | Author: Gurmeet Singh (gsingh) | Date: 2013-03-17 14:01 | |
Please consider following before making a decision:
>> io.FileIO does not implements single OS system call on read() - instead
>> reads a file until EOF i.e. ignores the arguments supplied to read()
>Your experiments show otherwise, the argument supplied to read() is
>observed: if you call read(1024), at most 1024 bytes are returned, etc.
>It's only if you call read() without an argument that the file is being
>read until EOF.
I said this because I saw the following in the docs:
>class io.RawIOBase
>read(n=-1)
>Read up to n bytes from the object and return them. As a convenience,
>if n is unspecified or -1, readall() is called. Otherwise, only one
>system call is ever made. Fewer than n bytes may be returned if the
>operating system call returns fewer than n bytes.
If only one system call is being made, then I think that fl.read(256) and fl.read(70934549) should take same amount of time to complete - assuming disk I/O is the time consuming factor in this operation (as compared to memory processing). I am only saying that instead of one system call being made - the whole size specified by read is being read (by multiple system calls - as it appears to me). Please tell me if I am wrong here. |
|||
msg184371 - (view) | Author: Gurmeet Singh (gsingh) | Date: 2013-03-17 14:02 | |
@Antoine - wait I will do it |
|||
msg184374 - (view) | Author: Gurmeet Singh (gsingh) | Date: 2013-03-17 14:09 | |
@Antoine It worked. I was wrong to say read1() was not implemented. Sorry. But please do consider other issues. |
|||
msg184375 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2013-03-17 14:11 | |
> If only one system call is being made, then I think that
> fl.read(256) and fl.read(70934549) should take same amount of time to
> complete - assuming disk I/O is the time consuming factor in this
> operation (as compared to memory processing).
What do you mean? Reading a large number of bytes will most certainly always be slower than reading a small number of bytes, even if it only takes one system call. You still have to copy the data from disk or filesystem buffers into userspace. A reasonable expectation is for read(N) to be O(N), but not O(1). You might want to check that by timing it with different N values. |
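One way to time read(N) for different N values, as suggested above, is sketched below. The helper `time_raw_reads`, the 2 MiB temporary file and the chosen sizes are assumptions made for illustration. A freshly written file is likely served from the OS cache, so absolute numbers will differ from a cold read off disk.

```python
import io
import os
import tempfile
import time


def time_raw_reads(path, sizes):
    """Return (requested size, bytes returned, seconds) for one raw read each."""
    rows = []
    for n in sizes:
        # Reopen for every measurement so each read starts at offset 0.
        with io.FileIO(path, "r") as raw:
            start = time.perf_counter()
            data = raw.read(n)
            elapsed = time.perf_counter() - start
        rows.append((n, len(data), elapsed))
    return rows


# Hypothetical 2 MiB sample file standing in for the video file in the thread.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(os.urandom(1024) * 2048)

sizes = [256, 4096, 65536, 1048576]
rows = time_raw_reads(sample_path, sizes)
os.remove(sample_path)

for requested, returned, seconds in rows:
    print(f"read({requested}): {returned} bytes in {seconds:.6f} s")
```

Single measurements like these are noisy. Repeating each read several times and keeping the minimum would give a steadier picture of how the time grows with N.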
|||
msg184376 - (view) | Author: Gurmeet Singh (gsingh) | Date: 2013-03-17 14:32 | |
I did the following to understand time taken for in memory copy:
1>>> fl = io.FileIO('c:/temp9/Capability/Analyzing Data.mp4', 'rb')
2>>> byt = fl.read(70934549)
3>>> byt2 = None
4>>> byt2 = byt[:]
5>>> fl.close()
6>>> fl = io.FileIO('c:/temp9/Capability/Analyzing Data.mp4', 'rb')
7>>> byt = fl.read(1)
I found that python interpreter blocked for negligible time on line 4 (and line 7), as compared to line 2. I assume that line 4 is a correct syntax for an in place memory copy. Therefore, multiple system calls could be taking place - This is how I assumed. Please suggest if I am incorrect. |
|||
msg184378 - (view) | Author: Gurmeet Singh (gsingh) | Date: 2013-03-17 14:34 | |
Sorry, typo in the last post - I meant "in memory - memory copy" not "in place memory copy". |
|||
msg184380 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2013-03-17 15:07 | |
Bytes objects are immutable, so trying to "copy" them doesn't copy anything actually (it's an optimization):
>>> b = b"x" *10
>>> id(b)
139720033059920
>>> b2 = b[:]
>>> id(b2)
139720033059920
FileIO.read() only calls the underlying read() once, you can check the implementation:
http://hg.python.org/cpython/file/8002f45377d4/Modules/_io/fileio.c#l703 |
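The difference between slicing an immutable bytes object and a mutable bytearray can be seen side by side. This is a small sketch with made-up variable names. Returning the same object for a full bytes slice is a CPython optimization rather than a language guarantee.

```python
# Full slice of an immutable bytes object: CPython returns the same object.
immutable = b"x" * 10
immutable_slice = immutable[:]
bytes_slice_is_same_object = immutable_slice is immutable

# Full slice of a mutable bytearray: a real copy has to be made.
mutable = bytearray(b"x" * 10)
mutable_slice = mutable[:]
bytearray_slice_is_same_object = mutable_slice is mutable

# Changing the copy leaves the original untouched, which confirms it is separate.
mutable_slice[0] = ord("y")
original_unchanged = mutable[0] == ord("x")

print(bytes_slice_is_same_object, bytearray_slice_is_same_object, original_unchanged)
```

So a timing of `byt[:]` on a bytes object measures no copying at all, whereas a bytearray slice does measure a memory-to-memory copy.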
|||
msg184383 - (view) | Author: Gurmeet Singh (gsingh) | Date: 2013-03-17 16:02 | |
Thanks for letting me know about the optimization. I trusted you that the system call is made once, though I looked up code to see if size of the read in buffer is being passed to the C routine. I should apologize though for raising this issue - since it is incorrect. But, I think you would be interested (out of CURIOSITY) in findings from the last experiment that I did to understand this issue:
1 >>> import io
2 >>> fl = io.FileIO('c:/temp9/Capability/Analyzing Data.mp4', 'rb')
3 >>> barr = bytearray(70934549)
4 >>> barr2= bytearray(70934549)
5 >>> id(barr)
29140440
6 >>> id(barr2)
26433560
7 >>> fl.readinto(barr)
70934549
8 >>> barr2 = barr[:]
9 >>> fl.close()
10 >>> fl = io.FileIO('c:/temp9/Capability/Analyzing Data.mp4', 'rb')
11 >>> barrt = bytearray(1)
12 >>> id(barrt)
34022512
13 >>> fl.readinto(barrt)
1
14 >>> fl.close()
>>>
The time of line 7 was much greater than line 13. It was also greater than 8 (but not that great as of 11). But I cannot say for sure that the time for line 13 plus line 8 is equal to or lesser than 7 - it looks lesser but needs more precise testing to say anything further. I tried to reason the situation as follows (for this I looked up the hyperlink that you gave). I feel that the underlying system call takes the size argument - so I guess that large value suggests the C compiler to make ask the disk subsystem to read up the longer data - hence it takes the time since disk access is slower. Thanks for your time. Sorry for the incorrect issue. |
|||
msg184385 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2013-03-17 16:09 | |
> The time of line 7 was much greater than line 13.
Well, yes, reading 70 MB is much longer than reading a single byte :-)
> I feel that the underlying system call takes the size argument
Indeed it does. It would be totally inefficient if it didn't.
> so I guess that large value suggests the C compiler to make ask the
> disk subsystem to read up the longer data - hence it takes the time
> since disk access is slower.
It's not the C compiler. It's the OS kernel which reads data from the disk when you ask to. |
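The size argument handed to the operating system can also be seen from Python through `os.read`, which reads at most the requested number of bytes from a file descriptor. The sketch below assumes a hypothetical 4096-byte temporary file. It is not how FileIO is implemented, only an illustration of a read request that carries an explicit size.

```python
import os
import tempfile

# Hypothetical 4096-byte sample file.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(bytes(range(256)) * 16)

# Reference copy of the whole file for comparison.
with open(sample_path, "rb") as f:
    expected = f.read()

# O_BINARY only exists on Windows; fall back to 0 elsewhere.
flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)
raw_fd = os.open(sample_path, flags)
try:
    # Ask the OS for at most 1024 bytes; it may return fewer.
    chunk = os.read(raw_fd, 1024)
finally:
    os.close(raw_fd)

os.remove(sample_path)
print(len(chunk), "bytes returned for a 1024-byte request")
```

The kernel decides how much data to fetch based on that size, which is why a 70 MB request takes far longer than a 1-byte request.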
|||
msg184386 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2013-03-17 16:09 | |
Anyway, I'm now closing the issue as invalid. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:42 | admin | set | github: 61642 |
2013-03-17 16:09:35 | pitrou | set | status: open -> closed messages: + msg184386 |
2013-03-17 16:09:08 | pitrou | set | messages: + msg184385 |
2013-03-17 16:02:19 | gsingh | set | messages: + msg184383 |
2013-03-17 15:07:48 | pitrou | set | messages: + msg184380 |
2013-03-17 14:34:17 | gsingh | set | messages: + msg184378 |
2013-03-17 14:32:36 | gsingh | set | messages: + msg184376 |
2013-03-17 14:11:17 | pitrou | set | messages: + msg184375 |
2013-03-17 14:09:36 | gsingh | set | messages: + msg184374 |
2013-03-17 14:02:13 | gsingh | set | messages: + msg184371 |
2013-03-17 14:01:20 | gsingh | set | messages: + msg184370 |
2013-03-17 14:01:06 | pitrou | set | messages: + msg184369 |
2013-03-17 13:50:23 | gsingh | set | status: pending -> open messages: + msg184368 |
2013-03-16 20:50:54 | amaury.forgeotdarc | set | status: open -> pending resolution: not a bug |
2013-03-16 16:51:03 | pitrou | set | messages: + msg184332 |
2013-03-16 16:41:52 | serhiy.storchaka | set | nosy: + pitrou, benjamin.peterson, stutzbach, hynek components: + IO |
2013-03-16 16:33:23 | gsingh | create |