classification
Title: TextIOWrapper.tell extremely slow
Type: performance Stage: resolved
Components: IO Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Laurens, amaury.forgeotdarc, dsm001, eric.smith, pitrou
Priority: normal Keywords: patch

Created on 2011-02-04 08:52 by Laurens, last changed 2011-02-25 20:29 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
tell v01.py Laurens, 2011-02-04 10:40 example program
textiotell.patch pitrou, 2011-02-04 18:10
textiotell2.patch pitrou, 2011-02-04 20:06
Messages (15)
msg127874 - (view) Author: Laurens (Laurens) Date: 2011-02-04 08:52
file.tell() has become extremely slow in version 3.2, both rc1 and rc2. This problem did not exist in version 2.7.1, nor in version 3.1. It could be reproduced both on mac and windows xp.
msg127877 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2011-02-04 09:39
Do you have a benchmark program you can post?
msg127881 - (view) Author: Laurens (Laurens) Date: 2011-02-04 10:40
Correction: the problem also exists in version 3.1. I created a benchmark program an ran it on my machine (iMac, snow leopard 10.6), with the following results:

------------------------------------------
2.6.6 (r266:84292, Dec 30 2010, 09:20:14) 
[GCC 4.2.1 (Apple Inc. build 5664)]
result: 0.0009 s.
------------------------------------------
2.7.1 (r271:86832, Jan 13 2011, 07:38:03) 
[GCC 4.2.1 (Apple Inc. build 5664)]
result: 0.0008 s.
------------------------------------------
3.1.3 (r313:86882M, Nov 30 2010, 09:55:56) 
[GCC 4.0.1 (Apple Inc. build 5494)]
result: 9.5682 s.
------------------------------------------
3.2rc2 (r32rc2:88269, Jan 30 2011, 14:30:28) 
[GCC 4.2.1 (Apple Inc. build 5664)]
result: 8.3531 s.

Removing the line containing "tell" gives the following results:

------------------------------------------
2.6.6 (r266:84292, Dec 30 2010, 09:20:14) 
[GCC 4.2.1 (Apple Inc. build 5664)]
result: 0.0007 s.
------------------------------------------
2.7.1 (r271:86832, Jan 13 2011, 07:38:03) 
[GCC 4.2.1 (Apple Inc. build 5664)]
result: 0.0006 s.
------------------------------------------
3.1.3 (r313:86882M, Nov 30 2010, 09:55:56) 
[GCC 4.0.1 (Apple Inc. build 5494)]
result: 0.0093 s.
------------------------------------------
3.2rc2 (r32rc2:88269, Jan 30 2011, 14:30:28) 
[GCC 4.2.1 (Apple Inc. build 5664)]
result: 0.0007 s.


(Apparently, reading a file became a lot faster from 3.1 to 3.2.)

Conclusion: Execution of file.tell() makes the program about 10000 times slower.

Remark: the file mdutch.txt is a dummy text file containing 1000 lines with one word on each line.
msg127889 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2011-02-04 13:53
I found that adding "infile._CHUNK_SIZE = 20" makes the test much faster - 'only' 5 times slower than 2.7.
msg127892 - (view) Author: DSM (dsm001) Date: 2011-02-04 14:02
With a similar setup (OS X 10.6) I see the same problem.  It seems to go away if the file is opened in binary mode for reading.  @Laurens, can you confirm?
msg127893 - (view) Author: DSM (dsm001) Date: 2011-02-04 14:04
(By "go away" I mean "stop being pathological", not "stop differing": I still see a factor of 2.)
msg127894 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-04 14:10
That's expected. seek() and tell() on text (unicode) files are slow by construction. You should open your file in binary mode instead, if you want to do any seeking.

Maybe I should add a note in http://docs.python.org/dev/library/io.html#performance
msg127901 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-04 16:26
That said, I think it is possible to make algorithmic improvements to TextIOWrapper.tell() so that at least performance becomes acceptable.
msg127907 - (view) Author: Laurens (Laurens) Date: 2011-02-04 16:54
First of all, thanks to all for your cooperation, it is very much appreciated.

I made some minor changes to the benchmark program. Conclusions are:

 * setting file._CHUNK_SIZE to 20 has a dramatic effect, changing execution time in 3.2rc2 from 8.4s to 0.06s, so more than a factor 100 improvement. It's a bit clumsy, but acceptable as a performance work around.

 * opening file binary has a dramatic effect as well, I would say. After 2 minutes I stopped execution of the program, concluding that this change made the program at least a factor 10 *slower* instead of faster. So I cannot confirm DSM's statement that the performance hit would be a factor 2. Instead, I see a performance hit of at least a factor 100000 (10e5) compared tot 2.7.1, which is presumably not by construction ;-).
msg127908 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-04 16:57
> opening file binary has a dramatic effect as well, I would say. After 2 
> minutes I stopped execution of the program

Hint: b'' is not equal to '' ;)
msg127914 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-04 18:10
Here is a proof-of-concept patch for the pure Python version of TextIOWrapper.tell(). It turns the O(CHUNK_SIZE) operation into an O(1) operation most of time (still O(CHUNK_SIZE) worst-case - weird decoders and/or crazy input).
msg127915 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-04 18:19
> Here is a proof-of-concept patch for the pure Python version of
> TextIOWrapper.tell(). It turns the O(CHUNK_SIZE) operation into an
> O(1) operation most of time (still O(CHUNK_SIZE) worst-case - weird
> decoders and/or crazy input).

Actually, that's wrong. The patch is still O(CHUNK_SIZE) but with a
vastly different multiplier (and, optimistically, much smaller, as
common codecs are codecs in C).
msg127933 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-04 20:06
New patch also optimizing the C version. tell() can be more than 100x faster now (still much slower than binary tell()).
msg127934 - (view) Author: Laurens (Laurens) Date: 2011-02-04 20:09
All,

thanks for your help. Opening the file in binary mode worked immediately in the toy program (that is, the benchmark code I sent you). (Antoine, thanks for the hint.) In my real world program, I solved the problem by reading a line from a binary input file, and decode it explicitly and immediately to a string. The performance has become acceptable now, and I propose to close this issue.


Thanks again, cheers,

Laurens
msg129418 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-25 20:29
Committed in r88607 (3.3).
History
Date User Action Args
2011-02-25 20:29:37pitrousetstatus: open -> closed
nosy: amaury.forgeotdarc, pitrou, dsm001, eric.smith, Laurens
messages: + msg129418

resolution: accepted -> fixed
stage: patch review -> resolved
2011-02-04 20:09:29Laurenssetresolution: accepted
messages: + msg127934
nosy: amaury.forgeotdarc, pitrou, dsm001, eric.smith, Laurens
2011-02-04 20:06:01pitrousetfiles: + textiotell2.patch

messages: + msg127933
nosy: amaury.forgeotdarc, pitrou, dsm001, eric.smith, Laurens
stage: needs patch -> patch review
2011-02-04 18:20:00pitrousetnosy: amaury.forgeotdarc, pitrou, dsm001, eric.smith, Laurens
messages: + msg127915
2011-02-04 18:10:04pitrousetfiles: + textiotell.patch

messages: + msg127914
keywords: + patch
nosy: amaury.forgeotdarc, pitrou, dsm001, eric.smith, Laurens
2011-02-04 16:57:11pitrousetnosy: amaury.forgeotdarc, pitrou, dsm001, eric.smith, Laurens
messages: + msg127908
2011-02-04 16:54:35Laurenssetnosy: amaury.forgeotdarc, pitrou, dsm001, eric.smith, Laurens
messages: + msg127907
2011-02-04 16:26:08pitrousetversions: + Python 3.3, - Python 3.2
nosy: amaury.forgeotdarc, pitrou, dsm001, eric.smith, Laurens
title: file.tell extremely slow -> TextIOWrapper.tell extremely slow
messages: + msg127901

type: performance
stage: needs patch
2011-02-04 14:10:49pitrousetnosy: amaury.forgeotdarc, pitrou, dsm001, eric.smith, Laurens
messages: + msg127894
2011-02-04 14:04:58dsm001setnosy: amaury.forgeotdarc, pitrou, dsm001, eric.smith, Laurens
messages: + msg127893
2011-02-04 14:02:28dsm001setnosy: + dsm001
messages: + msg127892
2011-02-04 13:53:36amaury.forgeotdarcsetnosy: + amaury.forgeotdarc, pitrou
messages: + msg127889
2011-02-04 10:40:18Laurenssetfiles: + tell v01.py
nosy: eric.smith, Laurens
messages: + msg127881
2011-02-04 09:39:55eric.smithsetnosy: + eric.smith
messages: + msg127877
2011-02-04 08:52:36Laurenscreate