Message 52086 - Python tracker


Author	florianfesti
Date	2007-03-07.17:57:08

Content
This patch improves the read performance of the gzip module. We have seen improvements ranging from 20% when reading big blocks up to a factor of 50 when reading 4-byte blocks. Additionally, this patch removes the need to seek within the file, which allows using streams like stdin as input.

Details:

The whole read()/readline() stack has been rewritten. The core part is a new input buffer. It consists of a list of strings (self.buffer), an offset marking what has already been consumed from the first string (self.pos), and the length of the (still consumable) buffer (self.bufferlen). This makes adding to and removing from the buffer cheap. It turned out that removing from the old buffer was killing performance, because reading a single byte required copying the whole buffer. Reading a 2k buffer in single bytes meant copying 2MB.
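To illustrate the buffer layout described above, here is a minimal sketch, not the patch's actual code. The attribute names buffer, pos and bufferlen come from the description; the class name ChunkBuffer, the append() method and the use of Python 3 bytes are assumptions made for this example.

```python
class ChunkBuffer:
    """List-of-chunks input buffer (illustrative sketch only)."""

    def __init__(self):
        self.buffer = []     # pending chunks, oldest first
        self.pos = 0         # bytes already consumed from self.buffer[0]
        self.bufferlen = 0   # total bytes still consumable

    def append(self, data):
        # Adding is cheap: keep a reference, no concatenation.
        if data:
            self.buffer.append(data)
            self.bufferlen += len(data)

    def read(self, size):
        """Return up to `size` bytes, copying only the bytes handed out."""
        size = min(size, self.bufferlen)
        parts = []
        remaining = size
        while remaining > 0:
            first = self.buffer[0]
            available = len(first) - self.pos
            if available <= remaining:
                # The first chunk is used up: take its tail and drop it.
                parts.append(first[self.pos:])
                self.buffer.pop(0)
                self.pos = 0
                remaining -= available
            else:
                # Take a slice and advance the offset; the chunk stays put.
                parts.append(first[self.pos:self.pos + remaining])
                self.pos += remaining
                remaining = 0
        self.bufferlen -= size
        return b"".join(parts)


buf = ChunkBuffer()
buf.append(b"hello")
buf.append(b"world")
print(buf.read(3), buf.read(4), buf.read(10))
```

If the 2MB figure is read as "each one-byte read copies the rest of a 2048-byte buffer", it corresponds to 2047 + 2046 + ... + 0 = 2047 * 1024 = 2,096,128 bytes. With the offset-based layout, a one-byte read only slices one byte. In this sketch list.pop(0) is linear in the number of chunks; a collections.deque would avoid that if many small chunks accumulate.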

readline() no longer uses read() but works directly on the buffer. This removes a whole layer of string copying and concatenation.
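A rough sketch of what scanning the buffer directly for a newline could look like. This is an assumption-laden illustration rather than the patch itself: the class name LineBuffer is hypothetical, Python 3 bytes are used, and refilling the buffer from the decompressor when no newline is found is left out.

```python
class LineBuffer:
    """Chunk list with an offset into the first chunk (illustrative only)."""

    def __init__(self, chunks):
        self.buffer = [c for c in chunks if c]  # skip empty chunks
        self.pos = 0                            # consumed bytes of buffer[0]

    def readline(self):
        parts = []
        while self.buffer:
            first = self.buffer[0]
            end = first.find(b"\n", self.pos)
            if end >= 0:
                # Newline found: slice up to and including it.
                parts.append(first[self.pos:end + 1])
                self.pos = end + 1
                if self.pos == len(first):
                    # Chunk fully consumed, drop it.
                    self.buffer.pop(0)
                    self.pos = 0
                break
            # No newline in this chunk: take the rest and move on.
            parts.append(first[self.pos:])
            self.buffer.pop(0)
            self.pos = 0
        # Only the pieces of the line itself are joined, once.
        return b"".join(parts)


lines = LineBuffer([b"ab", b"c\nde", b"f\n", b"tail"])
print(lines.readline(), lines.readline(), lines.readline())
```

The search runs over the existing chunks in place, so no intermediate string is built by repeated read() calls before the line is assembled.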

To remove the need for seeking, a new read-only file-like class (PaddedFile) is used. It simply prepends a string to a file and falls back to the file's read method once the string has been used up.
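The idea can be sketched as follows. Only the class name PaddedFile is taken from the description; the constructor signature, the private attribute names and the use of Python 3 bytes are assumptions, and size is assumed to be a non-negative integer.

```python
import io


class PaddedFile:
    """Read-only wrapper that serves `prepend` before the file's own data.

    Illustrative sketch; the class in the patch may differ in detail.
    """

    def __init__(self, prepend, fileobj):
        self._prepend = prepend   # bytes to hand out first
        self._pos = 0             # how much of _prepend was consumed
        self._file = fileobj

    def read(self, size):
        left = len(self._prepend) - self._pos
        if left <= 0:
            # Prepended data is used up: delegate to the underlying file.
            return self._file.read(size)
        if size <= left:
            # The request is satisfied entirely from the prepended data.
            start = self._pos
            self._pos += size
            return self._prepend[start:self._pos]
        # Drain the prepended data, then top up from the file.
        head = self._prepend[self._pos:]
        self._pos = len(self._prepend)
        return head + self._file.read(size - len(head))


# Bytes that were read too far can be "pushed back" without calling seek().
f = PaddedFile(b"abc", io.BytesIO(b"defgh"))
print(f.read(2), f.read(3), f.read(10))
```

Because the wrapper only ever calls read() on the underlying object, it works with non-seekable inputs such as pipes.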

There is probably still some room for tweaking the buffer sizes, as we kept this simple. But the large performance improvements show that we can't be far off.

Performance test program and results are attached.

History
Date	User	Action	Args
2007-08-23 15:57:20	admin	link	issue1675951 messages
2007-08-23 15:57:20	admin	create