Rewrite the IO stack in C #48815

ialbert · 2008-12-06T15:11:00Z

BPO	4565
Nosy	@rhettinger, @gpshead, @amauryfa, @pitrou, @giampaolo, @tiran, @devdanzin, @benjaminp
Dependencies	bpo-4263: BufferedWriter non-blocking overage bpo-4862: utf-16 BOM is not skipped after seek(0) bpo-4967: Bugs in _ssl object read() when a buffer is specified bpo-4996: io.TextIOWrapper calls buffer.read1() bpo-5006: Duplicate UTF-16 BOM if a file is open in append mode

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2009-03-04.21:29:08.414>
created_at = <Date 2008-12-06.15:10:59.693>
labels = ['extension-modules', 'library', 'performance']
title = 'Rewrite the IO stack in C'
updated_at = <Date 2009-03-04.21:30:46.095>
user = 'https://bugs.python.org/ialbert'

bugs.python.org fields:

activity = <Date 2009-03-04.21:30:46.095>
actor = 'benjamin.peterson'
assignee = 'none'
closed = True
closed_date = <Date 2009-03-04.21:29:08.414>
closer = 'benjamin.peterson'
components = ['Extension Modules', 'Library (Lib)']
creation = <Date 2008-12-06.15:10:59.693>
creator = 'ialbert'
dependencies = ['4263', '4862', '4967', '4996', '5006']
files = []
hgrepos = []
issue_num = 4565
keywords = []
message_count = 43.0
messages = ['77132', '77138', '77158', '77170', '77171', '77180', '78096', '80092', '82118', '82450', '82474', '82475', '82490', '82491', '82493', '82494', '82501', '82502', '82503', '82507', '82508', '82509', '82542', '82576', '82580', '82590', '82608', '82611', '82614', '82639', '82642', '82657', '82671', '82675', '82705', '82713', '82883', '82923', '82928', '83051', '83100', '83106', '83135']
nosy_count = 11.0
nosy_names = ['rhettinger', 'gregory.p.smith', 'exarkun', 'amaury.forgeotdarc', 'pitrou', 'giampaolo.rodola', 'christian.heimes', 'ajaksu2', 'benjamin.peterson', 'wplappert', 'ialbert']
pr_nums = []
priority = 'critical'
resolution = 'fixed'
stage = 'needs patch'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue4565'
versions = ['Python 3.1']

ialbert · 2008-12-06T15:10:59Z

The write performance into text files is substantially slower (5x-8x)
than that of python 2.5. This makes python 3.0 unsuited to any
application that needs to write larger amounts of data.

------------test code follows -----------------------

import time

lo, hi, step = 10**5, 10**6, 10**5

# writes increasingly more lines to a file
for N in range(lo, hi, step):
    fp = open('foodata.txt', 'wt')
    start = time.time()
    for i in range( N ):
        fp.write( '%s\n' % i)
    fp.close()
    stop = time.time()
    print ( "%s\t%s" % (N, stop-start) )

amauryfa · 2008-12-06T15:40:25Z

This is expected: the I/O stack has been completely rewritten... in
almost pure-python code.

There is a project to rewrite it in C. It started at
http://svn.python.org/view/sandbox/trunk/io-c/

ialbert · 2008-12-06T18:26:58Z

Well I would strongly dispute that anyone other than the developers
expected this. The release documentation states:

"The net result of the 3.0 generalizations is that Python 3.0 runs the
pystone benchmark around 10% slower than Python 2.5."

There is no indication of an order of magnitudes in read/write slowdown.
I believe that this issue is extremely serious! IO is an essential part
of a program, and today we live in the world of gigabytes of data. I am
reading reports of even more severe io slowdowns than what I saw:

http://bugs.python.org/issue4561

Java has had a hard time getting rid of the "it is very slow" stigma
even after getting a JIT compiler, so there is a danger there for a
lasting negative impression.

pitrou · 2008-12-06T21:52:36Z

Hi Amaury,

There is a project to rewrite it in C

Thanks for publicizing this. I'm a bit surprised by the adopted
approach. It seems you are merely translating the Python code into C. I
think the proper approach for the buffered IO classes would be to use a
fixed-size buffer which never gets reallocated.

If you look at bufferedwriter2.patch in bpo-3476, I had rewritten
BufferedWriter using a fixed-size buffer (although not for performance
reasons), I think it would be a good starting point for a C implementation.

tiran · 2008-12-06T22:01:04Z

For more bug reports see bpo-4533 and bpo-4561.

I suggest we close this bug report as duplicate and keep the discussion
in bpo-4561.

amauryfa · 2008-12-06T22:48:37Z

I'm a bit surprised by the adopted approach.
It seems you are merely translating the Python code into C.
I think the proper approach for the buffered IO classes would be
to use a fixed-size buffer which never gets reallocated.

You are certainly right, but the code io.py is already difficult to understand and
maintain; the corresponding C code adds one level of complexity;
had I changed the buffering strategy at the same time, it would have been impossible
to ensure a correct implementation.
Now that my C implementation of the Buffered classes seems correct (all tests pass,
except a few about destructors) we could try alternative approaches.

pitrou · 2008-12-20T12:46:42Z

We can't solve this for 3.0.1, downgrading to critical.

pitrou · 2009-01-18T12:11:40Z

The work to rewrite the IO stack in C will solve this problem as it will
probably solve most performance-related IO problems in py3k.

Amaury and I have been progressing a lot, the rewrite is now a real
branch in SVN at branches/io-c/. On this very issue, it is only 30%
slower than 2.x, which is quite good given the layered nature of the IO
stack and the fact that text IO does a lot more than in 2.x (it
translates newlines and encodes the text).

(actually, if I add an explicit .encode('utf8') call to the 2.x version
of the script, it becomes slower than our io-c rewrite)

pitrou · 2009-02-14T21:09:34Z

Issue depends on bpo-4967 which blocks use of memoryview objects with the
_ssl module.

benjaminp · 2009-02-19T03:36:02Z

This is basically going to be the killer feature in 3.1 ;). Therefore,
these are steps I think we need before we can merge the branch:

Fix the dependencies. (bpo-4967)
Resolve all outstanding issues with the IO lib on the io-c branch.
Rewrite the rest of StringIO in C?
Anything else I forgot?

pitrou · 2009-02-19T13:55:41Z

After "rewrite the rest of StringIO in C", there's "sanitize the
destructor behaviour of IOBase (if at all possible)".

pitrou · 2009-02-19T13:57:30Z

Oh, and "what to do of the now unused pure Python implementations in io.py"?
Easiest would be to dump them, as they will probably get hopelessly out
of sync, but perhaps there's some genuine portability/educational
advantage to keep them?

benjaminp · 2009-02-19T19:47:46Z

I think we should just drop the Python implementations. There's no point
in trying to keep two implementations around.

Besides, if we don't backport IO in C, we can maintain them in the trunk. :)

exarkun · 2009-02-19T19:57:33Z

Oh, and "what to do of the now unused pure Python implementations in
io.py"? Easiest would be to dump them, as they will probably get
hopelessly out of sync, but perhaps there's some genuine
portability/educational advantage to keep them?

The test suite should be run against both implementations. That way
tested behavior will always be the same for both. And all of its
behavior is tested, right? ;)

The value in the Python implementation is manifold. For example:

It eases testing of new features/techniques. Rather than going
straight to the C version when someone has an idea for a feature, it can
be implemented and tried out in Python. If it's cool, then the extra
effort of porting to C can be undertaken.
It helps other Python implementations immensely. PyPy, IronPython,
and Jython are all going to have to provide this library eventually (one
supposes). Forcing them each to re-implement it will mean it will be
that much longer before they support it.

benjaminp · 2009-02-19T20:01:12Z

On Thu, Feb 19, 2009 at 1:57 PM, Jean-Paul Calderone
<report@bugs.python.org> wrote:

Jean-Paul Calderone <exarkun@divmod.com> added the comment:

> Oh, and "what to do of the now unused pure Python implementations in
io.py"? Easiest would be to dump them, as they will probably get
hopelessly out of sync, but perhaps there's some genuine
portability/educational advantage to keep them?

The test suite should be run against both implementations. That way
tested behavior will always be the same for both. And all of its
behavior is tested, right? ;)

The value in the Python implementation is manifold. For example:

It eases testing of new features/techniques. Rather than going
straight to the C version when someone has an idea for a feature, it can
be implemented and tried out in Python. If it's cool, then the extra
effort of porting to C can be undertaken.

It helps other Python implementations immensely. PyPy, IronPython,
and Jython are all going to have to provide this library eventually (one
supposes). Forcing them each to re-implement it will mean it will be
that much longer before they support it.

We don't maintain any other features in two languages for those
purposes. IMO, it will just be more of a burden to fix bugs in two
different places as compared to the advantages you mention.

exarkun · 2009-02-19T20:05:51Z

We don't maintain any other features in two languages for those
purposes. IMO, it will just be more of a burden to fix bugs in two
different places as compared to the advantages you mention.

Surely the majority of the burden is imposed by the C implementation. I
expect that 90% of the time spent fixing bugs will be spent fixing them
in C. So for only a slightly increased maintenance cost, a massive
advantage is gained for other Python implementations. If the general
well-being and popularity of Python isn't a concern of CPython
developers, then perhaps the benefits can still be preserved at minimal
cost to the CPython developers by letting some Jython, IronPython, or
PyPy developers maintain the Python implementation of the io library in
the CPython source tree (rather than making them copy it elsewhere where
it will more frequently get out of sync, and where
Jython/IronPython/PyPy might waste effort in duplicating maintenance).

Or maybe none of them will care or object to the removal of the Python
version from CPython. It might at least be worth asking first, though.

pitrou · 2009-02-19T22:10:00Z

Hello JP,

Surely the majority of the burden is imposed by the C implementation. I
expect that 90% of the time spent fixing bugs will be spent fixing them
in C.

Hmm, it depends. It's probably true in general, but I suspect a fair
amount of work also went into getting the Python implementation correct,
since there are things in there that are tricky regardless of the
implementation language (I'm especially thinking of the TextIOWrapper
seek() and tell() methods).
(and there are still bugs in the Python implementation btw.)

If the general
well-being and popularity of Python isn't a concern of CPython
developers, then perhaps the benefits can still be preserved at minimal
cost to the CPython developers by letting some Jython, IronPython, or
PyPy developers maintain the Python implementation of the io library in
the CPython source tree

Well, if it is part of the CPython source tree, we (CPython developers)
can't realistically ignore it by saying it's someone else's job.

Or maybe none of them will care or object to the removal of the Python
version from CPython. It might at least be worth asking first, though.

In any case, it must first be asked on python-dev. We're not gonna dump
the code without telling anybody anything :)

cheers

Antoine.

rhettinger · 2009-02-19T22:32:19Z

[Benjamin Peterson]

I think we should just drop the Python implementations. There's no point
in trying to keep two implementations around.

I disagree. I've found great value in keeping a pure python version
around for things I've converted to C. The former serves as
documentation, as a tool for other implementations (like PyPy
IronPython, and Jython), and as a precise spec. The latter case
is especially valuable (otherwise, the spec becomes whatever
CPython happens to do).

Also, I've found that once the two are in-sync, keeping it that way
isn't hard. And, there effort for keeping them in-sync is a good
way to find bugs.

In the heapqmodule, we do a little magic in the test suite to
make sure the tests are run against both. It's not hard.

Raymond

exarkun · 2009-02-19T22:34:31Z

Hi Antoine,

> Surely the majority of the burden is imposed by the C
implementation. I expect that 90% of the time spent fixing bugs will be
spent fixing them in C.

Hmm, it depends. It's probably true in general, but I suspect a fair
amount of work also went into getting the Python implementation correct,
since there are things in there that are tricky regardless of the
implementation language (I'm especially thinking of the TextOWrapper
seek() and tell() methods). (and there are still bugs in the Python
implementation btw.)

Indeed, I'm sure a lot of work went into the Python implementation - and
hopefully that work *saved* a huge amount of work when doing the C
implementation. That's why people prototype things in Python, right? :)
So it seems to me that keeping the Python implementation is useful for
CPython, since if it made working on the C implementation easier in the
past, it will probably do so again in the future.

Basically, my point is that maintaining C and Python versions is
*cheaper* than just maintaining the C version alone. The stuff I said
about other VMs is true too, but it doesn't seem like anyone here is
going to be convinced by it ;) (and I haven't spoked to any developers
for other VMs about whether they really want it, anyway).

rhettinger · 2009-02-19T23:30:21Z

Basically, my point is that maintaining C and Python
versions is *cheaper* than just maintaining the C
version alone.

Well said.

gpshead · 2009-02-19T23:39:48Z

+1 to setting it up so that unit tests are always run against both and
keeping both.

pitrou · 2009-02-19T23:49:07Z

+1 to setting it up so that unit tests are always run against both and
keeping both.

If this is the way forward I recommend putting the pure Python versions
into a separate module, eg pyio.py (although the name is not very
elegant). It will make the separation clean and obvious.

(and perhaps it will have the side-effect of improving startup time,
although I'm not really worried about this)

benjaminp · 2009-02-20T19:27:46Z

It seems the decision of Python-dev is to keep both implementations.
We'll stuff the python one in _pyio and rewrite the tests to test both.
I'll see if I can get to this this weekend.

pitrou · 2009-02-21T19:12:11Z

The StringIO rewrite is finished now.

benjaminp · 2009-02-21T20:09:06Z

Ok. I've split the Python io implementation into the _pyio module and
rewritten the tests. All the C ones are passing, but some Python
implementation ones are failing.

benjaminp · 2009-02-22T01:13:11Z

Ok. I've fixed all the tests except
PyBufferedRandomTest.testFlushAndPeek and the garbage collections ones.

pitrou · 2009-02-22T19:50:52Z

What should we do about test_fileio, test_file and test_bufio?

benjaminp · 2009-02-22T21:31:59Z

On Sun, Feb 22, 2009 at 1:50 PM, Antoine Pitrou <report@bugs.python.org> wrote:

Antoine Pitrou <pitrou@free.fr> added the comment:

What should we do about test_fileio, test_file and test_bufio?

I changed test_file and test_bufio to test the open() implementations
of each library. test_fileio should be fine because the implementation
is the same for _pyio and io.

pitrou · 2009-02-22T23:00:37Z

There's also test_univnewlines, I think.

pitrou · 2009-02-23T20:12:50Z

Oh, and test_largefile and test_debussy as well :)

Le dimanche 22 février 2009 à 23:00 +0000, Antoine Pitrou a écrit :

Antoine Pitrou <pitrou@free.fr> added the comment:

There's also test_univnewlines, I think.

pitrou · 2009-02-23T20:26:11Z

test_largefile is done.
One more question: what shall we do with _pyio.OpenWrapper?
Should it become the default exported "open" object?

benjaminp · 2009-02-24T03:11:26Z

On Mon, Feb 23, 2009 at 2:26 PM, Antoine Pitrou <report@bugs.python.org> wrote:

test_largefile is done.

Thanks.

One more question: what shall we do with _pyio.OpenWrapper?
Should it become the default exported "open" object?

No, I think it was just meant to be used when _pyio is the builtin
open implementation.

benjaminp · 2009-02-24T19:17:00Z

We also have to figure out how to make the C IOBase a ABC, so people can
implement it.

pitrou · 2009-02-24T19:51:26Z

We also have to figure out how to make the C IOBase a ABC, so people can
implement it.

Mmmh, I know absolutely nothing about the ABC implementation.

pitrou · 2009-02-25T15:15:00Z

I just took a quick look at Lib/abc.py and there's no way *I*'ll
reimplement it in C :)

The only workable approach would be:

rename the current would-be ABCs (IOBase, RawIOBase, etc.) with a
leading underscore (_IOBase, _RawIOBase, etc.)
call abc.ABCMeta() with the right arguments to create heap-types
derived from those base types
call XXXIOBase.register() with each of the concrete classes
(BufferedReader, etc.) to register them with the ABCs created in 2

That is, do something like the following:

>> IOBase = abc.ABCMeta("IOBase", (_io.IOBase,), {})
>> RawIOBase = type("RawIOBase", (_io.RawIOBase, IOBase), {})
>> RawIOBase.register(_io.FileIO)
>> TextIOBase = type("TextIOBase", (_io.TextIOBase, IOBase), {})
>> TextIOBase.register(_io.TextIOWrapper)

Which gives:
>>> f = open('foobar', 'wb', buffering=0)
>>> isinstance(f, RawIOBase)
True
>>> isinstance(f, IOBase)
True
>>> f = open('foobar', 'w')
>>> isinstance(f, IOBase)
True
>>> isinstance(f, TextIOBase)
True
>>> isinstance(f, RawIOBase)
False

As you see, RawIOBase inherits both from IOBase (the ABC, for ABC-ness)
and _RawIOBase (the concrete non-ABC implementation). Implementation
classes like FileIO don't need to explicitly inherit the ABCs, only to
register with them.

Also, writing a Python implementation still inherits the
close-on-destroy behaviour:

>>> class S(RawIOBase):
...   def close(self):
...     print("closing")
... 
>>> s = S()
>>> del s
closing
>>>

Perhaps we could even do all this in Python in io.py?

benjaminp · 2009-02-25T19:34:49Z

On Wed, Feb 25, 2009 at 10:15 AM, Antoine Pitrou <report@bugs.python.org> wrote:

Antoine Pitrou <pitrou@free.fr> added the comment:

I just took a quick look at Lib/abc.py and there's no way *I*'ll
reimplement it in C :)

I don't blame you for that. :)

The only workable approach would be:

rename the current would-be ABCs (IOBase, RawIOBase, etc.) with a
leading underscore (_IOBase, _RawIOBase, etc.)

call abc.ABCMeta() with the right arguments to create heap-types
derived from those base types

call XXXIOBase.register() with each of the concrete classes
(BufferedReader, etc.) to register them with the ABCs created in 2

I think this is the best solution. We could also just move the Python
ABC's from _pyio to io.py and register() all the C IO classes, but
that would prevent the C implementation of IOBase from being used.

pitrou · 2009-02-27T22:37:55Z

Ok, so the ABC stuff is done now.
Remaining:

fix the test failures with the Python implementation
the _ssl bug

benjaminp · 2009-02-28T17:12:32Z

I just fixed the last failing test_io.

(I'm listing as dependencies issues we can close after the branch is
merged.)

benjaminp · 2009-02-28T17:29:02Z

These StringIO bugs should be dealt with:

bpo-5264
bpo-5265
bpo-5266

pitrou · 2009-03-03T00:24:24Z

Reviewers: ,

Description:
The diff between the py3k and io-c branches, for whoever wants to review
it.

Please review this at http://codereview.appspot.com/22061

Affected files:
Doc/library/io.rst
Lib/_pyio.py
Lib/importlib/init.py
Lib/importlib/_bootstrap.py
Lib/io.py
Lib/test/test_bufio.py
Lib/test/test_descr.py
Lib/test/test_file.py
Lib/test/test_fileio.py
Lib/test/test_io.py
Lib/test/test_largefile.py
Lib/test/test_memoryio.py
Lib/test/test_univnewlines.py
Lib/test/test_uu.py
Makefile.pre.in
Modules/Setup.dist
Modules/_bufferedio.c
Modules/_bytesio.c
Modules/_fileio.c
Modules/_iobase.c
Modules/_iomodule.h
Modules/_stringio.c
Modules/_textio.c
Modules/io.c
PC/VC6/pythoncore.dsp
PC/config.c
PCbuild/pythoncore.vcproj
Python/pythonrun.c
setup.py

devdanzin · 2009-03-03T21:09:34Z

A couple of typos in the Python implementation.

http://codereview.appspot.com/22061/diff/1/11
File Lib/_pyio.py (right):

http://codereview.appspot.com/22061/diff/1/11#newcode266
Line 266: fp is closed after the suite of the with statment is complete:
statment -> statement

http://codereview.appspot.com/22061/diff/1/11#newcode844
Line 844: self._reset_read_buf()
Setting "_read_buf" and "_read_pos" directly on __init__ may help
introspection tools.

http://codereview.appspot.com/22061/diff/1/11#newcode963
Line 963: DEAFULT_BUFFER_SIZE. If max_buffer_size is omitted, it
defaults to
DEAFULT_BUFFER_SIZE -> DEFAULT_BUFFER_SIZE

http://codereview.appspot.com/22061/diff/1/11#newcode1728
Line 1728: decoder = self._decoder or self._get_decoder()
'decoder' isn't used in this method, is this here for an useful
side-effect?

http://codereview.appspot.com/22061/diff/1/11#newcode1784
Line 1784: more_line = ''
This seems unused.

http://codereview.appspot.com/22061

benjaminp · 2009-03-03T21:47:43Z

2009/3/3 Daniel Diniz <report@bugs.python.org>:

A couple of typos in the Python implementation.

Thanks for taking a look! Fixed these things in r70135.

http://codereview.appspot.com/22061/diff/1/11#newcode844
Line 844: self._reset_read_buf()
Setting "_read_buf" and "_read_pos" directly on __init__ may help
introspection tools.

Perhaps, but I think it duplicates too much of _reset_read_buf(). And
it wouldn't damage introspection, just static analysis.

http://codereview.appspot.com/22061/diff/1/11#newcode1728
Line 1728: decoder = self._decoder or self._get_decoder()
'decoder' isn't used in this method, is this here for an useful
side-effect?

Yes, it's for side affect, but it needn't be in a variable.

benjaminp · 2009-03-04T21:29:08Z

And the io-c branch has been merged in r70152.

ialbert mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Dec 6, 2008

amauryfa self-assigned this Dec 6, 2008

warsaw added the release-blocker label Dec 6, 2008

loewis mannequin added deferred-blocker and removed release-blocker labels Dec 10, 2008

loewis mannequin added release-blocker and removed deferred-blocker labels Dec 20, 2008

pitrou removed the release-blocker label Dec 20, 2008

pitrou added extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Jan 18, 2009

pitrou changed the title ~~io write() performance very slow~~ Rewrite the IO stack in C Jan 18, 2009

pitrou unassigned amauryfa Feb 14, 2009

benjaminp closed this as completed Mar 4, 2009

ezio-melotti transferred this issue from another repository Apr 10, 2022

Rewrite the IO stack in C #48815

Rewrite the IO stack in C #48815

Comments

ialbert mannequin commented Dec 6, 2008

ialbert mannequin commented Dec 6, 2008

amauryfa commented Dec 6, 2008

ialbert mannequin commented Dec 6, 2008

pitrou commented Dec 6, 2008

tiran commented Dec 6, 2008

amauryfa commented Dec 6, 2008

pitrou commented Dec 20, 2008

pitrou commented Jan 18, 2009

pitrou commented Feb 14, 2009

benjaminp commented Feb 19, 2009

pitrou commented Feb 19, 2009

pitrou commented Feb 19, 2009

benjaminp commented Feb 19, 2009

exarkun mannequin commented Feb 19, 2009

benjaminp commented Feb 19, 2009

exarkun mannequin commented Feb 19, 2009

pitrou commented Feb 19, 2009

rhettinger commented Feb 19, 2009

exarkun mannequin commented Feb 19, 2009

rhettinger commented Feb 19, 2009

gpshead commented Feb 19, 2009

pitrou commented Feb 19, 2009

benjaminp commented Feb 20, 2009

pitrou commented Feb 21, 2009

benjaminp commented Feb 21, 2009

benjaminp commented Feb 22, 2009

pitrou commented Feb 22, 2009

benjaminp commented Feb 22, 2009

pitrou commented Feb 22, 2009

pitrou commented Feb 23, 2009

pitrou commented Feb 23, 2009

benjaminp commented Feb 24, 2009

benjaminp commented Feb 24, 2009

pitrou commented Feb 24, 2009

pitrou commented Feb 25, 2009

benjaminp commented Feb 25, 2009

pitrou commented Feb 27, 2009

benjaminp commented Feb 28, 2009

benjaminp commented Feb 28, 2009

pitrou commented Mar 3, 2009

devdanzin mannequin commented Mar 3, 2009

benjaminp commented Mar 3, 2009

benjaminp commented Mar 4, 2009