Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite the IO stack in C #48815

Closed
ialbert mannequin opened this issue Dec 6, 2008 · 43 comments
Closed

Rewrite the IO stack in C #48815

ialbert mannequin opened this issue Dec 6, 2008 · 43 comments
Labels
extension-modules C modules in the Modules dir performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@ialbert
Copy link
Mannequin

ialbert mannequin commented Dec 6, 2008

BPO 4565
Nosy @rhettinger, @gpshead, @amauryfa, @pitrou, @giampaolo, @tiran, @devdanzin, @benjaminp
Dependencies
  • bpo-4263: BufferedWriter non-blocking overage
  • bpo-4862: utf-16 BOM is not skipped after seek(0)
  • bpo-4967: Bugs in _ssl object read() when a buffer is specified
  • bpo-4996: io.TextIOWrapper calls buffer.read1()
  • bpo-5006: Duplicate UTF-16 BOM if a file is open in append mode
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2009-03-04.21:29:08.414>
    created_at = <Date 2008-12-06.15:10:59.693>
    labels = ['extension-modules', 'library', 'performance']
    title = 'Rewrite the IO stack in C'
    updated_at = <Date 2009-03-04.21:30:46.095>
    user = 'https://bugs.python.org/ialbert'

    bugs.python.org fields:

    activity = <Date 2009-03-04.21:30:46.095>
    actor = 'benjamin.peterson'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-03-04.21:29:08.414>
    closer = 'benjamin.peterson'
    components = ['Extension Modules', 'Library (Lib)']
    creation = <Date 2008-12-06.15:10:59.693>
    creator = 'ialbert'
    dependencies = ['4263', '4862', '4967', '4996', '5006']
    files = []
    hgrepos = []
    issue_num = 4565
    keywords = []
    message_count = 43.0
    messages = ['77132', '77138', '77158', '77170', '77171', '77180', '78096', '80092', '82118', '82450', '82474', '82475', '82490', '82491', '82493', '82494', '82501', '82502', '82503', '82507', '82508', '82509', '82542', '82576', '82580', '82590', '82608', '82611', '82614', '82639', '82642', '82657', '82671', '82675', '82705', '82713', '82883', '82923', '82928', '83051', '83100', '83106', '83135']
    nosy_count = 11.0
    nosy_names = ['rhettinger', 'gregory.p.smith', 'exarkun', 'amaury.forgeotdarc', 'pitrou', 'giampaolo.rodola', 'christian.heimes', 'ajaksu2', 'benjamin.peterson', 'wplappert', 'ialbert']
    pr_nums = []
    priority = 'critical'
    resolution = 'fixed'
    stage = 'needs patch'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue4565'
    versions = ['Python 3.1']

    @ialbert
    Copy link
    Mannequin Author

    ialbert mannequin commented Dec 6, 2008

    The write performance into text files is substantially slower (5x-8x)
    than that of python 2.5. This makes python 3.0 unsuited to any
    application that needs to write larger amounts of data.

    ------------test code follows -----------------------

    import time

    lo, hi, step = 10**5, 10**6, 10**5

    # writes increasingly more lines to a file
    for N in range(lo, hi, step):
        fp = open('foodata.txt', 'wt')
        start = time.time()
        for i in range( N ):
            fp.write( '%s\n' % i)
        fp.close()
        stop = time.time()
        print ( "%s\t%s" % (N, stop-start) )

    @ialbert ialbert mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Dec 6, 2008
    @amauryfa
    Copy link
    Member

    amauryfa commented Dec 6, 2008

    This is expected: the I/O stack has been completely rewritten... in
    almost pure-python code.

    There is a project to rewrite it in C. It started at
    http://svn.python.org/view/sandbox/trunk/io-c/

    @amauryfa amauryfa self-assigned this Dec 6, 2008
    @ialbert
    Copy link
    Mannequin Author

    ialbert mannequin commented Dec 6, 2008

    Well I would strongly dispute that anyone other than the developers
    expected this. The release documentation states:

    "The net result of the 3.0 generalizations is that Python 3.0 runs the
    pystone benchmark around 10% slower than Python 2.5."

    There is no indication of an order of magnitudes in read/write slowdown.
    I believe that this issue is extremely serious! IO is an essential part
    of a program, and today we live in the world of gigabytes of data. I am
    reading reports of even more severe io slowdowns than what I saw:

    http://bugs.python.org/issue4561

    Java has had a hard time getting rid of the "it is very slow" stigma
    even after getting a JIT compiler, so there is a danger there for a
    lasting negative impression.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 6, 2008

    Hi Amaury,

    There is a project to rewrite it in C

    Thanks for publicizing this. I'm a bit surprised by the adopted
    approach. It seems you are merely translating the Python code into C. I
    think the proper approach for the buffered IO classes would be to use a
    fixed-size buffer which never gets reallocated.

    If you look at bufferedwriter2.patch in bpo-3476, I had rewritten
    BufferedWriter using a fixed-size buffer (although not for performance
    reasons), I think it would be a good starting point for a C implementation.

    @tiran
    Copy link
    Member

    tiran commented Dec 6, 2008

    For more bug reports see bpo-4533 and bpo-4561.

    I suggest we close this bug report as duplicate and keep the discussion
    in bpo-4561.

    @amauryfa
    Copy link
    Member

    amauryfa commented Dec 6, 2008

    I'm a bit surprised by the adopted approach.
    It seems you are merely translating the Python code into C.
    I think the proper approach for the buffered IO classes would be
    to use a fixed-size buffer which never gets reallocated.

    You are certainly right, but the code io.py is already difficult to understand and
    maintain; the corresponding C code adds one level of complexity;
    had I changed the buffering strategy at the same time, it would have been impossible
    to ensure a correct implementation.
    Now that my C implementation of the Buffered classes seems correct (all tests pass,
    except a few about destructors) we could try alternative approaches.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 20, 2008

    We can't solve this for 3.0.1, downgrading to critical.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 18, 2009

    The work to rewrite the IO stack in C will solve this problem as it will
    probably solve most performance-related IO problems in py3k.

    Amaury and I have been progressing a lot, the rewrite is now a real
    branch in SVN at branches/io-c/. On this very issue, it is only 30%
    slower than 2.x, which is quite good given the layered nature of the IO
    stack and the fact that text IO does a lot more than in 2.x (it
    translates newlines and encodes the text).

    (actually, if I add an explicit .encode('utf8') call to the 2.x version
    of the script, it becomes slower than our io-c rewrite)

    @pitrou pitrou added extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Jan 18, 2009
    @pitrou pitrou changed the title io write() performance very slow Rewrite the IO stack in C Jan 18, 2009
    @pitrou
    Copy link
    Member

    pitrou commented Feb 14, 2009

    Issue depends on bpo-4967 which blocks use of memoryview objects with the
    _ssl module.

    @benjaminp
    Copy link
    Contributor

    This is basically going to be the killer feature in 3.1 ;). Therefore,
    these are steps I think we need before we can merge the branch:

    • Fix the dependencies. (bpo-4967)
    • Resolve all outstanding issues with the IO lib on the io-c branch.
    • Rewrite the rest of StringIO in C?
    • Anything else I forgot?

    @pitrou
    Copy link
    Member

    pitrou commented Feb 19, 2009

    After "rewrite the rest of StringIO in C", there's "sanitize the
    destructor behaviour of IOBase (if at all possible)".

    @pitrou
    Copy link
    Member

    pitrou commented Feb 19, 2009

    Oh, and "what to do of the now unused pure Python implementations in io.py"?
    Easiest would be to dump them, as they will probably get hopelessly out
    of sync, but perhaps there's some genuine portability/educational
    advantage to keep them?

    @benjaminp
    Copy link
    Contributor

    I think we should just drop the Python implementations. There's no point
    in trying to keep two implementations around.

    Besides, if we don't backport IO in C, we can maintain them in the trunk. :)

    @exarkun
    Copy link
    Mannequin

    exarkun mannequin commented Feb 19, 2009

    Oh, and "what to do of the now unused pure Python implementations in
    io.py"? Easiest would be to dump them, as they will probably get
    hopelessly out of sync, but perhaps there's some genuine
    portability/educational advantage to keep them?

    The test suite should be run against both implementations. That way
    tested behavior will always be the same for both. And all of its
    behavior is tested, right? ;)

    The value in the Python implementation is manifold. For example:

    • It eases testing of new features/techniques. Rather than going
      straight to the C version when someone has an idea for a feature, it can
      be implemented and tried out in Python. If it's cool, then the extra
      effort of porting to C can be undertaken.
    • It helps other Python implementations immensely. PyPy, IronPython,
      and Jython are all going to have to provide this library eventually (one
      supposes). Forcing them each to re-implement it will mean it will be
      that much longer before they support it.

    @benjaminp
    Copy link
    Contributor

    On Thu, Feb 19, 2009 at 1:57 PM, Jean-Paul Calderone
    <report@bugs.python.org> wrote:

    Jean-Paul Calderone <exarkun@divmod.com> added the comment:

    > Oh, and "what to do of the now unused pure Python implementations in
    io.py"? Easiest would be to dump them, as they will probably get
    hopelessly out of sync, but perhaps there's some genuine
    portability/educational advantage to keep them?

    The test suite should be run against both implementations. That way
    tested behavior will always be the same for both. And all of its
    behavior is tested, right? ;)

    The value in the Python implementation is manifold. For example:

    • It eases testing of new features/techniques. Rather than going
      straight to the C version when someone has an idea for a feature, it can
      be implemented and tried out in Python. If it's cool, then the extra
      effort of porting to C can be undertaken.
    • It helps other Python implementations immensely. PyPy, IronPython,
      and Jython are all going to have to provide this library eventually (one
      supposes). Forcing them each to re-implement it will mean it will be
      that much longer before they support it.

    We don't maintain any other features in two languages for those
    purposes. IMO, it will just be more of a burden to fix bugs in two
    different places as compared to the advantages you mention.

    @exarkun
    Copy link
    Mannequin

    exarkun mannequin commented Feb 19, 2009

    We don't maintain any other features in two languages for those
    purposes. IMO, it will just be more of a burden to fix bugs in two
    different places as compared to the advantages you mention.

    Surely the majority of the burden is imposed by the C implementation. I
    expect that 90% of the time spent fixing bugs will be spent fixing them
    in C. So for only a slightly increased maintenance cost, a massive
    advantage is gained for other Python implementations. If the general
    well-being and popularity of Python isn't a concern of CPython
    developers, then perhaps the benefits can still be preserved at minimal
    cost to the CPython developers by letting some Jython, IronPython, or
    PyPy developers maintain the Python implementation of the io library in
    the CPython source tree (rather than making them copy it elsewhere where
    it will more frequently get out of sync, and where
    Jython/IronPython/PyPy might waste effort in duplicating maintenance).

    Or maybe none of them will care or object to the removal of the Python
    version from CPython. It might at least be worth asking first, though.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 19, 2009

    Hello JP,

    Surely the majority of the burden is imposed by the C implementation. I
    expect that 90% of the time spent fixing bugs will be spent fixing them
    in C.

    Hmm, it depends. It's probably true in general, but I suspect a fair
    amount of work also went into getting the Python implementation correct,
    since there are things in there that are tricky regardless of the
    implementation language (I'm especially thinking of the TextIOWrapper
    seek() and tell() methods).
    (and there are still bugs in the Python implementation btw.)

    If the general
    well-being and popularity of Python isn't a concern of CPython
    developers, then perhaps the benefits can still be preserved at minimal
    cost to the CPython developers by letting some Jython, IronPython, or
    PyPy developers maintain the Python implementation of the io library in
    the CPython source tree

    Well, if it is part of the CPython source tree, we (CPython developers)
    can't realistically ignore it by saying it's someone else's job.

    Or maybe none of them will care or object to the removal of the Python
    version from CPython. It might at least be worth asking first, though.

    In any case, it must first be asked on python-dev. We're not gonna dump
    the code without telling anybody anything :)

    cheers

    Antoine.

    @rhettinger
    Copy link
    Contributor

    [Benjamin Peterson]

    I think we should just drop the Python implementations. There's no point
    in trying to keep two implementations around.

    I disagree. I've found great value in keeping a pure python version
    around for things I've converted to C. The former serves as
    documentation, as a tool for other implementations (like PyPy
    IronPython, and Jython), and as a precise spec. The latter case
    is especially valuable (otherwise, the spec becomes whatever
    CPython happens to do).

    Also, I've found that once the two are in-sync, keeping it that way
    isn't hard. And, there effort for keeping them in-sync is a good
    way to find bugs.

    In the heapqmodule, we do a little magic in the test suite to
    make sure the tests are run against both. It's not hard.

    Raymond

    @exarkun
    Copy link
    Mannequin

    exarkun mannequin commented Feb 19, 2009

    Hi Antoine,

    > Surely the majority of the burden is imposed by the C
    implementation. I expect that 90% of the time spent fixing bugs will be
    spent fixing them in C.

    Hmm, it depends. It's probably true in general, but I suspect a fair
    amount of work also went into getting the Python implementation correct,
    since there are things in there that are tricky regardless of the
    implementation language (I'm especially thinking of the TextOWrapper
    seek() and tell() methods). (and there are still bugs in the Python
    implementation btw.)

    Indeed, I'm sure a lot of work went into the Python implementation - and
    hopefully that work *saved* a huge amount of work when doing the C
    implementation. That's why people prototype things in Python, right? :)
    So it seems to me that keeping the Python implementation is useful for
    CPython, since if it made working on the C implementation easier in the
    past, it will probably do so again in the future.

    Basically, my point is that maintaining C and Python versions is
    *cheaper* than just maintaining the C version alone. The stuff I said
    about other VMs is true too, but it doesn't seem like anyone here is
    going to be convinced by it ;) (and I haven't spoked to any developers
    for other VMs about whether they really want it, anyway).

    @rhettinger
    Copy link
    Contributor

    Basically, my point is that maintaining C and Python
    versions is *cheaper* than just maintaining the C
    version alone.

    Well said.

    @gpshead
    Copy link
    Member

    gpshead commented Feb 19, 2009

    +1 to setting it up so that unit tests are always run against both and
    keeping both.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 19, 2009

    +1 to setting it up so that unit tests are always run against both and
    keeping both.

    If this is the way forward I recommend putting the pure Python versions
    into a separate module, eg pyio.py (although the name is not very
    elegant). It will make the separation clean and obvious.

    (and perhaps it will have the side-effect of improving startup time,
    although I'm not really worried about this)

    @benjaminp
    Copy link
    Contributor

    It seems the decision of Python-dev is to keep both implementations.
    We'll stuff the python one in _pyio and rewrite the tests to test both.
    I'll see if I can get to this this weekend.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 21, 2009

    The StringIO rewrite is finished now.

    @benjaminp
    Copy link
    Contributor

    Ok. I've split the Python io implementation into the _pyio module and
    rewritten the tests. All the C ones are passing, but some Python
    implementation ones are failing.

    @benjaminp
    Copy link
    Contributor

    Ok. I've fixed all the tests except
    PyBufferedRandomTest.testFlushAndPeek and the garbage collections ones.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 22, 2009

    What should we do about test_fileio, test_file and test_bufio?

    @benjaminp
    Copy link
    Contributor

    On Sun, Feb 22, 2009 at 1:50 PM, Antoine Pitrou <report@bugs.python.org> wrote:

    Antoine Pitrou <pitrou@free.fr> added the comment:

    What should we do about test_fileio, test_file and test_bufio?

    I changed test_file and test_bufio to test the open() implementations
    of each library. test_fileio should be fine because the implementation
    is the same for _pyio and io.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 22, 2009

    There's also test_univnewlines, I think.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 23, 2009

    Oh, and test_largefile and test_debussy as well :)

    Le dimanche 22 février 2009 à 23:00 +0000, Antoine Pitrou a écrit :

    Antoine Pitrou <pitrou@free.fr> added the comment:

    There's also test_univnewlines, I think.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 23, 2009

    test_largefile is done.
    One more question: what shall we do with _pyio.OpenWrapper?
    Should it become the default exported "open" object?

    @benjaminp
    Copy link
    Contributor

    On Mon, Feb 23, 2009 at 2:26 PM, Antoine Pitrou <report@bugs.python.org> wrote:

    test_largefile is done.

    Thanks.

    One more question: what shall we do with _pyio.OpenWrapper?
    Should it become the default exported "open" object?

    No, I think it was just meant to be used when _pyio is the builtin
    open implementation.

    @benjaminp
    Copy link
    Contributor

    We also have to figure out how to make the C IOBase a ABC, so people can
    implement it.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 24, 2009

    We also have to figure out how to make the C IOBase a ABC, so people can
    implement it.

    Mmmh, I know absolutely nothing about the ABC implementation.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 25, 2009

    I just took a quick look at Lib/abc.py and there's no way *I*'ll
    reimplement it in C :)

    The only workable approach would be:

    1. rename the current would-be ABCs (IOBase, RawIOBase, etc.) with a
      leading underscore (_IOBase, _RawIOBase, etc.)
    2. call abc.ABCMeta() with the right arguments to create heap-types
      derived from those base types
    3. call XXXIOBase.register() with each of the concrete classes
      (BufferedReader, etc.) to register them with the ABCs created in 2

    That is, do something like the following:

    >> IOBase = abc.ABCMeta("IOBase", (_io.IOBase,), {})
    >> RawIOBase = type("RawIOBase", (_io.RawIOBase, IOBase), {})
    >> RawIOBase.register(_io.FileIO)
    >> TextIOBase = type("TextIOBase", (_io.TextIOBase, IOBase), {})
    >> TextIOBase.register(_io.TextIOWrapper)

    Which gives:
    >>> f = open('foobar', 'wb', buffering=0)
    >>> isinstance(f, RawIOBase)
    True
    >>> isinstance(f, IOBase)
    True
    >>> f = open('foobar', 'w')
    >>> isinstance(f, IOBase)
    True
    >>> isinstance(f, TextIOBase)
    True
    >>> isinstance(f, RawIOBase)
    False

    As you see, RawIOBase inherits both from IOBase (the ABC, for ABC-ness)
    and _RawIOBase (the concrete non-ABC implementation). Implementation
    classes like FileIO don't need to explicitly inherit the ABCs, only to
    register with them.

    Also, writing a Python implementation still inherits the
    close-on-destroy behaviour:

    >>> class S(RawIOBase):
    ...   def close(self):
    ...     print("closing")
    ... 
    >>> s = S()
    >>> del s
    closing
    >>> 

    Perhaps we could even do all this in Python in io.py?

    @benjaminp
    Copy link
    Contributor

    On Wed, Feb 25, 2009 at 10:15 AM, Antoine Pitrou <report@bugs.python.org> wrote:

    Antoine Pitrou <pitrou@free.fr> added the comment:

    I just took a quick look at Lib/abc.py and there's no way *I*'ll
    reimplement it in C :)

    I don't blame you for that. :)

    The only workable approach would be:

    1. rename the current would-be ABCs (IOBase, RawIOBase, etc.) with a
      leading underscore (_IOBase, _RawIOBase, etc.)
    2. call abc.ABCMeta() with the right arguments to create heap-types
      derived from those base types
    3. call XXXIOBase.register() with each of the concrete classes
      (BufferedReader, etc.) to register them with the ABCs created in 2

    I think this is the best solution. We could also just move the Python
    ABC's from _pyio to io.py and register() all the C IO classes, but
    that would prevent the C implementation of IOBase from being used.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 27, 2009

    Ok, so the ABC stuff is done now.
    Remaining:

    • fix the test failures with the Python implementation
    • the _ssl bug

    @benjaminp
    Copy link
    Contributor

    I just fixed the last failing test_io.

    (I'm listing as dependencies issues we can close after the branch is
    merged.)

    @benjaminp
    Copy link
    Contributor

    These StringIO bugs should be dealt with:

    bpo-5264
    bpo-5265
    bpo-5266

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Mar 3, 2009

    A couple of typos in the Python implementation.

    http://codereview.appspot.com/22061/diff/1/11
    File Lib/_pyio.py (right):

    http://codereview.appspot.com/22061/diff/1/11#newcode266
    Line 266: fp is closed after the suite of the with statment is complete:
    statment -> statement

    http://codereview.appspot.com/22061/diff/1/11#newcode844
    Line 844: self._reset_read_buf()
    Setting "_read_buf" and "_read_pos" directly on __init__ may help
    introspection tools.

    http://codereview.appspot.com/22061/diff/1/11#newcode963
    Line 963: DEAFULT_BUFFER_SIZE. If max_buffer_size is omitted, it
    defaults to
    DEAFULT_BUFFER_SIZE -> DEFAULT_BUFFER_SIZE

    http://codereview.appspot.com/22061/diff/1/11#newcode1728
    Line 1728: decoder = self._decoder or self._get_decoder()
    'decoder' isn't used in this method, is this here for an useful
    side-effect?

    http://codereview.appspot.com/22061/diff/1/11#newcode1784
    Line 1784: more_line = ''
    This seems unused.

    http://codereview.appspot.com/22061

    @benjaminp
    Copy link
    Contributor

    2009/3/3 Daniel Diniz <report@bugs.python.org>:

    A couple of typos in the Python implementation.

    Thanks for taking a look! Fixed these things in r70135.

    http://codereview.appspot.com/22061/diff/1/11#newcode844
    Line 844: self._reset_read_buf()
    Setting "_read_buf" and "_read_pos" directly on __init__ may help
    introspection tools.

    Perhaps, but I think it duplicates too much of _reset_read_buf(). And
    it wouldn't damage introspection, just static analysis.

    http://codereview.appspot.com/22061/diff/1/11#newcode1728
    Line 1728: decoder = self._decoder or self._get_decoder()
    'decoder' isn't used in this method, is this here for an useful
    side-effect?

    Yes, it's for side affect, but it needn't be in a variable.

    @benjaminp
    Copy link
    Contributor

    And the io-c branch has been merged in r70152.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    extension-modules C modules in the Modules dir performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants