Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

file written using marshal in 3.2 can be read by 2.7, but not 3.2 or 3.3 #56500

Closed
vsajip opened this issue Jun 9, 2011 · 40 comments
Closed
Labels
release-blocker type-bug An unexpected behavior, bug, or error

Comments

@vsajip
Copy link
Member

vsajip commented Jun 9, 2011

BPO 12291
Nosy @birkenfeld, @vsajip, @amauryfa, @pitrou, @vstinner, @benjaminp, @djc
Files
  • data.bin: Test data - objects written using marshal in Python 3.2
  • tried_renaming.diff
  • ab1c38ffb8d4.diff
  • multiple_dump_load_test.patch: failing test to test_marshal.py
  • multiple_dump_load_read_write_test.patch: multiple dump write, load read
  • 0feab4e7b27f.diff: Merged Engelbert's test patches into fix branch.
  • from_antoine_comments.diff: Updates in response to Antoine's review comments.
  • update_to_test_importlib.diff: Change to test_importlib added
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-07-02.15:43:18.681>
    created_at = <Date 2011-06-09.09:10:09.745>
    labels = ['type-bug', 'release-blocker']
    title = 'file written using marshal in 3.2 can be read by 2.7, but not 3.2 or 3.3'
    updated_at = <Date 2011-07-02.17:52:23.757>
    user = 'https://github.com/vsajip'

    bugs.python.org fields:

    activity = <Date 2011-07-02.17:52:23.757>
    actor = 'georg.brandl'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-07-02.15:43:18.681>
    closer = 'python-dev'
    components = []
    creation = <Date 2011-06-09.09:10:09.745>
    creator = 'vinay.sajip'
    dependencies = []
    files = ['22289', '22411', '22440', '22472', '22479', '22487', '22528', '22530']
    hgrepos = ['26']
    issue_num = 12291
    keywords = ['patch', 'needs review']
    message_count = 40.0
    messages = ['137943', '137944', '137966', '137969', '137973', '137987', '138012', '138024', '138070', '138160', '138162', '138571', '138585', '138617', '138618', '138623', '138624', '138625', '138631', '138636', '138637', '138661', '138665', '138669', '138672', '139095', '139151', '139204', '139247', '139262', '139508', '139510', '139513', '139537', '139539', '139588', '139628', '139661', '139662', '139663']
    nosy_count = 10.0
    nosy_names = ['georg.brandl', 'vinay.sajip', 'amaury.forgeotdarc', 'grubert', 'pitrou', 'vstinner', 'benjamin.peterson', 'djc', 'Arfrever', 'python-dev']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue12291'
    versions = ['Python 3.2', 'Python 3.3']

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 9, 2011

    The attached file 'data.bin' was written using Python 3.2. It can be read by Python 2.7, but in 3.2 and 3.3, after the first object is read, the file pointer is positioned at EOF, causing an error on subsequent reads. A simple test script 'marshtest.py' is below:

    import marshal
    import sys
    
    f = open('data.bin', 'rb')
    t = marshal.load(f)
    print('Python version:', sys.version)
    print('Object just loaded was:\n%r' % (t,))
    print('File position is now at %d' % f.tell())
    t = marshal.load(f)
    print('Object just loaded was:\n%r' % (t,))
    print('File position is now at %d' % f.tell())

    Results of running it under various Python versions:

    vinay@eta-natty:~/projects/scratch$ python marshtest.py
    ('Python version:', '2.7.1+ (r271:86832, Apr 11 2011, 18:05:24) \n[GCC 4.5.2]')
    Object just loaded was:
    (u'text', u'alfa', 202, 1.0, '\x00\x00\x00\x01]q\x00K\x03a')
    File position is now at 52
    Object just loaded was:
    (u'text', u'alfa', 212, 1.0, '\x00\x00\x00\x01]q\x00K\x03a')
    File position is now at 104

    vinay@eta-natty:~/projects/scratch$ python3.2 marshtest.py 
    Python version: 3.2 (r32:88445, Mar 25 2011, 19:28:28) 
    [GCC 4.5.2]
    Object just loaded was:
    ('text', 'alfa', 202, 1.0, b'\x00\x00\x00\x01]q\x00K\x03a')
    File position is now at 53617
    Traceback (most recent call last):
      File "marshtest.py", line 9, in <module>
        t = marshal.load(f)
    EOFError: EOF read where object expected
    
    vinay@eta-natty:~/projects/scratch$ python3.3 marshtest.py 
    Python version: 3.3a0 (default:8d4d87dd73ae, Apr  2 2011, 14:25:31) 
    [GCC 4.5.2]
    Object just loaded was:
    ('text', 'alfa', 202, 1.0, b'\x00\x00\x00\x01]q\x00K\x03a')
    File position is now at 53617
    Traceback (most recent call last):
      File "marshtest.py", line 9, in <module>
        t = marshal.load(f)
    EOFError: EOF read where object expected

    Note the size of the file is 53617 bytes.

    vinay@eta-natty:~/projects/scratch$ ls -l data.bin
    -rw------- 1 vinay vinay 53617 2011-06-09 09:33 data.bin

    @amauryfa
    Copy link
    Member

    amauryfa commented Jun 9, 2011

    Sadly, marshal.load() looks broken:

    • The function starts with the comment
      /* XXX Quick hack -- need to do this differently */

    • It starts by calling f.read() which consumes the whole file (and explains the issue reported here)

    - The code was probably converted too quickly:
        if (PyBytes_Check(data)) {
              ...
        else if (PyBytes_Check(data)) {

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 9, 2011

    I've added a patch which I think fixes marshal.load(), and would be grateful for a review - it's my first C patch for Python. The test_marshal test passes, though I haven't yet added a test for the failing scenario.

    @amauryfa
    Copy link
    Member

    amauryfa commented Jun 9, 2011

    • Please replace tabs characters by space
    • "//" comments are not accepted by some picky C89 compilers

    Also, calling f.read(1) for each character may have a large performance impact. I don't know how this can be solved though.

    @benjaminp
    Copy link
    Contributor

    It should probably be buffered at 512 bytes or some other reasonable number.

    @birkenfeld
    Copy link
    Member

    Sounds blocker-ish enough to me.

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 9, 2011

    @Amaury: Re formatting and comment style - I will address this. The diff I posted is to the default branch rather than 3.2, though it won't change the main substance of the patch.

    Re. performance and buffering: I agree this could have an impact (although strings are not actually read a byte at a time), but it's not easy to address because even if you e.g. wrapped the incoming stream with a BufferedReader, you have nowhere to keep the resulting instance in between multiple calls to marshal.load() from external code.

    For best performance ISTM you'd have to implement your own buffering - but surely this is to be avoided? It can't be the only place where you need good I/O performance from file-like objects. A quick scan of the pickle module didn't show me anything I could easily use, and a key difference from pickle is that you don't have a corresponding "Unmarshaller" object which could hold the state that an Unpickler does, and which would be needed to hold the buffering state.

    Of course one could implement limited buffering, i.e. for just the duration of a single marshal.load() call, but perhaps this could be done as a later step after evaluating the actual performance of this implementation. Having something that works correctly, albeit slowly, is better than having the current situation ...

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 9, 2011

    Another thought about buffering, which might be a bit of a show-stopper: if in a particular load() call we read ahead for buffering purposes, we've altered the underlying stream (which might not be seekable to allow position restoring). The next call to load() will be likely to fail, since the underlying stream has been positioned further than expected by the earlier load() call, and the buffer state has been lost. Not sure how easy it is to get around this ...

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 10, 2011

    Attached is an improved patch:

    1. It's against the 3.2 branch, rather than default.
    2. It has an added multiple load test in test_marshal.py.
    3. There's more error checking for an EOF condition.
    4. I've removed tab chars and used /* C89 comments */.

    Of course it's not as fast as the 2.x version, since that had a limitation on only being able to load from file objects, and a fast path for PyFile using stdio calls directly. Hard to compete with that!

    Suggestions for improvements gratefully received :-)

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 11, 2011

    Patch is now in my public sandbox on hg.python.org.

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 11, 2011

    For some reason, "Create Patch" is failing with a

    [Errno 2] No such file or directory: '/home/roundup/trackers/tracker/cpython/Doc/Makefile'

    I've logged an issue on the meta tracker:

    http://psf.upfronthosting.co.za/roundup/meta/issue405

    @birkenfeld
    Copy link
    Member

    Any reviewers?

    @benjaminp
    Copy link
    Contributor

    Why can't you just call fileno() on the file object?

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 18, 2011

    Sorry I'm being dense, but which file object do you mean?

    @benjaminp
    Copy link
    Contributor

    2011/6/18 Vinay Sajip <report@bugs.python.org>:

    Vinay Sajip <vinay_sajip@yahoo.co.uk> added the comment:

    Sorry I'm being dense, but which file object do you mean?

    The python file object.

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 19, 2011

    Benjamin Peterson <benjamin@python.org> added the comment:
    > Vinay Sajip <vinay_sajip@yahoo.co.uk> added the comment:
    >
    > Sorry I'm being dense, but which file object do you mean?

    The python file object.

    Do you mean special-case handling of the circumstance when the file-like object
    being marshalled from is actually a file? The existing code paths (from when
    marshal.load expected only to work with Python file objects) use FILE *, and
    these are used by other code to read magic numbers etc. I believe. Or perhaps
    I'm still misunderstanding what you're getting at.

    @benjaminp
    Copy link
    Contributor

    2011/6/18 Vinay Sajip <report@bugs.python.org>:

    Vinay Sajip <vinay_sajip@yahoo.co.uk> added the comment:

    > Benjamin Peterson <benjamin@python.org> added the  comment:
    > >  Vinay Sajip <vinay_sajip@yahoo.co.uk> added the  comment:
    > >
    > > Sorry I'm being dense, but which file object do you  mean?
    >
    > The python file  object.

    Do you mean special-case handling of the circumstance when the file-like object
    being marshalled from is actually a file? The existing code paths (from when
    marshal.load expected only to work with Python file objects) use FILE *, and
    these are used by other code to read magic numbers etc. I believe. Or perhaps
    I'm still misunderstanding what you're getting at.

    When python uses dump() or load() with a file object, you can call
    fileno() (and then fopen?) to use it like a C-file object.

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 19, 2011

    The problem with calling fileno() and fdopen() is that you bypass the buffering information held in BufferedIOReader. The first call works, but the FILE * pointer is now positioned at 4K, rather than just past the end of the object just read. The next call fails.

    I verified that calling f.tell() after marshal.load(f) returns 4096, rather than just the size of the object read by the load().

    Just to be clear, here's what I did in marshal_load:

        int is_file = 0;
        int fd;
    
        data = PyObject_CallMethod(f, "fileno", "");
        if (data == NULL)
            PyErr_Clear();
        else {
            fd = PyLong_AsLong(data);
            Py_DECREF(data);
            is_file = 1;
        }
        if (is_file) {
            rf.readable = NULL;
            rf.fp = fdopen(fd, "rb");
        }
        else {
            /* what I was doing before to set up rf */
        }
        /* and on to the read_object call */

    @benjaminp
    Copy link
    Contributor

    2011/6/19 Vinay Sajip <report@bugs.python.org>:

    Vinay Sajip <vinay_sajip@yahoo.co.uk> added the comment:

    The problem with calling fileno() and fdopen() is that you bypass the buffering information held in BufferedIOReader. The first call works, but the FILE * pointer is now positioned at 4K, rather than just past the end of the object just read. The next call fails.

    But presumably once you have the fd, you can llseek().

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 19, 2011

    This seems a bit hacky, and I'm not sure how reliable it is. I added this after the read_object call:

        if (is_file) {
            PyObject * newpos;
            int cp, np;
    
            cp = ftell(rf.fp);
            newpos = PyObject_CallMethod(f, "seek", "ii", cp, SEEK_SET);
            assert(newpos != NULL);
            np = PyLong_AsLong(newpos);
            Py_DECREF(newpos);
            assert(cp == np);
        }

    When I run the code lots of times, I sometimes get assertion failures at the

    assert(newpos != NULL)

    line. It's not an EOF condition thing, necessarily: I do get expected behaviour at least sometimes when seeking to the end of file.

    @benjaminp
    Copy link
    Contributor

    2011/6/19 Vinay Sajip <report@bugs.python.org>:

    Vinay Sajip <vinay_sajip@yahoo.co.uk> added the comment:

    This seems a bit hacky, and I'm not sure how reliable it is. I added this after the read_object call:

       if (is_file) {
           PyObject * newpos;
           int cp, np;

           cp = ftell(rf.fp);
           newpos = PyObject_CallMethod(f, "seek", "ii", cp, SEEK_SET);

    Why not 0?

           assert(newpos != NULL);
           np = PyLong_AsLong(newpos);
           Py_DECREF(newpos);
           assert(cp == np);
       }

    When I run the code lots of times, I sometimes get assertion failures at the

    assert(newpos != NULL)

    That's because the call is failing. Why?

    line. It's not an EOF condition thing, necessarily: I do get expected behaviour at least sometimes when seeking to the end of file.

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 19, 2011

    Benjamin Peterson <benjamin@python.org> added the comment:

    > assert(newpos != NULL)

    That's because the call is failing. Why?

    It's seemingly because the Python code did a seek (in Python) which was not
    communicated to the FILE object; after reading all objects from the file, a seek
    was done to the beginning to start reading again. On the previous call, the FILE
    would have been at EOF (at a C level). So it seems as if we have to call ftell
    on the Python object at the beginning of the read, and do an fseek to sync the
    positions. It's doable, of course, so it's the next thing I'll try, but what I'm
    worried about is the thing turning into a bit of a rabbit-hole.

    I haven't looked at the io implementation, but for this sort of synchronisation,
    it would be better if when based on a FILE object, the io implementation
    delegated all its operations to the FILE object - but it doesn't look like
    that's the case, and I'm uncomfortable that there may be some undesirable
    consequences of that.

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 19, 2011

    @benjamin: I missed commenting on your "Why not 0?", but here's the reasoning: one can't assume that the file only contains one object to be read, at the beginning of the file. It may be that some data is being written to file using marshal.dump, interspersed with other data, and then the stream is being read in at a later time, with marshal.load called to load a previously saved object. In that scenario, why would 0 be always the correct offset value to pass to fseek?

    I am synchronising the Python object with the FILE *, but not the other way around - in the failing case I mentioned, external Python code has positioned the io object to zero, but that of course will be overwritten to point the io object back to where the FILE object is positioned. Perhaps I just need to go the other way ...

    I'm not sure how reliable a solution will be which tries to work around there apparently being two buffering implementations - at the io level and at the FILE level, and trying to keep them synced in a seemingly ad hoc fashion.

    It's perfectly possible that I don't know what I'm talking about, so I'd welcome an improved patch using the ideas you've suggested. It's been a very long time since I wrote C stdio code regularly ;-)

    @benjaminp
    Copy link
    Contributor

    I think you're right about playing with the bare fd being too fragile. I think a simpler solution is to read say 1024 bytes at a time and buffer it internally.

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 19, 2011

    Benjamin Peterson <benjamin@python.org> added the comment:

    I think you're right about playing with the bare fd being too fragile. I think
    a simpler solution is to read say 1024 bytes at a time and buffer it
    internally.

    Doesn't this suffer from a similar problem? Namely, external Python code
    expecting the stream pointer to always be just past the object just read. See my
    earlier comments pointing out that there's nowhere to store the buffer state
    between successive calls to marshal.load. And the synchronising can be a problem
    to achieve with non-seekable streams (including, but not limited to, sockets).

    I'm not trying to be difficult. No, really! :-)

    @grubert
    Copy link
    Mannequin

    grubert mannequin commented Jun 25, 2011

    patch to test_marshal.py that obviously fails in current implementation.

    IMHO is file might not be seekable one can not cache so maybe do not do it.

    @grubert
    Copy link
    Mannequin

    grubert mannequin commented Jun 26, 2011

    add interleaved writing to the same file (which might happen or not?)

    the tests pass when vinay's marshal.c is applied.

    sorry for another patch

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 26, 2011

    I've incorporated Engelbert's two patches into test_marshal.py in my sandbox fix branch. I coalesced them into a single additional test in the existing BugsTestCase.

    @vsajip vsajip added the type-bug An unexpected behavior, bug, or error label Jun 26, 2011
    @pitrou
    Copy link
    Member

    pitrou commented Jun 27, 2011

    Le Sun, 26 Jun 2011 19:49:03 +0000,
    Vinay Sajip <report@bugs.python.org> a écrit :

    Added file: http://bugs.python.org/file22487/0feab4e7b27f.diff

    Just a nit, could you give descriptive file names to your patches?
    Hex numbers quickly get confusing.

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 27, 2011

    Just a nit, could you give descriptive file names to your patches?
    Hex numbers quickly get confusing.

    Ok - I was under the impression that those names were generated automatically from the changeset hash, and that changing the name arbitrarily would break something. Is it not better/sufficient if I just update the description?

    @pitrou
    Copy link
    Member

    pitrou commented Jun 30, 2011

    I think the question is: will the slowdown apply to module import, or only to explicit uses of the marshal module? If the latter, then I think we can live with it - we discourage using marshal as a general-purpose serialization scheme anyway.

    As for the patch:

    • why do the tests have to carry a large chunk of binary data? if some data is needed, at least it would be nice if it were kept small
    • not sure either why it needs zlib and base64; just use the repr() of the bytestring
    • "assertEqual = self.assertEqual" is more of a nuisance than anything else; you're making the code more complicated to read just to save a few keystrokes; same for "assertIsInstance = self.assertIsInstance"
    • can you add a comment next to the fields you're adding to the marshal C struct?
    • if the "r_byte" macro isn't used anymore, it should be removed, not commented out
    • in r_string(), what happens if PyBytes_Check(data) is false? there should probably be a TypeError of some sort
    • in r_string() again, if data is NULL after the read() call, the exception shouldn't be shadowed by an EOFError
    • why do you call read(1) at the beginning? it seems to make the code more complicated for no useful purpose

    @vsajip
    Copy link
    Member Author

    vsajip commented Jun 30, 2011

    Antoine Pitrou <pitrou@free.fr> added the comment:

    I think the question is: will the slowdown apply to module import, or only to
    explicit uses of the marshal module? If the latter, then I think we can live
    with it - we discourage using marshal as a general-purpose serialization scheme
    anyway.

    Thanks for reviewing the patch.

    As to your point - agreed, and as the marshal code is completely broken now,
    something that works is an improvement, even if it's slower than optimal. If
    that turns out to be a problem in practice, it can be fixed.

    As for the patch:

    • why do the tests have to carry a large chunk of binary data? if some data is
      needed, at least it would be nice if it were kept small

    Agreed, I just used the initial file where I had the problem - didn't know where
    the problem was, initially. Will look at reducing this.

    • not sure either why it needs zlib and base64; just use the repr() of the
      bytestring

    The original data was 53K, so just the repr of the bytestring of the raw data is
    around 150K. If I zip the data, it's about 4K, but the repr of that data is
    around 12K. The base64/zlibbed version is 5K.

    This will be less of an issue when the test data size is reduced.

    • "assertEqual = self.assertEqual" is more of a nuisance than anything else;
      you're making the code more complicated to read just to save a few keystrokes;
      same for "assertIsInstance = self.assertIsInstance"

    I'm not sure how it's more complicated or harder to understand. I didn't do it
    just to save keystrokes when writing, it's also to avoid noise when reading. IMO
    this is a question of personal taste, or is this approach proscribed somewhere?
    I've certainly seen it in other Python code.

    • can you add a comment next to the fields you're adding to the marshal C
      struct?

    Yes, will do.

    • if the "r_byte" macro isn't used anymore, it should be removed, not
      commented out

    The commenting out is a temporary measure, the comment will be removed before
    commit.

    • in r_string(), what happens if PyBytes_Check(data) is false? there should
      probably be a TypeError of some sort

    There is a PyBytes_Check in marshal_load which raises a TypeError if it fails.
    The PyBytes_Check in r_string is only called in the path from marshal_load
    (because p->readable is only non-NULL in that path) - so for the failure to
    occur one would need an initial read to deliver bytes, and a subsequent read to
    deliver non-bytes. Still, I have no problem adding a TypeError raising in
    r_string if PyBytes_Check fails.

    • in r_string() again, if data is NULL after the read() call, the exception
      shouldn't be shadowed by an EOFError

    Good point, I aim to fix this by changing the lower condition to

    if (!PyErr_Occurred() && (read < n)) { ... }

    which should be sufficient - do you agree?

    • why do you call read(1) at the beginning? it seems to make the code more
      complicated for no useful purpose

    I gave the reasoning in the comment just above the read(1). I agree it makes it
    a little more complicated, but not onerously so - and the approach fails fast as
    well as allowing operation with any stream, not just random-seekable ones.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 30, 2011

    > - "assertEqual = self.assertEqual" is more of a nuisance than anything else;
    >you're making the code more complicated to read just to save a few keystrokes;
    >same for "assertIsInstance = self.assertIsInstance"

    I'm not sure how it's more complicated or harder to understand. I didn't do it
    just to save keystrokes when writing, it's also to avoid noise when reading. IMO
    this is a question of personal taste, or is this approach proscribed somewhere?

    It's not proscribed, but trying to remove the "self." because it's
    supposed to be more readable is a bit of a strange thing to do.
    Also, people reading the test suite should be accustomed to
    "self.assertEqual" anyway, so there's no point trying to hide it.

    > - in r_string(), what happens if PyBytes_Check(data) is false? there should
    >probably be a TypeError of some sort

    There is a PyBytes_Check in marshal_load which raises a TypeError if it fails.
    The PyBytes_Check in r_string is only called in the path from marshal_load
    (because p->readable is only non-NULL in that path) - so for the failure to
    occur one would need an initial read to deliver bytes, and a subsequent read to
    deliver non-bytes. Still, I have no problem adding a TypeError raising in
    r_string if PyBytes_Check fails.

    Error checking can't just be probabilistic. Perhaps there's a bug in the
    file-like object; or perhaps it is a non-blocking IO object and read()
    will return None at times.

    > - why do you call read(1) at the beginning? it seems to make the code more
    >complicated for no useful purpose

    I gave the reasoning in the comment just above the read(1). I agree it makes it
    a little more complicated, but not onerously so - and the approach fails fast as
    well as allowing operation with any stream, not just random-seekable ones.

    Well, it wouldn't fail any slower if you didn't do it, since you need to
    call read() very soon anyway (presumably as part of the same call to
    marshal.load()). Failing "fast" doesn't seem to bring anything here. My
    vote is for removing the complication.

    @vsajip
    Copy link
    Member Author

    vsajip commented Jul 1, 2011

    Antoine Pitrou <pitrou@free.fr> added the comment:

    It's not proscribed, but trying to remove the "self." because it's
    supposed to be more readable is a bit of a strange thing to do.
    Also, people reading the test suite should be accustomed to
    "self.assertEqual" anyway, so there's no point trying to hide it.

    It wasn't particularly about self - I'm not against it. Anyway, it's not a big
    deal for me, so I've added the selves back :-)

    Error checking can't just be probabilistic. Perhaps there's a bug in the
    file-like object; or perhaps it is a non-blocking IO object and read()
    will return None at times.

    You're right, so I've raised a TypeError if PyBytes_Check fails in r_string.

    Well, it wouldn't fail any slower if you didn't do it, since you need to
    call read() very soon anyway (presumably as part of the same call to
    marshal.load()). Failing "fast" doesn't seem to bring anything here. My
    vote is for removing the complication.

    Actually I misremembered the complete reason for the call - it was there to
    additionally check that the passed object has a read method. I also realised -
    duh - that I can read zero bytes and still get an empty bytes object back, so
    I've done that, and it does look cleaner. I've also reorganised the marshal_load
    function a little so it flows better.

    Your other points have also all been addressed:

    I've seen that the original test with the long binary data was exercising the
    same functionality as Engelbert Gruber's later patch was testing. So I've
    removed my earlier test entirely and added tuples and lists into the data being
    marshalled in Engelbert's version of the test. As a result of these changes, the
    zlib/base64 dependency goes away.

    The new "readable" field of the C struct has a comment saying what it is.

    The commented out r_byte macro has been removed.

    Thanks, again, for the review!

    @vsajip
    Copy link
    Member Author

    vsajip commented Jul 1, 2011

    As a result of the changes to marshal.c, test_importlib needs a small change: the test_bad_marshal raises an EOFError now, whereas it raised ValueError before.

    I think it's because the earlier code in marshal didn't properly check for EOF conditions in some places.

    So I've changes the assertRaises(ValueError) to assertRaises(EOFError).

    All tests now pass, other than test_ftplib (not related).

    @pitrou
    Copy link
    Member

    pitrou commented Jul 1, 2011

    Latest patch looks ok to me.

    @birkenfeld
    Copy link
    Member

    Can we get this committed for 3.2.1 then?

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jul 2, 2011

    New changeset edba722f3b02 by Vinay Sajip in branch '3.2':
    Closes bpo-12291: Fixed bug which was found when doing multiple loads from one stream.
    http://hg.python.org/cpython/rev/edba722f3b02

    @python-dev python-dev mannequin closed this as completed Jul 2, 2011
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jul 2, 2011

    New changeset 42dd11028e94 by Vinay Sajip in branch 'default':
    Closes bpo-12291 for 3.3 - merged fix from 3.2.
    http://hg.python.org/cpython/rev/42dd11028e94

    @birkenfeld
    Copy link
    Member

    Thanks!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    release-blocker type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants