New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
file written using marshal in 3.2 can be read by 2.7, but not 3.2 or 3.3 #56500
Comments
The attached file 'data.bin' was written using Python 3.2. It can be read by Python 2.7, but in 3.2 and 3.3, after the first object is read, the file pointer is positioned at EOF, causing an error on subsequent reads. A simple test script 'marshtest.py' is below: import marshal
import sys
f = open('data.bin', 'rb')
t = marshal.load(f)
print('Python version:', sys.version)
print('Object just loaded was:\n%r' % (t,))
print('File position is now at %d' % f.tell())
t = marshal.load(f)
print('Object just loaded was:\n%r' % (t,))
print('File position is now at %d' % f.tell()) Results of running it under various Python versions: vinay@eta-natty:~/projects/scratch$ python marshtest.py vinay@eta-natty:~/projects/scratch$ python3.2 marshtest.py
Python version: 3.2 (r32:88445, Mar 25 2011, 19:28:28)
[GCC 4.5.2]
Object just loaded was:
('text', 'alfa', 202, 1.0, b'\x00\x00\x00\x01]q\x00K\x03a')
File position is now at 53617
Traceback (most recent call last):
File "marshtest.py", line 9, in <module>
t = marshal.load(f)
EOFError: EOF read where object expected
vinay@eta-natty:~/projects/scratch$ python3.3 marshtest.py
Python version: 3.3a0 (default:8d4d87dd73ae, Apr 2 2011, 14:25:31)
[GCC 4.5.2]
Object just loaded was:
('text', 'alfa', 202, 1.0, b'\x00\x00\x00\x01]q\x00K\x03a')
File position is now at 53617
Traceback (most recent call last):
File "marshtest.py", line 9, in <module>
t = marshal.load(f)
EOFError: EOF read where object expected Note the size of the file is 53617 bytes. vinay@eta-natty:~/projects/scratch$ ls -l data.bin |
Sadly, marshal.load() looks broken:
- The code was probably converted too quickly:
if (PyBytes_Check(data)) {
...
else if (PyBytes_Check(data)) { |
I've added a patch which I think fixes marshal.load(), and would be grateful for a review - it's my first C patch for Python. The test_marshal test passes, though I haven't yet added a test for the failing scenario. |
Also, calling f.read(1) for each character may have a large performance impact. I don't know how this can be solved though. |
It should probably be buffered at 512 bytes or some other reasonable number. |
Sounds blocker-ish enough to me. |
@Amaury: Re formatting and comment style - I will address this. The diff I posted is to the default branch rather than 3.2, though it won't change the main substance of the patch. Re. performance and buffering: I agree this could have an impact (although strings are not actually read a byte at a time), but it's not easy to address because even if you e.g. wrapped the incoming stream with a BufferedReader, you have nowhere to keep the resulting instance in between multiple calls to marshal.load() from external code. For best performance ISTM you'd have to implement your own buffering - but surely this is to be avoided? It can't be the only place where you need good I/O performance from file-like objects. A quick scan of the pickle module didn't show me anything I could easily use, and a key difference from pickle is that you don't have a corresponding "Unmarshaller" object which could hold the state that an Unpickler does, and which would be needed to hold the buffering state. Of course one could implement limited buffering, i.e. for just the duration of a single marshal.load() call, but perhaps this could be done as a later step after evaluating the actual performance of this implementation. Having something that works correctly, albeit slowly, is better than having the current situation ... |
Another thought about buffering, which might be a bit of a show-stopper: if in a particular load() call we read ahead for buffering purposes, we've altered the underlying stream (which might not be seekable to allow position restoring). The next call to load() will be likely to fail, since the underlying stream has been positioned further than expected by the earlier load() call, and the buffer state has been lost. Not sure how easy it is to get around this ... |
Attached is an improved patch:
Of course it's not as fast as the 2.x version, since that had a limitation on only being able to load from file objects, and a fast path for PyFile using stdio calls directly. Hard to compete with that! Suggestions for improvements gratefully received :-) |
Patch is now in my public sandbox on hg.python.org. |
For some reason, "Create Patch" is failing with a [Errno 2] No such file or directory: '/home/roundup/trackers/tracker/cpython/Doc/Makefile' I've logged an issue on the meta tracker: |
Any reviewers? |
Why can't you just call fileno() on the file object? |
Sorry I'm being dense, but which file object do you mean? |
2011/6/18 Vinay Sajip <report@bugs.python.org>:
The python file object. |
Do you mean special-case handling of the circumstance when the file-like object |
2011/6/18 Vinay Sajip <report@bugs.python.org>:
When python uses dump() or load() with a file object, you can call |
The problem with calling fileno() and fdopen() is that you bypass the buffering information held in BufferedIOReader. The first call works, but the FILE * pointer is now positioned at 4K, rather than just past the end of the object just read. The next call fails. I verified that calling f.tell() after marshal.load(f) returns 4096, rather than just the size of the object read by the load(). Just to be clear, here's what I did in marshal_load: int is_file = 0;
int fd;
data = PyObject_CallMethod(f, "fileno", "");
if (data == NULL)
PyErr_Clear();
else {
fd = PyLong_AsLong(data);
Py_DECREF(data);
is_file = 1;
}
if (is_file) {
rf.readable = NULL;
rf.fp = fdopen(fd, "rb");
}
else {
/* what I was doing before to set up rf */
}
/* and on to the read_object call */ |
2011/6/19 Vinay Sajip <report@bugs.python.org>:
But presumably once you have the fd, you can llseek(). |
This seems a bit hacky, and I'm not sure how reliable it is. I added this after the read_object call: if (is_file) {
PyObject * newpos;
int cp, np;
cp = ftell(rf.fp);
newpos = PyObject_CallMethod(f, "seek", "ii", cp, SEEK_SET);
assert(newpos != NULL);
np = PyLong_AsLong(newpos);
Py_DECREF(newpos);
assert(cp == np);
} When I run the code lots of times, I sometimes get assertion failures at the assert(newpos != NULL) line. It's not an EOF condition thing, necessarily: I do get expected behaviour at least sometimes when seeking to the end of file. |
2011/6/19 Vinay Sajip <report@bugs.python.org>:
Why not 0?
That's because the call is failing. Why?
|
It's seemingly because the Python code did a seek (in Python) which was not I haven't looked at the io implementation, but for this sort of synchronisation, |
@benjamin: I missed commenting on your "Why not 0?", but here's the reasoning: one can't assume that the file only contains one object to be read, at the beginning of the file. It may be that some data is being written to file using marshal.dump, interspersed with other data, and then the stream is being read in at a later time, with marshal.load called to load a previously saved object. In that scenario, why would 0 be always the correct offset value to pass to fseek? I am synchronising the Python object with the FILE *, but not the other way around - in the failing case I mentioned, external Python code has positioned the io object to zero, but that of course will be overwritten to point the io object back to where the FILE object is positioned. Perhaps I just need to go the other way ... I'm not sure how reliable a solution will be which tries to work around there apparently being two buffering implementations - at the io level and at the FILE level, and trying to keep them synced in a seemingly ad hoc fashion. It's perfectly possible that I don't know what I'm talking about, so I'd welcome an improved patch using the ideas you've suggested. It's been a very long time since I wrote C stdio code regularly ;-) |
I think you're right about playing with the bare fd being too fragile. I think a simpler solution is to read say 1024 bytes at a time and buffer it internally. |
Doesn't this suffer from a similar problem? Namely, external Python code I'm not trying to be difficult. No, really! :-) |
patch to test_marshal.py that obviously fails in current implementation. IMHO is file might not be seekable one can not cache so maybe do not do it. |
add interleaved writing to the same file (which might happen or not?) the tests pass when vinay's marshal.c is applied. sorry for another patch |
I've incorporated Engelbert's two patches into test_marshal.py in my sandbox fix branch. I coalesced them into a single additional test in the existing BugsTestCase. |
Le Sun, 26 Jun 2011 19:49:03 +0000,
Just a nit, could you give descriptive file names to your patches? |
Ok - I was under the impression that those names were generated automatically from the changeset hash, and that changing the name arbitrarily would break something. Is it not better/sufficient if I just update the description? |
I think the question is: will the slowdown apply to module import, or only to explicit uses of the marshal module? If the latter, then I think we can live with it - we discourage using marshal as a general-purpose serialization scheme anyway. As for the patch:
|
Thanks for reviewing the patch. As to your point - agreed, and as the marshal code is completely broken now,
Agreed, I just used the initial file where I had the problem - didn't know where
The original data was 53K, so just the repr of the bytestring of the raw data is This will be less of an issue when the test data size is reduced.
I'm not sure how it's more complicated or harder to understand. I didn't do it
Yes, will do.
The commenting out is a temporary measure, the comment will be removed before
There is a PyBytes_Check in marshal_load which raises a TypeError if it fails.
Good point, I aim to fix this by changing the lower condition to if (!PyErr_Occurred() && (read < n)) { ... } which should be sufficient - do you agree?
I gave the reasoning in the comment just above the read(1). I agree it makes it |
It's not proscribed, but trying to remove the "self." because it's
Error checking can't just be probabilistic. Perhaps there's a bug in the
Well, it wouldn't fail any slower if you didn't do it, since you need to |
It wasn't particularly about self - I'm not against it. Anyway, it's not a big
You're right, so I've raised a TypeError if PyBytes_Check fails in r_string.
Actually I misremembered the complete reason for the call - it was there to Your other points have also all been addressed: I've seen that the original test with the long binary data was exercising the The new "readable" field of the C struct has a comment saying what it is. The commented out r_byte macro has been removed. Thanks, again, for the review! |
As a result of the changes to marshal.c, test_importlib needs a small change: the test_bad_marshal raises an EOFError now, whereas it raised ValueError before. I think it's because the earlier code in marshal didn't properly check for EOF conditions in some places. So I've changes the assertRaises(ValueError) to assertRaises(EOFError). All tests now pass, other than test_ftplib (not related). |
Latest patch looks ok to me. |
Can we get this committed for 3.2.1 then? |
New changeset edba722f3b02 by Vinay Sajip in branch '3.2': |
New changeset 42dd11028e94 by Vinay Sajip in branch 'default': |
Thanks! |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: