classification
Title: pickle.load expects an object that implements readinto
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.9, Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: pitrou Nosy List: Nathan.Goldbaum, lukasz.langa, miss-islington, pitrou
Priority: normal Keywords: patch

Created on 2020-02-19 01:04 by Nathan.Goldbaum, last changed 2020-02-24 16:19 by Nathan.Goldbaum. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 18592 merged pitrou, 2020-02-21 13:00
PR 18630 merged miss-islington, 2020-02-23 22:34
Messages (9)
msg362242 - (view) Author: Nathan Goldbaum (Nathan.Goldbaum) Date: 2020-02-19 01:04
As of https://github.com/python/cpython/pull/7076, it looks like at least the C implementation of pickle.load expects the file argument to implement readinto:

https://github.com/python/cpython/blob/ffd9753a944916ced659b2c77aebe66a6c9fbab5/Modules/_pickle.c#L1617-L1622

This is a change in behavior relative to previous versions of Python and I don't see it mentioned in PEP 574 or in the pull request so I'm not sure why it was changed.

This change breaks some PyTorch tests (see https://github.com/pytorch/pytorch/issues/32289) and, at least one PyTorch user, although I don't have full details there.

I can try to fix this on the PyTorch side but I first want to check that this was an intentional change on the Python side of things.
msg362259 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-02-19 09:06
Right, I think at some point during the PEP 574 implementation review, we decided that the input object had to be a proper file implementation
(for example, deriving from io.BufferedIOBase).

I think it should be easy for PyTorch to fix this internally (by deriving from the proper io base class), but we can also add a fix to pickle.
msg362281 - (view) Author: Nathan Goldbaum (Nathan.Goldbaum) Date: 2020-02-19 15:50
In this case the tests are explicitly testing that a file-like object that does not implement readinto works with torch.load (which is using pickles under the hood). See https://github.com/pytorch/pytorch/blob/master/test/test_serialization.py#L416-L429, in particular the usage of FilelikeMock with has_readinto set to False.

This goes back to https://github.com/pytorch/pytorch/pull/5466, however I'm not sure offhand why it was decided to explicitly test file-like objects that don't implement readinto. I don't know how many consumers there are for this API in pytorch, but there is at least one user who has reported an error traceback caused by this change. I'm hoping to hear more to see whether there are any libraries that depend on pytorch and are implicitly depending on this functionality.

Googling the error message that Python3.8 emits in this situation also leads me to this bug report in a different library: https://github.com/web2py/py4web/issues/77
msg362291 - (view) Author: Nathan Goldbaum (Nathan.Goldbaum) Date: 2020-02-19 21:44
So I *think* I've pieced together what caused the user crash that originated in the flair library. It turns out that pickle.load, via torch.load, is getting passed an mmap.mmap. 

https://github.com/flairNLP/flair/blob/1d44abf35f1122d1e146c9a219b2cb5f98b41b40/flair/file_utils.py#L25-L36

https://github.com/flairNLP/flair/blob/1d44abf35f1122d1e146c9a219b2cb5f98b41b40/flair/nn.py#L85-L86

Since mmap doesn't implement readinto, pickle.load objects as of Python 3.8. This is new behavior in Python3.8, it used to be possible to load a memory-mapped pickle file.

Short repro script:

import pickle
import mmap
data = "some data"

with open('my_data.pkl', 'wb') as f:
    pickle.dump(data, f)

with open("my_data.pkl", "r+b") as f_in:
    mm = mmap.mmap(f_in.fileno(), 0)

print(pickle.load(mm))

On Python3.8, this script prints an error, on Python3.7 it prints "some data".
msg362400 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-02-21 10:04
Well, in the mmap case, this is trivially fixed by writing:
```
with open("my_data.pkl", "r+b") as f_in:
    mm = mmap.mmap(f_in.fileno(), 0)

print(pickle.loads(memoryview(mm)))
```

It will also be faster this way, since the pickle will be read directly from memory instead of going through read() calls.

This makes me think that perhaps this issue isn't as bad: it forces people to abandon bad idioms :-)
msg362546 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2020-02-23 22:34
New changeset 9f37872e307734666a7169f7be6e3370d3068282 by Antoine Pitrou in branch 'master':
bpo-39681: Fix C pickle regression with minimal file-like objects (#18592)
https://github.com/python/cpython/commit/9f37872e307734666a7169f7be6e3370d3068282
msg362548 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2020-02-23 22:53
New changeset b19f7ecfa3adc6ba1544225317b9473649815b38 by Miss Islington (bot) in branch '3.8':
bpo-39681: Fix C pickle regression with minimal file-like objects (GH-18592) (#18630)
https://github.com/python/cpython/commit/b19f7ecfa3adc6ba1544225317b9473649815b38
msg362585 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-02-24 13:06
The regression is fixed in git master and in the 3.8 branch. However, I would still advice against the idiom of handling a mmap object as a regular file when using pickle.
msg362596 - (view) Author: Nathan Goldbaum (Nathan.Goldbaum) Date: 2020-02-24 16:19
Thank you for the fix! Yes I'm planning to file an issue with flair about this and patch this use case in PyTorch itself.
History
Date User Action Args
2020-02-24 16:19:04Nathan.Goldbaumsetmessages: + msg362596
2020-02-24 13:06:05pitrousetstatus: open -> closed
resolution: fixed
messages: + msg362585

stage: patch review -> resolved
2020-02-23 22:53:27lukasz.langasetmessages: + msg362548
2020-02-23 22:34:06miss-islingtonsetnosy: + miss-islington
pull_requests: + pull_request17995
2020-02-23 22:34:00lukasz.langasetnosy: + lukasz.langa
messages: + msg362546
2020-02-21 13:01:01pitrousetassignee: pitrou
type: behavior
2020-02-21 13:00:20pitrousetkeywords: + patch
stage: patch review
pull_requests: + pull_request17960
2020-02-21 10:04:16pitrousetmessages: + msg362400
2020-02-19 21:44:28Nathan.Goldbaumsetmessages: + msg362291
2020-02-19 15:50:39Nathan.Goldbaumsetmessages: + msg362281
2020-02-19 09:06:47pitrousetmessages: + msg362259
versions: + Python 3.9
2020-02-19 01:24:40xtreaksetnosy: + pitrou
2020-02-19 01:04:49Nathan.Goldbaumcreate