Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datetime.__setstate__ fails decoding python2 pickle #66204

Closed
eddygeek mannequin opened this issue Jul 18, 2014 · 33 comments
Closed

datetime.__setstate__ fails decoding python2 pickle #66204

eddygeek mannequin opened this issue Jul 18, 2014 · 33 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@eddygeek
Copy link
Mannequin

eddygeek mannequin commented Jul 18, 2014

BPO 22005
Nosy @tim-one, @gpshead, @abalkin, @pitrou, @avassalotti, @serhiy-storchaka, @miss-islington
PRs
  • bpo-22005: Fixed unpickling instances of datetime classes pickled by Python 2. #794
  • bpo-22005: Fixed unpickling instances of datetime classes pickled by Python 2. #11017
  • [3.7] bpo-22005: Fixed unpickling instances of datetime classes pickled by Python 2. (GH-11017) #11022
  • [3.6] [3.7] bpo-22005: Fixed unpickling instances of datetime classes pickled by Python 2. (GH-11017) (GH-11022) #11024
  • bpo-22005: Fix condition for unpickling a date object. #11025
  • bpo-22005: Document the reality of pickle compatibility. #11054
  • [3.7] bpo-22005: Document the reality of pickle compatibility. (GH-11054) #11055
  • Files
  • unpickle_datetime_surrogateescape.patch
  • datetime_portable_pickle-2.7.patch: Makes pickling in 2.7 portable
  • datetime_pickle_bytes-2.7.patch: Makes pickling in 2.7 produce the same data as in 3.x
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2018-12-09.20:12:39.299>
    created_at = <Date 2014-07-18.12:49:52.735>
    labels = ['3.7', '3.8', 'type-bug', 'library']
    title = 'datetime.__setstate__ fails decoding python2 pickle'
    updated_at = <Date 2018-12-09.20:12:39.298>
    user = 'https://bugs.python.org/eddygeek'

    bugs.python.org fields:

    activity = <Date 2018-12-09.20:12:39.298>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2018-12-09.20:12:39.299>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2014-07-18.12:49:52.735>
    creator = 'eddygeek'
    dependencies = []
    files = ['40758', '40759', '40760']
    hgrepos = []
    issue_num = 22005
    keywords = ['patch']
    message_count = 33.0
    messages = ['223408', '223426', '223579', '223587', '252857', '252877', '252880', '252882', '252883', '252884', '252886', '252887', '253042', '253058', '253071', '263380', '288821', '331241', '331298', '331307', '331313', '331319', '331326', '331332', '331333', '331334', '331340', '331421', '331446', '331447', '331449', '331450', '331451']
    nosy_count = 10.0
    nosy_names = ['tim.peters', 'gregory.p.smith', 'belopolsky', 'pitrou', 'bronger', 'alexandre.vassalotti', 'serhiy.storchaka', 'eddygeek', 'tanzer@swing.co.at', 'miss-islington']
    pr_nums = ['794', '11017', '11022', '11024', '11025', '11054', '11055']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue22005'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @eddygeek
    Copy link
    Mannequin Author

    eddygeek mannequin commented Jul 18, 2014

    pickle.loads raises a TypeError when calling the datetime constructor, (then a UnicodeEncodeError in the load_reduce function).

    A short test program & the log, including dis output of both PY2 and PY3 pickles, are available in this gist; and extract on stackoverflow:
    https://gist.github.com/eddy-geek/191f15871c1b9f801b76
    http://stackoverflow.com/questions/24805105/

    I am using pickle.dumps(reply, protocol=2) in PY2
    then pickle._loads(pickled, fix_imports=True, encoding='latin1') in PY3
    (tried None and utf-8 without success)

    Native cPickle loads decoding fails too, I am only using pure python's _loads for debugging.

    Sorry if this is misguided (first time here)
    Regards,
    Edward

    @eddygeek eddygeek mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jul 18, 2014
    @tim-one
    Copy link
    Member

    tim-one commented Jul 18, 2014

    I have no idea what was done to pickle for Python3, but this line works for me to unpickle a Python2 protocol 2 datetime pickle under Python3, where P2 is the Python2 pickle string:

    pickle.loads(bytes(P2, encoding='latin1'), encoding='bytes')
    

    For example,

    >>> P2
    '\x80\x02cdatetime\ndatetime\nq\x00U\n\x07Þ\x07\x12\r%%\x06á¸q\x01\x85q\x02Rq\x03.'
    >>> pickle.loads(bytes(P2, encoding='latin1'), encoding='bytes')
    datetime.datetime(2014, 7, 18, 13, 37, 37, 451000)

    I don't understand the Python3 loads() docs with respect to the "encoding" and "errors" arguments, and can't guess whether this is the intended way. It seems at best highly obscure. But hard to argue with something that works ;-)

    @eddygeek
    Copy link
    Mannequin Author

    eddygeek mannequin commented Jul 21, 2014

    The code works when using encoding='bytes'. Thanks Tim for the suggestion.

    So this is not a bug, but is there any sense in having encoding='ASCII' by default in pickle ?

    @tim-one
    Copy link
    Member

    tim-one commented Jul 21, 2014

    @EddyGeek, I'd still call something so unintuitive "a bug" - it's hard to believe this is the _intended_ way to get it to work. So I'd keep this open until someone with better knowledge of intent chimes in.

    @tanzerswingcoat
    Copy link
    Mannequin

    tanzerswingcoat mannequin commented Oct 12, 2015

    The code works when using encoding='bytes'. Thanks Tim for the suggestion.

    So this is not a bug, but is there any sense in having encoding='ASCII' by default in pickle ?

    It is most definitely a bug. And it adds another road block to moving python applications from 2.7 to 3.x!

    encoding='bytes' has serious side effects and isn't useful in the general case. For instance, it will result in dict-keys being unpickled as bytes instead of as str after which hilarity ensues.

    I got the exception

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 1: ordinal not in range(128)

    when testing an application for compatibility in Python 3.5 on a pickle created by Python 2.7. The pickled data is a nested data structure and it took me quite a while to determine that the single datetime instance was the culprit.

    Here is a small test case that reproduces the problem::

    # -*- coding: utf-8 -*-
    # pickle_dump.py 
    import datetime, pickle, uuid
    dti = datetime.datetime(2015, 10, 12, 13, 17, 42, 123456)
    data = { "ascii" : "abc", "text" : u"äbc", "int" :  42, "date-time" : dti }
    with open("/tmp/pickle.test", "wb") as file :
        pickle.dump(data, file, protocol=2)
    
    # pickle_load.py
    # -*- coding: utf-8 -*-
    import pickle
    with open("/tmp/pickle.test", "rb") as file :
        data = pickle.load(file)
    print(data)
    $ python2.7 pickle_dump.py
    $ python2.7 pickle_load.py 
    {'ascii': 'abc', 'text': u'\xe4bc', 'int': 42, 'date-time': datetime.datetime(2015, 10, 12, 13, 17, 42, 123456)}
    $ python3.5 pickle_load.py 
    Traceback (most recent call last):
      File "pickle_load.py", line 6, in <module>
        data = pickle.load(file)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 1: ordinal not in range(128)

    That error message is spectacularly useless.

    @serhiy-storchaka
    Copy link
    Member

    There are two issues here.

    1. datetime.datetime accepts only bytes, not str.
    2. Unpickling non-ASCII str pickled in Python 2 raises an error by default.

    The second issue usually hides the first one. The demonstration of the first issue:

    >>> pickle.loads(b'cdatetime\ndatetime\n(U\n\x07l\x01\x01\x00\x00\x00\x00\x00\x00tR.')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: an integer is required (got type str)

    The first issue can be solved by accepting str argument and encoding it to bytes. The second issue can be solved by changing an encoding or an error handler. Following patch uses the "surrogateescape" error handler.

    >>> pickle.loads(b'cdatetime\ndatetime\n(U\n\x07l\x01\x01\x00\x00\x00\x00\xc3\xa4tR.')
    datetime.datetime(1900, 1, 1, 0, 0, 0, 50084)

    Unfortunately setting the "surrogateescape" error handler by default has a side effect. It can hide string decoding errors. In addition, unpickling datetime will not always work with different encodings.

    >>> pickle.loads(b'cdatetime\ndatetime\n(U\n\x07l\x01\x01\x00\x00\x00\x00\xc3\xa4tR.', encoding='latin1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)
    >>> pickle.loads(b'cdatetime\ndatetime\n(U\n\x07l\x01\x01\x00\x00\x00\x00\xc3\xa4tR.', encoding='utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: an integer is required (got type str)

    @abalkin
    Copy link
    Member

    abalkin commented Oct 12, 2015

    The first issue can be solved by accepting str argument and encoding it to bytes.

    A strong -1 from me. Accepting bytes objects for year in 3.x (and str in 2.x) is a gross hack. In the long run, I would like to see a public named constructor, e.g. datetime.datetime.load to be used in datetime pickles.

    Can someone explain succinctly what the problem is? Does it only affect pickles transferred between 2.x and 3.x Pythons? If not, how can I reproduce the problem in 3.5?

    @serhiy-storchaka
    Copy link
    Member

    The problem is that you can't unpickle a data that contains both datetime
    classes (datetime, date, time) instances and strings (including attribute
    names, so actually this affects instances of any Python classes). Yes, it only
    affects pickles transferred between 2.x and 3.x Pythons.

    Yet one possible solution is to change datetime classes in 2.x to produce more
    portable pickles. But new pickles will be larger, and pickling and unpickling
    will be slower, and this doesn't solve a problem with existing pickled data.
    We still are receiving bug reports for 2.7.3 and like.

    @abalkin
    Copy link
    Member

    abalkin commented Oct 12, 2015

    I wonder if this can be fixed using a fix_imports hook. I agree, it would be nice to fix this issue by modifying 3.x versions only.

    @abalkin
    Copy link
    Member

    abalkin commented Oct 12, 2015

    .. pickling and unpickling will be slower

    If we are concerned about performance, we should definitely avoid the decode-encode roundtrip.

    @serhiy-storchaka
    Copy link
    Member

    Here is a patch against 2.7 that makes datetime pickling portable.

    It doesn't solve problem with existing pickled data, but at least it allows to convert existing pickled data with 2.7 to portable format.

    @serhiy-storchaka
    Copy link
    Member

    Here is alternative patch for 2.7. It makes datetime pickling produce the same data as in 3.x.

    The side effect of this approach: it makes datetime pickling incompatible with Unicode disabled builds of 2.x.

    @tanzerswingcoat
    Copy link
    Mannequin

    tanzerswingcoat mannequin commented Oct 15, 2015

    IMNSHO, the problem lies in the Python 3 pickle.py and it is **not** restricted to datetime instances
    (for a detailed rambling see http://c-tanzer.at/en/breaking_py2_pickles_in_py3.html) .

    In Python 2, 8-bit strings are used for text and for binary data. Well designed applications will use unicode for all text, but Python 2 itself forces some text to be 8-bit string, e.g., names of attributes, classes, and functions. In other words, **any 8-bit strings explicitly created by such an application will contain binary data.**

    In Python 2, pickle.dump uses BINSTRING (and SHORT_BINSTRING) for 8-bit strings; Python 3 uses BINBYTES (and SHORT_BINBYTES) instead.

    In Python 3, pickle.load should handle BINSTRING (and SHORT_BINSTRING) like this:

    • convert ASCII values to str

    • convert non-ASCII values to bytes

    bytes is Python 3's equivalent to Python 2's 8-bit string!

    It is only because of the use of 8-bit strings for Python 2 names that the mapping to str is necessary but all such names are guaranteed to be ASCII!

    I would propose to change load_binstring and load_short_binstring to call a function like::

        def _decode_binstring(self, value):
            # Used to allow strings from Python 2 to be decoded either as
            # bytes or Unicode strings.  This should be used only with the
            # BINSTRING and SHORT_BINSTRING opcodes.
            if self.encoding != "bytes":
                try :
                    return value.decode("ASCII")
                except UnicodeDecodeError:
                    pass
            return value

    instead of the currently called _decode_string.

    @abalkin
    Copy link
    Member

    abalkin commented Oct 15, 2015

    Christian,

    I don't think your solution will work for date/time/datetime pickles. There are many values for which pickle payload consists of bytes within 0-127 range. IIUC, you propose to decode those to Python 3 strings using ASCII encoding. This will in turn require accepting str type in date/time/datetime constructors.

    @tanzerswingcoat
    Copy link
    Mannequin

    tanzerswingcoat mannequin commented Oct 16, 2015

    Alexander Belopolsky wrote at Thu, 15 Oct 2015 17:56:42 +0000:

    I don't think your solution will work for date/time/datetime pickles.
    There are many values for which pickle payload consists of bytes
    within 0-127 range.

    Hmmmm.

    IIUC, you propose to decode those to Python 3
    strings using ASCII encoding.

    Yes. There are too many BINSTRING instances that need to be Python 3
    strings.

    This will in turn require accepting str
    type in date/time/datetime constructors.

    These datetime... constructors are strange beasts already.

    The documentation says that three integer arguments are required for
    datetime.datetime but it accepts a single bytes argument anyway. I
    agree that it would be much nicer if there was a
    datetime.datetime.load method instead. Unfortunately, that would
    require Guido's time machine to go back all the way to 2003 (at least).

    So yes, the only practical solution is to accept a single str typed
    argument (as long as it is ASCII only). An alternative would be to add
    a dispatch table for loading functions to Python 3's pickle that would
    be used by load_global. That would add indirection for the datetime
    constructors but would allow support for other types requiring
    arguments of type bytes.

    The change I proposed in http://bugs.python.org/issue22005#msg253042
    to fix the handling of binary 8-bit strings is still necessary.

    To summarize:

    IMHO the solution needs to be implemented in Python 3 — otherwise
    pickles with binary strings created by Python 2.x cannot be loaded in
    Python 3. Changing the pickle implementation of Python 2 doesn't fix
    existing pickles and couldn't fix the general problem of binary
    strings, anyway.

    @tanzerswingcoat
    Copy link
    Mannequin

    tanzerswingcoat mannequin commented Apr 14, 2016

    This issue is getting old. Is there any way to solve this for Python 3.6?

    @gpshead
    Copy link
    Member

    gpshead commented Mar 2, 2017

    TL;DR - Just one more example of why nobody should *ever* use pickle under any circumstances. It is useless for data that is not transient for consumption by the same exact versions of all software that created it.

    Patches against 2.7 are not useful here. Either we write a unpickle deserializer for python 2 datetime pickles that works for all existing previous datatime pickled data formats from Python 3. Or we close this as rejected because the data formats are rightly incompatible as the in-process object states are incompatible between the two versions.

    If you want to serialize something, use a language agnostic data format - ideally one with a defined schema. Never pickle.

    Advice for those who have stored such data in Python 2 pickles: Write a Python 2 program to read your data and rewrite it in a portable data format that has nothing to do with pickle. Anything else is a gross hack.

    @serhiy-storchaka
    Copy link
    Member

    NumPy starves from the same issue. In NumPy this problem was solved by requiring encoding='latin1' passed to unpickler. It makes sense to use the same approach for datetime classes.

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life 3.8 only security fixes labels Dec 6, 2018
    @serhiy-storchaka
    Copy link
    Member

    New changeset 8452ca1 by Serhiy Storchaka in branch 'master':
    bpo-22005: Fixed unpickling instances of datetime classes pickled by Python 2. (GH-11017)
    8452ca1

    @serhiy-storchaka
    Copy link
    Member

    New changeset 0d5730e by Serhiy Storchaka in branch '3.7':
    [3.7] bpo-22005: Fixed unpickling instances of datetime classes pickled by Python 2. (GH-11017) (GH-11022)
    0d5730e

    @serhiy-storchaka
    Copy link
    Member

    New changeset 19f6e83 by Serhiy Storchaka (Miss Islington (bot)) in branch '3.6':
    bpo-22005: Fixed unpickling instances of datetime classes pickled by Python 2. (GH-11017) (GH-11022) (GH-11024)
    19f6e83

    @serhiy-storchaka
    Copy link
    Member

    New changeset 1133a8c by Serhiy Storchaka in branch 'master':
    bpo-22005: Fix condition for unpickling a date object. (GH-11025)
    1133a8c

    @pganssle
    Copy link
    Member

    pganssle commented Dec 7, 2018

    I'm not sure I agree with how this was resolved. We're adding complexity to the datetime unpickler to support unpickling pickles created in Python 2 in Python 3? I also don't really understand the encoding parts of it, but it smells very fishy to me.

    I agree with Gregory here, pickle has so many other problems when used between versions of Python that it's simply not useful for cross-version serialization. It is useful for things like inter-process communication between two trusted instances of Python programs running the same version.

    Also, what is the plan here for 2020+? Do we remove this hack for Python 3.9, or are we stuck with it indefinitely?

    @serhiy-storchaka
    Copy link
    Member

    This is the same hack as in NumPy, so we are at least consistent here. I think we have to keep it some time after 2020, maybe to 2025.

    @pganssle
    Copy link
    Member

    pganssle commented Dec 7, 2018

    @serhiy Any chance we can roll these back before the release so that they can have some time for discussion? I have serious concerns about having to support some Python 2/3 compatibility hack in datetime for the next 6 years. If this is worth doing at all, I think it can safely wait until the next release.

    @serhiy-storchaka
    Copy link
    Member

    This issue is already open for a long time. There is a problem which can not be worked around from the user side. I want to get it solved in 3.6, and today is the last chance for this. This is important for migrating from Python 2 to Python 3. You can open a discussion on Python-Dev, and if there will be significant opposition, this change can be reverted before releasing the final version of 3.6.8.

    @pganssle
    Copy link
    Member

    pganssle commented Dec 7, 2018

    I do not care enough about this to fight about it.

    The issue has been open long enough that I do not think it justified the urgency of rushing through a patch just before the release and merging without review, but now that it is in the release of multiple versions, I think we may be stuck with it.

    This *is* something that users can work around by not abusing pickle in this way and instead using a proper cross-platform serialization format. I realize that that makes it *more difficult* for some people to do so, but as Gregory points out, these people are doing dangerous stuff that will break in a way that we are not going to be willing or able to fix at some point *anyway*.

    @tanzerswingcoat
    Copy link
    Mannequin

    tanzerswingcoat mannequin commented Dec 9, 2018

    Paul Ganssle wrote at Fri, 07 Dec 2018 17:22:36 +0000:

    > Gregory P. Smith (gregory.p.smith) 2017-03-02 18:57
    > TL;DR - Just one more example of why nobody should *ever* use pickle
    > under any circumstances. It is useless for data that is not transient
    > for consumption by the same exact versions of all software that
    > created it.

    This *is* something that users can work around by not abusing pickle
    in this way and instead using a proper cross-platform serialization
    format. I realize that that makes it *more difficult* for some people
    to do so, but as Gregory points out, these people are doing dangerous
    stuff that will break in a way that we are not going to be willing or
    able to fix at some point *anyway*.

    This is completely and utterly wrong, to put it mildly.

    The official documentation of the pickle module states (I checked 2.7
    and 3.7):

    The pickle serialization format is guaranteed to be backwards
    compatible across Python releases.
    

    Considering that this issue is 4.5 years old, one would assume that the
    pickle documentation would have been changed in the meantime if
    Gregory's and Paul's view matched reality.

    But my or your personal views about the usability of pickle don't
    matter anyway. There are too many libraries and applications that have
    been using pickle for many years.

    I personally know about this kind of usage in applications since 1998.
    In that particular case, the pickled information resides on machines
    owned by the customers of the applications and **must** be readable by
    any new version of the application no matter how old the file
    containing the pickle is. Rewriting history by some Python developers
    is not going to impress the companies involved!

    Have a nice day!

    @gpshead
    Copy link
    Member

    gpshead commented Dec 9, 2018

    New changeset e328753 by Gregory P. Smith in branch 'master':
    bpo-22005: Document the reality of pickle compatibility. (GH-11054)
    e328753

    @miss-islington
    Copy link
    Contributor

    New changeset 331bfa4 by Miss Islington (bot) in branch '3.7':
    bpo-22005: Document the reality of pickle compatibility. (GH-11054)
    331bfa4

    @gpshead
    Copy link
    Member

    gpshead commented Dec 9, 2018

    It is fundamentally impossible for pickled data to magically cross the 2 and 3 language boundary unscathed.

    The basic str/bytes/unicode types in the language changed meaning. Code must be written manually by the data owners to fix that up based on what the types and encodings should actually be in various places given the language version the data is being read into.

    The code in the PRs for this bug appears to do that in the requisite required hacky manner for stored datetime instances.

    This fact isn't new. It happened 10 years ago with the release of Python 3.0. The documentation is not a contract. I'm fixing it to mention this.

    @gpshead
    Copy link
    Member

    gpshead commented Dec 9, 2018

    Serhiy: should this one be marked fixed?

    @serhiy-storchaka
    Copy link
    Member

    With Gregory's addition I think this issue can be considered fixed. Thank you Gregory.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants