classification
Title: Reproducible pyc: frozenset is not serialized in a deterministic order
Type: Stage:
Components: Interpreter Core Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: FFY00, jefferyto, rhettinger, vstinner
Priority: normal Keywords:

Created on 2019-07-15 15:05 by vstinner, last changed 2021-04-16 04:44 by yan12125.

Messages (7)
msg347969 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-07-15 15:05
See bpo-29708 meta issue and https://reproducible-builds.org/ for reproducible builds.

pyc files are not fully reproducible yet: frozenset items are not serialized in a deterministic order

One solution would be to modify marshal to sort frozenset items before serializing them. The issue is how to handle items which cannot be compared. Example:

>>> l=[float("nan"), b'bytes', 'unicode']
>>> l.sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'bytes' and 'float'

One workaround for types which cannot be compared is to use the type name in the key used to compare items:

>>> l.sort(key=lambda x: (type(x).__name__, x))
>>> l
[b'bytes', nan, 'unicode']

Note: comparison between bytes and str raises a BytesWarning exception when using python3 -bb.

Second problem: how to handle exceptions when comparison raises an error anyway?


Another solution would be to use the PYTHONHASHSEED environment variable. For example, if SOURCE_DATE_EPOCH is set, PYTHONHASHSEED would be set to 0. This option is not my favorite because it disables a security fix against denial of service on dict and set:
https://python-security.readthedocs.io/vuln/hash-dos.html

--

Previous discussions on reproducible frozenset:

* https://mail.python.org/pipermail/python-dev/2018-July/154604.html
* https://bugs.python.org/issue34093#msg321523

See also bpo-34093: "Reproducible pyc: FLAG_REF is not stable" and PEP 552 "Deterministic pycs".
msg366124 - (view) Author: Chih-Hsuan Yen (yan12125) * Date: 2020-04-10 13:22
issue34722 also talks about frozenset, nondeterministic order and sorting. Maybe this ticket and that one are for the same issue?
msg391118 - (view) Author: Filipe Laíns (FFY00) * Date: 2021-04-15 02:38
Normal sets have the same issue, see bpo-43850.

Would it be reasonable to make it so that sets are always created with the definition order? Looking at the set implementation, this seems perfectly possible.
msg391156 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-04-15 23:49
> Would it be reasonable to make it so that sets are 
> always created with the definition order?

No, it would not.  We would also have to maintain order across set operations such as intersection which which would become dramatically more expensive if they had to maintain order.  For example intersecting a million element set with a ten element set always takes ten steps regardless of the order of arguments, but to maintain order of the left hand operand could take a hundred times more work.
msg391157 - (view) Author: Filipe Laíns (FFY00) * Date: 2021-04-16 00:10
> No, it would not.  We would also have to maintain order across set operations such as intersection which which would become dramatically more expensive if they had to maintain order.  For example intersecting a million element set with a ten element set always takes ten steps regardless of the order of arguments, but to maintain order of the left hand operand could take a hundred times more work.

Can these operations happen during bytecode generation? I am fairly new to these internals so my understanding is not great. During bytecode generation is can code that performs such operations run?
msg391158 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-04-16 00:12
s/hundred/hundred thousand/
msg391159 - (view) Author: Filipe Laíns (FFY00) * Date: 2021-04-16 00:13
s/is can/can/
History
Date User Action Args
2021-04-16 04:44:50yan12125setnosy: - yan12125
2021-04-16 00:13:23FFY00setmessages: + msg391159
2021-04-16 00:12:47rhettingersetmessages: + msg391158
2021-04-16 00:10:53FFY00setmessages: + msg391157
2021-04-15 23:49:04rhettingersetnosy: + rhettinger
messages: + msg391156
2021-04-15 02:38:55FFY00setnosy: + FFY00
messages: + msg391118
2021-02-03 18:09:23steve.dowerunlinkissue29708 dependencies
2020-10-22 20:39:28eric.araujolinkissue29708 dependencies
2020-04-10 13:22:03yan12125setnosy: + yan12125
messages: + msg366124
2020-04-08 12:47:20jefferytosetnosy: + jefferyto
2019-07-15 15:05:11vstinnercreate