classification
Title: unreproducible bytecode: set order depends on random seed for compiled bytecode
Type: behavior
Stage: resolved
Components:
Versions:
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: support reproducible Python builds (View: 29708)
Assigned To:
Nosy List: FFY00, Mark.Shannon, benjamin.peterson, christian.heimes, yselivanov
Priority: normal
Keywords: patch

Created on 2021-04-14 22:31 by FFY00, last changed 2021-04-15 00:33 by benjamin.peterson. This issue is now closed.

Pull Requests
URL Status Linked
PR 25411 closed FFY00, 2021-04-14 22:35
Messages (5)
msg391104 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021-04-14 22:31
Currently, the order of set or frozenset elements when saved to bytecode is dependent on the random seed. This breaks reproducibility.

Example failure from an Arch Linux package: https://reproducible.archlinux.org/api/v0/builds/88454/diffoscope

Let's take an example file, `test_compile.py`:
```python
s = {
    'aaa',
    'bbb',
    'ccc',
    'ddd',
    'eee',
}
```

$ PYTHONHASHSEED=0 python -m compileall --invalidation-mode checked-hash test_compile.py
$ mv __pycache__ __pycache__1
$ PYTHONHASHSEED=1 python -m compileall --invalidation-mode checked-hash test_compile.py

$ diff __pycache__/test_compile.cpython-39.pyc __pycache__1/test_compile.cpython-39.pyc
Binary files __pycache__/test_compile.cpython-39.pyc and __pycache__1/test_compile.cpython-39.pyc differ

$ diff <(xxd __pycache__/test_compile.cpython-39.pyc) <(xxd __pycache__1/test_compile.cpython-39.pyc)
5,6c5,6
< 00000040: 005a 0362 6262 5a03 6464 645a 0361 6161  .Z.bbbZ.dddZ.aaa
< 00000050: 5a03 6363 635a 0365 6565 4e29 01da 0173  Z.cccZ.eeeN)...s
---
> 00000040: 005a 0361 6161 5a03 6363 635a 0364 6464  .Z.aaaZ.cccZ.ddd
> 00000050: 5a03 6565 655a 0362 6262 4e29 01da 0173  Z.eeeZ.bbbN)...s
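
The same effect can be checked without inspecting `.pyc` files. The following is a minimal sketch, not part of the original report: it starts a fresh interpreter per seed and prints the iteration order of the same five-element set. The helper name `iteration_order` is hypothetical, and whether two particular seeds actually produce different orders depends on the Python version.

```python
import os
import subprocess
import sys

# Child program: print the iteration order of the set from test_compile.py.
CHILD = "print(','.join({'aaa', 'bbb', 'ccc', 'ddd', 'eee'}))"


def iteration_order(seed):
    """Return the set's iteration order in a fresh interpreter run with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip().split(",")


if __name__ == "__main__":
    # The hash seed is fixed at interpreter startup, so each seed needs its own process.
    for seed in (0, 1):
        print(seed, iteration_order(seed))
```

A fixed seed gives the same order on every run; only changing the seed can change it.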

I believe the issue is in the marshal module, particularly this line[1]. My simple fix was to create a list from the set, sort it, and iterate over it instead.

[1] https://github.com/python/cpython/blob/00d7abd7ef588fc4ff0571c8579ab4aba8ada1c0/Python/marshal.c#L505
msg391111 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021-04-14 23:30
I just realized my fix is wrong: list.sort raises a TypeError when the set mixes types that cannot be compared with each other. Similarly to other reproducibility fixes, how does skipping the item randomization when SOURCE_DATE_EPOCH is set sound?
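
A small Python-level illustration of the sorting problem, assuming the C fix behaves like `sorted()` on the set's elements. The helper name `try_plain_sort` is made up for this sketch.

```python
def try_plain_sort(items):
    """Sort the items with default ordering, reporting a TypeError instead of raising it."""
    try:
        return sorted(items)
    except TypeError as exc:
        return f"TypeError: {exc}"


# A set of strings sorts fine.
print(try_plain_sort({'bbb', 'aaa'}))

# A set mixing str, int and None cannot be ordered with the default comparison.
print(try_plain_sort({'aaa', 1, None}))
```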
msg391112 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021-04-14 23:42
Never mind, AFAIK the item randomization depends on the hash seed, correct? So the most viable option to me would be a sorting algorithm that takes type into account. Would that be an acceptable solution?
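
One way a type-aware ordering could look, sketched in Python rather than in marshal.c. This is only an illustration of the idea, not the proposed patch: the names `type_aware_key` and `stable_order` are hypothetical, and the key assumes that `repr()` of each element is itself deterministic, which does not hold for nested frozensets.

```python
def type_aware_key(item):
    """Order first by type name, then by repr, so unlike types are never compared directly."""
    return (type(item).__name__, repr(item))


def stable_order(items):
    """Return the items in an order that does not depend on set iteration order."""
    return sorted(items, key=type_aware_key)


# Mixed-type set: a plain sorted() would raise TypeError here.
print(stable_order({'bbb', 1, None, 'aaa', 2.5, b'x'}))
# -> [None, b'x', 2.5, 1, 'aaa', 'bbb']
```

The resulting order is not meaningful numerically (reprs compare as strings), but for reproducibility only determinism matters.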
msg391113 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021-04-15 00:18
Sorry for the spam, I am trying to figure out the best option here, which is hard to do by myself.

IMO it would be reasonable to create set objects with elements in the order they appear in the code, instead of based on the hash. I am not really sure where the code responsible for this is, or whether there are any limitations preventing this from being implemented.

So, my questions are: Would you consider this reasonable? Is there anything I am missing?
If there are no issues, could someone point me to the target code?
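
For reference, the source order of a set display is still available at the AST level, before any set object is built. The sketch below only shows where that information lives; it says nothing about how the compiler or marshal would have to change to use it, and the helper name `set_literal_elements` is made up for illustration.

```python
import ast


def set_literal_elements(source):
    """Return the elements of the first set display in the source, in source order."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Set):
            # node.elts is a list, so it keeps the order written in the code.
            return [ast.literal_eval(element) for element in node.elts]
    return []


print(set_literal_elements("s = {'ccc', 'aaa', 'bbb'}"))
# -> ['ccc', 'aaa', 'bbb']
```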
msg391114 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2021-04-15 00:33
Let's keep any discussion on the preëxisting issue for this.
History
Date User Action Args
2021-04-15 00:33:16 benjamin.peterson set status: open -> closed; resolution: duplicate; messages: + msg391114; superseder: support reproducible Python builds; stage: patch review -> resolved
2021-04-15 00:18:50 FFY00 set messages: + msg391113
2021-04-14 23:42:46 FFY00 set messages: + msg391112
2021-04-14 23:30:21 FFY00 set messages: + msg391111
2021-04-14 22:39:31 FFY00 set nosy: + christian.heimes
2021-04-14 22:35:00 FFY00 set keywords: + patch; stage: patch review; pull_requests: + pull_request24143
2021-04-14 22:31:26 FFY00 create