unreproducible bytecode: set order depends on random seed for compiled bytecode #88016
Currently, the order of set or frozenset elements when saved to bytecode depends on the random hash seed. This breaks reproducibility. Example failure from an Arch Linux package: https://reproducible.archlinux.org/api/v0/builds/88454/diffoscope

Let's take an example file, test_compile.py:

s = {
    'aaa',
    'bbb',
    'ccc',
    'ddd',
    'eee',
}

$ PYTHONHASHSEED=0 python -m compileall --invalidation-mode checked-hash test_compile.py
$ mv __pycache__ __pycache__1
$ PYTHONHASHSEED=1 python -m compileall --invalidation-mode checked-hash test_compile.py
$ diff __pycache__/test_compile.cpython-39.pyc __pycache__1/test_compile.cpython-39.pyc
Binary files __pycache__/test_compile.cpython-39.pyc and __pycache__1/test_compile.cpython-39.pyc differ
$ diff <(xxd __pycache__/test_compile.cpython-39.pyc) <(xxd __pycache__1/test_compile.cpython-39.pyc)
5,6c5,6
< 00000040: 005a 0362 6262 5a03 6464 645a 0361 6161 .Z.bbbZ.dddZ.aaa
< 00000050: 5a03 6363 635a 0365 6565 4e29 01da 0173 Z.cccZ.eeeN)...s

I believe the issue is in the marshal module, particularly this line[1]. My simple fix was to create a list from the set, sort it, and iterate over it instead.

[1] Line 505 in 00d7abd
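
To separate the marshal step from compileall, the same comparison can be sketched directly in Python. This is only an illustration: the helper name `marshal_with_seed` and the `CHILD` snippet are made up for this example, and whether the dumps actually differ depends on the interpreter version being run.

```python
import os
import subprocess
import sys

# Child program: compile the example set literal and print its marshalled
# code object as hex. It runs in a fresh interpreter so that PYTHONHASHSEED
# takes effect (the seed cannot be changed inside a running process).
CHILD = """
import marshal
code = compile("s = {'aaa', 'bbb', 'ccc', 'ddd', 'eee'}", "test_compile.py", "exec")
print(marshal.dumps(code).hex())
"""


def marshal_with_seed(seed):
    """Return the hex dump of the marshalled code object for one hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Collect one dump per seed and count how many distinct byte strings appear.
dumps = {seed: marshal_with_seed(seed) for seed in range(4)}
distinct = len(set(dumps.values()))
print(f"{distinct} distinct marshal dump(s) across {len(dumps)} hash seeds")
```

More than one distinct dump means the marshalled output depends on the hash seed for that interpreter.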

I just realized my fix is wrong because list.sort cannot compare elements of different types. As with other reproducibility fixes, how does skipping the item randomization when SOURCE_DATE_EPOCH is set sound?
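
A minimal illustration of why plain sorting breaks down, using a made-up set of mixed-type constants:

```python
# A set literal may legally mix element types that do not define
# an ordering relative to one another.
mixed = {1, "a", (2, 3), None}

error = None
try:
    sorted(mixed)
except TypeError as exc:
    # e.g. "'<' not supported between instances of ..."; the exact pair
    # of types named in the message depends on iteration order.
    error = exc

print(type(error).__name__)
```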

Never mind, AFAIK that depends on the hash seed, correct? So the most viable option to me would be a sorting algorithm that takes type into account. Would that be an acceptable solution?
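
One way such a type-aware ordering could look, as a rough sketch only: sort the elements by their own marshalled bytes, which always compare as byte strings regardless of the element types. The helper name `stable_order` is hypothetical, and the sketch assumes that `marshal.dumps` output for each individual element is itself stable, which it does not verify (nested frozensets, for instance, would need the same treatment recursively).

```python
import marshal


def stable_order(items):
    """Return the elements sorted by their marshalled byte representation.

    Bytes objects are always mutually comparable, so this avoids the
    TypeError that comparing the elements themselves can raise.
    """
    return sorted(items, key=marshal.dumps)


mixed = frozenset({1, "aaa", 2.5, None, ("x", 1), b"raw"})
ordered = stable_order(mixed)

# The result should not depend on the order in which elements are fed in.
same_when_reversed = ordered == stable_order(list(mixed)[::-1])
print(ordered, same_when_reversed)
```

The design choice here is to order by serialized form rather than by value, so no cross-type comparison rules have to be invented.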

Sorry for the spam, I am trying to figure out the best option here, which is hard to do by myself. IMO it would be reasonable to create set objects with elements in the order they appear in the code, instead of an order based on the hash. I am not really sure where the code responsible for this is, or whether there are any limitations preventing this from being implemented. So, my questions are: Would you consider this reasonable? Is there anything I am missing?

Let's keep any discussion on the preëxisting issue for this.