
classification
Title: distutils is not reproducible
Type: Stage: patch review
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: 31377 34093 Superseder:
Assigned To: Nosy List: benjamin.peterson, bmwiedemann, jefferyto, methane, petr.viktorin, sascha_silbe, vstinner, yan12125, zbysz
Priority: normal Keywords: patch

Created on 2018-07-03 15:46 by vstinner, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked
PR 8057 closed vstinner, 2018-07-03 15:47
PR 8226 open methane, 2018-07-10 12:23
Messages (9)
msg320988 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-03 15:46
Follow-up to bpo-29708: openSUSE uses a downstream patch for distutils, distutils-reproducible-compile.patch, to fix https://bugzilla.opensuse.org/show_bug.cgi?id=1049186. I converted the patch into a PR: PR 8057.

Naoki INADA wrote:
"""
Currently, marshal uses refcnt to decide whether to use w_ref or not. Some immutable objects (especially long and str) can be cached and reused, which may affect refcnt during byte compilation.

I think we should use a more deterministic way instead of refcnt. Maybe count all constants in the module before marshalling, like we do for co_consts and co_names when compiling a function.
As a bonus, it may reduce resource usage too by merging constants over functions.
(e.g. ('self',) co_varnames and (None,) co_consts)
"""
https://github.com/python/cpython/pull/8057#issuecomment-402065657

Serhiy Storchaka added:
"""
I think we need to understand the issue better before committing changes. Once we find the source of the instability of file names, we can find other similar sources and make them stable too. For example, if the source is listdir() or glob(), we can consider sorting the results of all listdir() and glob() calls in distutils and related methods.

On the other hand, if the problem is with reference counters in marshal, we can change the marshal module instead.
"""
https://github.com/python/cpython/pull/8057#issuecomment-402198390
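
To illustrate the point quoted above about refcnt: an object's reference count reflects everything in the interpreter that happens to hold it, not only the objects being serialized. The snippet below is a minimal sketch that only observes sys.getrefcount(); it does not call into marshal, and the helper name refcount_delta is made up for this illustration.

```python
import sys


def refcount_delta():
    """Return how much an unrelated container changes an object's refcount."""
    # Build the tuple at run time so it is a fresh, ordinary object.
    value = tuple([1, 2, 3])
    before = sys.getrefcount(value)
    holder = [value]  # an unrelated container now also references it
    after = sys.getrefcount(value)
    del holder
    return after - before


print(refcount_delta())
```

Any heuristic keyed on the reference count sees such a change, even though the value being serialized is identical.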
msg320990 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-03 15:47
Copy of https://bugzilla.opensuse.org/show_bug.cgi?id=1049186 first message:
"""
e.g. python-simplejson has one-bit diffs in .pyc files
See
http://rb.zq1.de/compare.factory-20170713/python-simplejson-compare.out


in python3-simplejson.rpm we get
-00004e50  68 6f 72 5f 5f da 07 64  65 63 69 6d 61 6c 72 0c  |hor__..decimalr.|
+00004e50  68 6f 72 5f 5f 5a 07 64  65 63 69 6d 61 6c 72 0c  |hor__Z.decimalr.|

in python3-simplejson-test.rpm we get the opposite change
-00000580  72 13 00 00 00 5a 07 64  65 63 69 6d 61 6c 72 03  |r....Z.decimalr.|
+00000580  72 13 00 00 00 da 07 64  65 63 69 6d 61 6c 72 03  |r......decimalr.|


and it seems to be related to filesystem ordering, since it built reproducibly
when using a filesystem with sorted readdir
using disorderfs via reproducible-faketools-filesys from
https://build.opensuse.org/package/show/home:bmwiedemann:reproducible/reproducible-faketools
"""
https://bugzilla.opensuse.org/show_bug.cgi?id=1049186#c0
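
The hex dumps above differ in a single byte, da versus 5a, directly before the length byte 07 and the text "decimal". The sketch below checks that the two values differ only in bit 0x80. It assumes that this bit corresponds to FLAG_REF in Python/marshal.c and that 'Z' is the type code for a short interned ASCII string; the helper name describe is made up for this illustration.

```python
FLAG_REF = 0x80  # assumption: same value as FLAG_REF in Python/marshal.c


def describe(type_byte):
    """Split a marshal type byte into (type code character, FLAG_REF set?)."""
    return chr(type_byte & ~FLAG_REF), bool(type_byte & FLAG_REF)


old, new = 0xDA, 0x5A     # the differing bytes from the hex dumps
print(hex(old ^ new))     # which bits differ between the two bytes
print(describe(old))
print(describe(new))
```

Under those assumptions, both builds wrote the same string and differ only in whether it was flagged as a candidate for back-references.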
msg320991 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-07-03 15:50
I agree that we should fix the underlying issue (marshal) rather than papering over it by sorting. In fact, we should have a test that compiles a bunch of pycs in random orders and checks whether they're the same or not.
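
A rough sketch of such a test might look like the following. It is not taken from the CPython test suite. The helper names compile_in_order and differing_modules are made up for this illustration, and the choice of hash-based pycs (to keep mtimes out of the output) is an assumption. It also compiles everything in one process, which may or may not trigger the behaviour seen in the openSUSE builds.

```python
import hashlib
import pathlib
import py_compile
import random
import tempfile


def compile_in_order(sources, out_dir):
    """Compile each source into out_dir; return {file name: sha256 of the .pyc}."""
    digests = {}
    for src in sources:
        target = pathlib.Path(out_dir) / (src.stem + ".pyc")
        py_compile.compile(
            str(src), cfile=str(target), doraise=True,
            # Hash-based pycs avoid embedding the source mtime.
            invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
        )
        digests[src.name] = hashlib.sha256(target.read_bytes()).hexdigest()
    return digests


def differing_modules(src_dir, rounds=3, seed=0):
    """Return names of modules whose .pyc changed when the compile order changed."""
    sources = sorted(pathlib.Path(src_dir).glob("*.py"))
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as out:
        baseline = compile_in_order(sources, out)
    differing = set()
    for _ in range(rounds):
        shuffled = sources[:]
        rng.shuffle(shuffled)
        with tempfile.TemporaryDirectory() as out:
            result = compile_in_order(shuffled, out)
        differing |= {n for n in baseline if baseline[n] != result[n]}
    return sorted(differing)
```

An empty list from differing_modules() means no order dependence was observed in that run; it does not prove that none exists.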
msg321383 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-07-10 12:14
Is this issue only for the known marshal issue?
Or is it for all reproducibility issues in distutils, including unknown ones?
msg321408 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-07-11 04:39
We should probably discuss the marshal issue in the preëxisting #31377.

I'm not sure if "distutils is not reproducible" is a larger issue than "pyc compilation is not reproducible". This issue could be a meta issue for either.
msg321432 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-11 10:33
> Is this issue only for the known marshal issue?

IMHO the order in which .pyc files are created on disk also matters. It changes the result of "os.listdir()", and some applications may rely on the unsorted os.listdir() order. sorted() seems simple and harmless compared to the benefit.
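
As a minimal sketch of the sorted() idea (not the actual distutils patch; the helper name stable_listing is made up for this illustration), a directory listing can be made independent of filesystem order like this:

```python
import os
import tempfile


def stable_listing(directory, suffix=".py"):
    """Return matching file names in sorted order, whatever readdir returns."""
    return sorted(n for n in os.listdir(directory) if n.endswith(suffix))


with tempfile.TemporaryDirectory() as d:
    # Create the files in a deliberately unsorted order.
    for name in ("b.py", "a.py", "notes.txt"):
        open(os.path.join(d, name), "w").close()
    print(stable_listing(d))  # ['a.py', 'b.py']
```

Sorting only fixes the order in which files are processed; it does not address the marshal behaviour discussed elsewhere in this issue.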
msg321434 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-07-11 10:37
OK, I created a sub-issue for pyc.
msg337975 - (view) Author: Bernhard M. Wiedemann (bmwiedemann) * Date: 2019-03-15 08:58
Unreproducible .pyc files are still one of the major headaches for my work on openSUSE reproducible builds.

There is also one aspect where i586 builds end up with different .pyc files than x86_64 builds. And then we randomly choose one of them for our "noarch" python module packages and hope they work everywhere (including on arm and s390 architectures).

So is someone working towards a concept that makes it possible to create the same .pyc files anywhere?
Can I help with something there?
Is there an ETA?
msg359595 - (view) Author: Petr Viktorin (petr.viktorin) * (Python committer) Date: 2020-01-08 14:05
> There is also one aspect where i586 builds end up with different .pyc files than x86_64 builds. And then we randomly choose one of them for our "noarch" python module packages and hope they work everywhere (including on arm and s390 architectures).

They are functionally identical, despite not being bit-by-bit identical.
If they do not work everywhere, it's a very serious bug.

> So is someone working towards a concept that makes it possible to create the same .pyc files anywhere?

No, it's a known issue no one is working on.

> Can I help with something there?

Maybe?
The two main culprits are in the marshal serialization algorithm: https://github.com/python/cpython/blob/master/Python/marshal.c
Specifically:
- a heuristic depends on refcount (i.e. state of objects in the entire interpreter, rather than just relationships between serialized objects): https://github.com/python/cpython/blob/33b671e72450bf4b5a946ce0dde6b7fe21150108/Python/marshal.c#L304
- (frozen)sets are serialized in iteration order, which is unpredictable (and determining a predictable order is not trivial): https://github.com/python/cpython/blob/33b671e72450bf4b5a946ce0dde6b7fe21150108/Python/marshal.c#L498

A solution will probably come with an unacceptable performance hit -- it's good to keep generating the .pyc files fast. Two options to overcome that come to mind:
- make reproducibility optional (which would make the testing more cumbersome)
- make an add-on tool to re-serialize an existing .pyc.
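
For the (frozen)set point above, the sketch below shows one possible way to pick a predictable order for simple constants. It is an assumption made for illustration, not something proposed in this issue and not how marshal works; the helper name canonical_order is made up. It only handles elements whose repr() is itself deterministic (str, int, bytes, None); nested frozensets are one reason the general problem is not trivial.

```python
def canonical_order(items):
    """Order set elements by (type name, repr) instead of iteration order."""
    return sorted(items, key=lambda item: (type(item).__name__, repr(item)))


print(canonical_order(frozenset({"b", "a", "c"})))  # ['a', 'b', 'c']
print(canonical_order({2, "a", 1}))                 # [1, 2, 'a']
```

A re-serialization tool would still need to write the elements out in this order and deal with the refcount heuristic separately.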
History
Date User Action Args
2022-04-11 14:59:02 admin set github: 78214
2020-04-10 13:23:09 yan12125 set nosy: + yan12125
2020-04-08 12:50:37 jefferyto set nosy: + jefferyto
2020-02-24 16:35:26 mcepl set nosy: - mcepl
2020-01-08 14:05:06 petr.viktorin set nosy: + petr.viktorin
messages: + msg359595
2019-03-15 08:58:26 bmwiedemann set nosy: + bmwiedemann
messages: + msg337975
2019-03-06 15:46:44 zbysz set nosy: + zbysz
2018-11-13 13:29:54 sascha_silbe set nosy: + sascha_silbe
2018-07-11 10:37:20 methane set dependencies: + remove *_INTERNED opcodes from marshal, Reproducible pyc: FLAG_REF is not stable.
messages: + msg321434
2018-07-11 10:33:20 vstinner set messages: + msg321432
2018-07-11 04:39:09 benjamin.peterson set messages: + msg321408
2018-07-10 12:23:27 methane set pull_requests: + pull_request7764
2018-07-10 12:14:36 methane set nosy: + methane
messages: + msg321383
2018-07-04 23:27:10 mcepl set nosy: + mcepl
2018-07-03 15:50:22 benjamin.peterson set nosy: + benjamin.peterson
messages: + msg320991
2018-07-03 15:47:56 vstinner set messages: + msg320990
2018-07-03 15:47:04 vstinner set keywords: + patch
stage: patch review
pull_requests: + pull_request7677
2018-07-03 15:46:25 vstinner create