Title: distutils is not reproducible
Type: Stage: patch review
Components: Library (Lib) Versions: Python 3.8
Status: open Resolution:
Dependencies: 31377 34093 Superseder:
Assigned To: Nosy List: benjamin.peterson, bmwiedemann, inada.naoki, mcepl, sascha_silbe, vstinner, zbysz
Priority: normal Keywords: patch

Created on 2018-07-03 15:46 by vstinner, last changed 2019-03-15 08:58 by bmwiedemann.

Pull Requests
URL Status Linked Edit
PR 8057 closed vstinner, 2018-07-03 15:47
PR 8226 open inada.naoki, 2018-07-10 12:23
Messages (8)
msg320988 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-03 15:46
Follow-up of bpo-29708: openSUSE uses a downstream patch for distutils, distutils-reproducible-compile.patch, to make builds reproducible. I converted the patch into a PR: PR 8057.

Naoki INADA wrote:
Currently, marshal uses the refcnt to decide whether to use w_ref or not. Some immutable objects (especially long and str) can be cached and reused, which may affect the refcnt when byte-compiling.

I think we should use a more deterministic way instead of the refcnt. Maybe count all constants in the module before marshalling, as we do for co_consts and co_names when compiling a function.
As a bonus, it may also reduce resource usage by merging constants across functions.
(e.g. ('self',) co_varnames and (None,) co_consts)
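
A rough sketch of the counting idea, not the proposed marshal change itself: it walks a module's code objects and counts how often each constant appears across all functions. The helper name `count_constants` and the sample source are hypothetical and used only for illustration.

```python
import types
from collections import Counter


def count_constants(code, counts=None):
    """Count constants across a code object and all code objects nested in it.

    Keys are (type, value) pairs so that 1, 1.0 and True are counted
    separately. Nested containers such as (1,) and (1.0,) still share a key;
    a real implementation would need a stricter notion of identity.
    """
    if counts is None:
        counts = Counter()
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            count_constants(const, counts)  # recurse into nested functions
        else:
            counts[(type(const), const)] += 1
    return counts


SOURCE = "def f():\n    return 'decimal'\n\ndef g():\n    return 'decimal'\n"
module_code = compile(SOURCE, "<example>", "exec")
counts = count_constants(module_code)
print(counts[(str, "decimal")])
```

A count derived this way depends only on the source being compiled, not on which objects happen to be cached in the running interpreter.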

Serhiy Storchaka added:
I think we need to understand the issue better before committing changes. Once we find the source of the instability of file names, we can find other similar sources and make them stable too. For example, if the source is listdir() or glob(), we can consider sorting the results of all listdir() or glob() calls in distutils and related methods.

On the other hand, if the problem is with reference counters in marshal, we can change the marshal module instead.
msg320990 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-03 15:47
Copy of first message:
e.g. python-simplejson has one-bit diffs in .pyc files

in python3-simplejson.rpm we get
-00004e50  68 6f 72 5f 5f da 07 64  65 63 69 6d 61 6c 72 0c  |hor__..decimalr.|
+00004e50  68 6f 72 5f 5f 5a 07 64  65 63 69 6d 61 6c 72 0c  |hor__Z.decimalr.|

in python3-simplejson-test.rpm we get the opposite change
-00000580  72 13 00 00 00 5a 07 64  65 63 69 6d 61 6c 72 03  |r....Z.decimalr.|
+00000580  72 13 00 00 00 da 07 64  65 63 69 6d 61 6c 72 03  |r......decimalr.|

and it seems to be related to filesystem ordering, since it built reproducibly
when using a filesystem with sorted readdir
using disorderfs via reproducible-faketools-filesys from
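
The one-bit difference in the hexdumps above (0x5a versus 0xda) matches marshal's FLAG_REF bit (0x80) on the type byte 'Z' (a short interned ASCII string), followed by the length 07 and the text "decimal". A minimal sketch for decoding such a byte; the helper name `describe_type_byte` is hypothetical:

```python
import marshal
import sys

FLAG_REF = 0x80  # high bit of the marshal type byte


def describe_type_byte(byte):
    """Split a marshal type byte into (type code character, FLAG_REF set?)."""
    return chr(byte & 0x7F), bool(byte & FLAG_REF)


print(describe_type_byte(0x5A))  # ('Z', False)
print(describe_type_byte(0xDA))  # ('Z', True)

# The same string always gets the same type code, but whether FLAG_REF is
# set can depend on the state of the interpreter doing the dump.
data = marshal.dumps(sys.intern("decimal"))
print(describe_type_byte(data[0]), data[1], data[2:])
```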
msg320991 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-07-03 15:50
I agree that we should fix the underlying issue (marshal) rather than papering over it by sorting. In fact, we should have a test that compiles a bunch of pycs in random orders and checks whether they come out the same.
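
Such a test could look roughly like the following sketch. It is not an existing CPython test: the sample sources and the helper names `compile_in_order` and `orders_that_differ` are made up for illustration, and a real test would probably run each ordering in a fresh interpreter, since cached objects in a single process can hide or change the effect.

```python
import hashlib
import itertools
import py_compile
import tempfile
from pathlib import Path

# Hypothetical sample modules that share a string constant.
SOURCES = {
    "a.py": "import decimal\nname = 'decimal'\n",
    "b.py": "def f(decimal):\n    return decimal\n",
}


def compile_in_order(order, sources=SOURCES):
    """Byte-compile the sources in the given order; return name -> sha256 of the pyc."""
    digests = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name in order:
            src = Path(tmp) / name
            src.write_text(sources[name])
            pyc = Path(tmp) / (name + "c")
            py_compile.compile(
                str(src),
                cfile=str(pyc),
                dfile=name,  # keep the temporary path out of the pyc
                doraise=True,
                # hash-based pycs do not embed the source mtime
                invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
            )
            digests[name] = hashlib.sha256(pyc.read_bytes()).hexdigest()
    return digests


def orders_that_differ(sources=SOURCES):
    """Return the compile orders whose pycs differ from the sorted-order baseline."""
    baseline = compile_in_order(sorted(sources), sources)
    return [
        order
        for order in itertools.permutations(sources)
        if compile_in_order(order, sources) != baseline
    ]


print(orders_that_differ())
```

An empty list means every ordering produced identical pyc files for these inputs; any entry names an ordering that changed the output.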
msg321383 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2018-07-10 12:14
Is this issue only for the known marshal issue?
Or is it for all reproducibility issues in distutils, including unknown ones?
msg321408 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-07-11 04:39
We should probably discuss the marshal issue in the preëxisting #31377.

I'm not sure if "distutils is not reproducible" is a larger issue than "pyc compilation is not reproducible". This issue could be a meta issue for either.
msg321432 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-11 10:33
> Is this issue only for the known marshal issue?

IMHO the order in which .pyc files are created on disk also matters. It changes the result of "os.listdir()", and some applications may rely on the order returned by an unsorted os.listdir(). Calling sorted() seems simple and harmless compared to the benefit.
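
A minimal sketch of the sorting approach, assuming a small wrapper (the name `sorted_listdir` is hypothetical and not part of distutils):

```python
import os
import tempfile


def sorted_listdir(path):
    """Return directory entries in a stable, lexicographic order."""
    return sorted(os.listdir(path))


with tempfile.TemporaryDirectory() as tmp:
    for name in ("decoder.py", "__init__.py", "encoder.py"):
        open(os.path.join(tmp, name), "w").close()
    print(os.listdir(tmp))      # order depends on the filesystem
    print(sorted_listdir(tmp))  # ['__init__.py', 'decoder.py', 'encoder.py']
```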
msg321434 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2018-07-11 10:37
OK, I created a sub-issue for pyc.
msg337975 - (view) Author: Bernhard M. Wiedemann (bmwiedemann) * Date: 2019-03-15 08:58
Unreproducible .pyc files are still one of the major headaches in my work on openSUSE reproducible builds.

There is also one aspect where i586 builds end up with different .pyc files than x86_64 builds. We then randomly choose one of them for our "noarch" Python module packages and hope it works everywhere (including on arm and s390 architectures).

So is someone working towards a concept that makes it possible to create the same .pyc files anywhere?
Can I help with something there?
Is there an ETA?
Date User Action Args
2019-03-15 08:58:26 bmwiedemann set nosy: + bmwiedemann
messages: + msg337975
2019-03-06 15:46:44 zbysz set nosy: + zbysz
2018-11-13 13:29:54 sascha_silbe set nosy: + sascha_silbe
2018-07-11 10:37:20 inada.naoki set dependencies: + remove *_INTERNED opcodes from marshal, Reproducible pyc: FLAG_REF is not stable.
messages: + msg321434
2018-07-11 10:33:20 vstinner set messages: + msg321432
2018-07-11 04:39:09 benjamin.peterson set messages: + msg321408
2018-07-10 12:23:27 inada.naoki set pull_requests: + pull_request7764
2018-07-10 12:14:36 inada.naoki set nosy: + inada.naoki
messages: + msg321383
2018-07-04 23:27:10 mcepl set nosy: + mcepl
2018-07-03 15:50:22 benjamin.peterson set nosy: + benjamin.peterson
messages: + msg320991
2018-07-03 15:47:56 vstinner set messages: + msg320990
2018-07-03 15:47:04 vstinner set keywords: + patch
stage: patch review
pull_requests: + pull_request7677
2018-07-03 15:46:25 vstinner create