classification
Title: distutils is not reproducible
Type:
Stage: patch review
Components: Library (Lib)
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies: 31377 34093
Superseder:
Assigned To:
Nosy List: benjamin.peterson, inada.naoki, mcepl, sascha_silbe, vstinner
Priority: normal
Keywords: patch

Created on 2018-07-03 15:46 by vstinner, last changed 2018-11-13 13:29 by sascha_silbe.

Pull Requests
URL | Status | Linked
PR 8057 | closed | vstinner, 2018-07-03 15:47
PR 8226 | open | inada.naoki, 2018-07-10 12:23
Messages (7)
msg320988 - Author: STINNER Victor (vstinner) (Python committer) Date: 2018-07-03 15:46
Follow-up to bpo-29708: OpenSUSE carries a downstream patch for distutils, distutils-reproducible-compile.patch, to fix https://bugzilla.opensuse.org/show_bug.cgi?id=1049186. I converted the patch into a PR: PR 8057.

Naoki INADA wrote:
"""
Currently, marshal uses the reference count (refcnt) to decide whether to use w_ref. Some immutable objects (especially long and str) can be cached and reused, which may affect the refcnt during byte-compilation.

I think we should use a more deterministic way instead of refcnt. Maybe count all constants in the module before marshalling, as we do for co_consts and co_names when compiling a function.
As a bonus, it may also reduce resource usage by merging constants across functions
(e.g. ('self',) co_varnames and (None,) co_consts).
"""
https://github.com/python/cpython/pull/8057#issuecomment-402065657
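
A minimal sketch of the reference mechanism discussed above, assuming a CPython 3.4 or later where marshal format version 3 and up can emit back-references for shared objects. The helper name `marshal_sizes` is hypothetical. It only shows that an object reachable twice is written once in the current format; it does not reproduce the refcnt-dependent flag itself.

```python
import marshal


def marshal_sizes(obj):
    """Return (size without references, size with references) for obj."""
    # Format version 2 predates object references, so shared objects
    # are serialized once per occurrence.
    without_refs = marshal.dumps(obj, 2)
    # The default (current) format flags the first occurrence and
    # writes a short back-reference for later ones.
    with_refs = marshal.dumps(obj)
    return len(without_refs), len(with_refs)


# Build the string at runtime so it is an ordinary, non-constant object.
shared = "".join(["decimal"] * 200)
payload = (shared, shared)

old_size, new_size = marshal_sizes(payload)
print(old_size, new_size)

# The data round-trips regardless of how references were encoded.
assert marshal.loads(marshal.dumps(payload)) == payload
```

Whether the first occurrence of an object gets flagged at all is the part that, per the comment above, depends on its reference count at dump time.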

Serhiy Storchaka added:
"""
I think we need to understand the issue better before committing changes. Once we have found the source of the instability in file names, we can look for other similar sources and make them stable too. For example, if the source is listdir() or glob(), we can consider sorting the results of all listdir() and glob() calls in distutils and related methods.

On the other hand, if the problem is with reference counters in marshal, we can change the marshal module instead.
"""
https://github.com/python/cpython/pull/8057#issuecomment-402198390
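
A short sketch of the sorting idea mentioned above, assuming the instability comes from directory enumeration order. The helper `stable_listing` is a hypothetical name and is not taken from the distutils patch.

```python
import glob
import os


def stable_listing(directory, pattern="*.py"):
    """Return matching file names in a filesystem-independent order."""
    # os.listdir() and glob.glob() return entries in arbitrary order,
    # which can vary between filesystems; sorting removes that variation.
    matches = glob.glob(os.path.join(glob.escape(directory), pattern))
    return sorted(os.path.basename(path) for path in matches)
```

Any code that byte-compiles files in the order returned by such a helper would then process them in the same order on every filesystem.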
msg320990 - Author: STINNER Victor (vstinner) (Python committer) Date: 2018-07-03 15:47
Copy of https://bugzilla.opensuse.org/show_bug.cgi?id=1049186 first message:
"""
e.g. python-simplejson has one-bit diffs in .pyc files
See
http://rb.zq1.de/compare.factory-20170713/python-simplejson-compare.out


in python3-simplejson.rpm we get
-00004e50  68 6f 72 5f 5f da 07 64  65 63 69 6d 61 6c 72 0c  |hor__..decimalr.|
+00004e50  68 6f 72 5f 5f 5a 07 64  65 63 69 6d 61 6c 72 0c  |hor__Z.decimalr.|

in python3-simplejson-test.rpm we get the opposite change
-00000580  72 13 00 00 00 5a 07 64  65 63 69 6d 61 6c 72 03  |r....Z.decimalr.|
+00000580  72 13 00 00 00 da 07 64  65 63 69 6d 61 6c 72 03  |r......decimalr.|


and it seems to be related to filesystem ordering, since it built reproducibly
when using a filesystem with sorted readdir
using disorderfs via reproducible-faketools-filesys from
https://build.opensuse.org/package/show/home:bmwiedemann:reproducible/reproducible-faketools
"""
https://bugzilla.opensuse.org/show_bug.cgi?id=1049186#c0
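
The differing bytes in the hexdump above, 0xda and 0x5a, differ only in the high bit. A sketch of how to read them, assuming the type codes defined in CPython's Python/marshal.c (FLAG_REF is 0x80 and 'Z' is the short interned ASCII string type); the helper name `split_type_byte` is hypothetical.

```python
import marshal

FLAG_REF = 0x80  # high bit of a marshal type byte, as defined in marshal.c


def split_type_byte(byte):
    """Split a marshal type byte into (type code, reference flag)."""
    return chr(byte & ~FLAG_REF & 0xFF), bool(byte & FLAG_REF)


print(split_type_byte(0x5A))
print(split_type_byte(0xDA))

# Both encodings carry the same payload: a length byte (7) and "decimal".
plain = bytes([0x5A, 0x07]) + b"decimal"
flagged = bytes([0xDA, 0x07]) + b"decimal"
assert marshal.loads(plain) == marshal.loads(flagged) == "decimal"
```

So both .pyc variants load to the same string; only the flag recording whether the object may be referenced again differs.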
msg320991 - Author: Benjamin Peterson (benjamin.peterson) (Python committer) Date: 2018-07-03 15:50
I agree that we should fix the underlying issue (marshal) rather than papering over it by sorting. In fact, we should have a test that compiles a bunch of pycs in random orders and checks whether they come out the same.
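
A rough sketch of such a test, assuming Python 3.7 or later for hash-based .pyc files (which keep the source mtime out of the header). The function name `find_unstable_pycs` and its parameters are hypothetical. It compiles in-process, so a stricter version would run each ordering in a fresh interpreter.

```python
import pathlib
import py_compile
import random


def find_unstable_pycs(sources, out_dir, rounds=5, seed=0):
    """Compile sources in shuffled orders; return names whose bytes changed."""
    rng = random.Random(seed)
    out_dir = pathlib.Path(out_dir)
    baseline = {}
    unstable = set()
    for round_no in range(rounds):
        order = list(sources)
        rng.shuffle(order)
        round_dir = out_dir / f"round{round_no}"
        round_dir.mkdir(parents=True, exist_ok=True)
        for src in order:
            src = pathlib.Path(src)
            cfile = round_dir / (src.stem + ".pyc")
            py_compile.compile(
                str(src),
                cfile=str(cfile),
                doraise=True,
                invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
            )
            data = cfile.read_bytes()
            # The first round defines the reference bytes for each file.
            if baseline.setdefault(src.name, data) != data:
                unstable.add(src.name)
    return sorted(unstable)
```

An empty result means every file compiled to identical bytes in each ordering that was tried; it does not prove reproducibility in general.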
msg321383 - Author: INADA Naoki (inada.naoki) (Python committer) Date: 2018-07-10 12:14
Is this issue only for the known marshal issue?
Or is it for all reproducibility issues in distutils, including unknown ones?
msg321408 - Author: Benjamin Peterson (benjamin.peterson) (Python committer) Date: 2018-07-11 04:39
We should probably discuss the marshal issue in the preëxisting #31377.

I'm not sure if "distutils is not reproducible" is a larger issue than "pyc compilation is not reproducible". This issue could be a meta issue for either.
msg321432 - Author: STINNER Victor (vstinner) (Python committer) Date: 2018-07-11 10:33
> Is this issue only for the known marshal issue?

IMHO the order in which .pyc files are created on disk also matters. It changes the result of "os.listdir()", and some applications may rely on the unsorted order returned by os.listdir(). Calling sorted() seems simple and harmless compared to the benefit.
msg321434 - Author: INADA Naoki (inada.naoki) (Python committer) Date: 2018-07-11 10:37
OK, I created a sub-issue for pyc.
History
Date | User | Action | Args
2018-11-13 13:29:54 | sascha_silbe | set | nosy: + sascha_silbe
2018-07-11 10:37:20 | inada.naoki | set | dependencies: + remove *_INTERNED opcodes from marshal, Reproducible pyc: FLAG_REF is not stable.; messages: + msg321434
2018-07-11 10:33:20 | vstinner | set | messages: + msg321432
2018-07-11 04:39:09 | benjamin.peterson | set | messages: + msg321408
2018-07-10 12:23:27 | inada.naoki | set | pull_requests: + pull_request7764
2018-07-10 12:14:36 | inada.naoki | set | nosy: + inada.naoki; messages: + msg321383
2018-07-04 23:27:10 | mcepl | set | nosy: + mcepl
2018-07-03 15:50:22 | benjamin.peterson | set | nosy: + benjamin.peterson; messages: + msg320991
2018-07-03 15:47:56 | vstinner | set | messages: + msg320990
2018-07-03 15:47:04 | vstinner | set | keywords: + patch; stage: patch review; pull_requests: + pull_request7677
2018-07-03 15:46:25 | vstinner | create |