support reproducible Python builds #73894

Open
bmwiedemann mannequin opened this issue Mar 3, 2017 · 57 comments
Labels
3.10 only security fixes build The build process and cross-build

Comments

@bmwiedemann
Mannequin

bmwiedemann mannequin commented Mar 3, 2017

BPO 29708
Nosy @warsaw, @vstinner, @ericvsmith, @benjaminp, @mcepl, @merwok, @methane, @zooba, @dstufft, @bmwiedemann, @FRidh, @commodo, @mingwandroid, @eli-schwartz, @miss-islington, @jefferyto, @obfusk
PRs
  • bpo-29708: support SOURCE_DATE_EPOCH env var in py_compile (allow for reproducible builds of python packages) #296
  • bpo-29708: allow to force hash-based pycs #5200
  • bpo-29708: Add What's New entries for SOURCE_DATE_EPOCH and py_compile #5306
  • bpo-29708: support SOURCE_DATE_EPOCH for build info #5313
  • bpo-34033: distutils: byte_compile() sort files #8057
  • bpo-34022: Stop forcing of hash-based invalidation with SOURCE_DATE_EPOCH #9607
  • [3.7] bpo-34022: Stop forcing of hash-based invalidation with SOURCE_DATE_EPOCH (GH-9607) #10775
Files
  • python39_2.html: Python 3.9.1 diffoscope report
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2017-03-03.11:36:06.114>
    labels = ['build', '3.10']
    title = 'support reproducible Python builds'
    updated_at = <Date 2021-04-22.17:01:17.438>
    user = 'https://github.com/bmwiedemann'

    bugs.python.org fields:

    activity = <Date 2021-04-22.17:01:17.438>
    actor = 'obfusk'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Build']
    creation = <Date 2017-03-03.11:36:06.114>
    creator = 'bmwiedemann'
    dependencies = []
    files = ['49708']
    hgrepos = []
    issue_num = 29708
    keywords = ['patch']
    message_count = 40.0
    messages = ['288880', '288883', '288889', '288948', '301354', '309394', '309395', '309401', '309870', '309905', '309931', '309972', '310010', '310012', '310292', '310636', '310637', '310652', '310661', '311317', '313312', '313313', '313383', '313384', '313391', '320942', '320989', '321002', '327480', '330623', '347971', '384065', '384066', '384099', '384100', '384104', '384110', '384123', '386272', '391616']
    nosy_count = 18.0
    nosy_names = ['barry', 'vstinner', 'eric.smith', 'benjamin.peterson', 'mcepl', 'eric.araujo', 'sascha_silbe', 'methane', 'steve.dower', 'dstufft', 'bmwiedemann', 'Frederik Rietdijk', 'Alexandru Ardelean', 'Ray Donnelly', 'eschwartz', 'miss-islington', 'jefferyto', 'obfusk']
    pr_nums = ['296', '5200', '5306', '5313', '8057', '9607', '10775']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue29708'
    versions = ['Python 3.10']

    @bmwiedemann
    Mannequin Author

    bmwiedemann mannequin commented Mar 3, 2017

    See https://reproducible-builds.org/ and https://reproducible-builds.org/docs/buy-in/ for why this is a good thing to have in general.

    Fedora, openSUSE and possibly other Linux distributions package .pyc files as part of their binary rpm packages and they are not trivial to drop [1].

    A .pyc header includes the timestamp of the source .py file
    which creates non-reproducible builds when the .py file is touched during build time (e.g. for a version.py).
    As of 2017-02-10 in openSUSE Factory this affected 476 packages (such as python-amqp and python3-Twisted).

    [1] http://lists.opensuse.org/opensuse-packaging/2017-02/msg00086.html
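The effect described above can be observed directly. The following is a minimal sketch, assuming the Python 3.7+ header layout from PEP 552 (magic, flags, source mtime, source size, 4 bytes each); the file name `version.py` and the helper names are illustrative only. It compiles the same source twice with different mtimes and reads back the timestamp stored in each `.pyc` header.

```python
import os
import py_compile
import struct
import tempfile


def pyc_header_mtime(pyc_path):
    """Return the source mtime recorded in a timestamp-based .pyc header."""
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    # Assumed layout (Python 3.7+): magic, flags, source mtime, source size.
    _magic, flags, mtime, _size = struct.unpack("<4sIII", header)
    assert flags == 0, "expected a timestamp-based pyc"
    return mtime


def compile_with_mtime(src, mtime):
    """Set the source file's mtime, then byte-compile it next to the source."""
    os.utime(src, (mtime, mtime))
    return py_compile.compile(
        src,
        cfile=src + "c",
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "version.py")  # hypothetical file touched at build time
    with open(src, "w") as f:
        f.write("VERSION = '1.0'\n")
    first = pyc_header_mtime(compile_with_mtime(src, 1_000_000_000))
    second = pyc_header_mtime(compile_with_mtime(src, 1_000_000_060))

print(first, second)  # identical source, different header bytes
```

The source text is identical in both runs, yet the two `.pyc` files differ in the header, which is what makes the resulting packages non-reproducible.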

    @bmwiedemann bmwiedemann mannequin added stdlib Python modules in the Lib dir build The build process and cross-build labels Mar 3, 2017
    @ericvsmith
    Member

    --
    Eric.

    On Mar 3, 2017, at 6:36 AM, Bernhard M. Wiedemann <report@bugs.python.org> wrote:


    ----------
    components: Build, Distutils
    messages: 288880
    nosy: bmwiedemann, dstufft, merwok
    priority: normal
    pull_requests: 353
    severity: normal
    status: open
    title: support reproducible Python builds
    versions: Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6



    @warsaw
    Member

    warsaw commented Mar 3, 2017

    Shouldn't this at least also cover Python 3.7? And should it be officially backported? I would think that if #296 gets accepted for 3.7, then distros that care can cherry pick it back into whatever versions they still support. It probably needn't be officially cherry picked upstream.

    (FWIW, this doesn't affect the Debian ecosystem since we don't ship pycs in debs.)

    @bmwiedemann
    Mannequin Author

    bmwiedemann mannequin commented Mar 4, 2017

    backports are optional.
    It can help reduce duplicated work for the various distributions.
    Currently, I think master and 2.7 are the most relevant targets.

    @bmwiedemann bmwiedemann mannequin added the 3.7 (EOL) end of life label Mar 4, 2017
    @benjaminp
    Contributor

    I have proposed PEP-552 to address this issue.

    @commodo
    Mannequin

    commodo mannequin commented Jan 2, 2018

    Hey,

    Allow me to join the discussion here.

    Context:

    • I'm the maintainer of Python & Python3 in the OpenWrt distro, and (for a while now) we have also cared about reproducible builds.
    • The person [Alexander Couzens] who's leading the effort for OpenWrt has pinged me about Python(3) and packages [to see about making them reproducible].
    • In OpenWrt we *only* ship .pyc files, because of performance considerations [.pyc can be 10x faster than .py on some SoCs] and size limitations [we cannot allow automatic .pyc generation since it can be expensive on RAM (< 32 MB systems) or flash (~8 MB sizes); believe it or not, people run Python on something like this].

    Current status:

    References:

    I wanted to share my [and our] interest in this.

    If we can help in any way, feel free to ping.

    I will try to hack/patch some more stuff in the current Python releases to make them fully reproducible [for us], and probably share the results here.
    Once PEP-552 is implemented and a Python release includes it, we will switch to that release.
    Atm, in trunk we package Python 2.7.14 & Python 3.6.4

    Thanks
    Alex

    @benjaminp
    Contributor

    PEP-552 has been implemented for 3.7.

    @commodo
    Mannequin

    commodo mannequin commented Jan 3, 2018

    Thank you for the heads-up.
    I did not follow the resolution in much depth.

    I just stumbled over this last night.

    Will keep an eye for 3.7, and see about 2.7.

    @brettcannon
    Member

    A disagreement has popped up over what the ideal solution is on the PR currently connected to this issue. I'm having the folks involved switch it over to here.

    IMO I think py_compile can respect SOURCE_DATE_EPOCH and just blindly use it for creating .pyc files. That way builds are reproducible. Yes, it will quite possibly lead to those .pyc files being regenerated the instant Python starts running, but SOURCE_DATE_EPOCH is entirely about builds, not runtimes. Plus .pyc files are just optimizations and so it is not critical they not be regenerated again later.

    @eli-schwartz
    Mannequin

    eli-schwartz mannequin commented Jan 14, 2018

    So, a couple of things.

    It seems to me, that properly supporting SOURCE_DATE_EPOCH means using exactly that and nothing else. To that end, I'm not entirely sure why things like --clamp-mtime even exist, as the original timestamp of a source file doesn't seem to have a lot of utility and it is better to be entirely predictable. But I'm not going to argue that, except insomuch as it seems IMHO to fit better for python to just keep things simple and override the timestamp with the value of SOURCE_DATE_EPOCH

    That being said, I see two problems with python implementing something analogous to --clamp-mtime rather than just --mtime.

    1. Source files are extracted by some build process, and remain untouched. Python generates bytecode pinned to the original time, rather than SOURCE_DATE_EPOCH. Later, the build process packages those files and implements --mtime, not --clamp-mtime. Because Python and the packaging software disagree about which one to use, the bytecode fails.

    2. Source files are extracted, and the build process even tosses all timestamps to the side of the road, by explicitly touching all of them to the date of SOURCE_DATE_EPOCH just in case. Then for whatever reason (distro patches, 2to3, the use of cp) the timestamps get updated to $currentime. But SOURCE_DATE_EPOCH is in the future, so the timestamps get downdated. Python bytecode is generated by emulating --clamp-mtime. The build process then uses --mtime to package the files. Again, because Python and the packaging software disagree about which one to use, the bytecode fails.

    Of course, in both those cases, blindly respecting SOURCE_DATE_EPOCH will seemingly break everything for people who use --clamp-mtime instead. I'm not happy with reproducible-builds.org for allowing either one.
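The disagreement in scenario 1 above can be made concrete with a small sketch. The two functions below are hypothetical helpers, not part of any tool; they only model the two timestamp policies as described here: `--mtime` always substitutes SOURCE_DATE_EPOCH, while `--clamp-mtime` substitutes it only for files newer than that date.

```python
def mtime_override(file_mtime: int, source_date_epoch: int) -> int:
    """Model of --mtime: every file gets SOURCE_DATE_EPOCH."""
    return source_date_epoch


def mtime_clamp(file_mtime: int, source_date_epoch: int) -> int:
    """Model of --clamp-mtime: only files newer than SOURCE_DATE_EPOCH change."""
    return min(file_mtime, source_date_epoch)


SDE = 1_500_000_000               # illustrative SOURCE_DATE_EPOCH value
untouched_source = 1_400_000_000  # extracted from a tarball, older than SDE
patched_source = 1_600_000_000    # touched during the build, newer than SDE

for label, mtime in [("untouched", untouched_source), ("patched", patched_source)]:
    print(label, mtime_override(mtime, SDE), mtime_clamp(mtime, SDE))
```

For the untouched file the two policies yield different timestamps, so bytecode stamped under one policy no longer matches a source file packaged under the other. For the patched file both policies agree.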

    I don't think python should rely on --mtime users manually overriding the filesystem metadata of the source files outside of py_compile, as that is a hack that I think we'd like to remove if possible... that being said, Arch Linux will, on second thought, not be adversely affected even if py_compile tries to be clever and emulate --clamp-mtime to decide on its own whether to respect SOURCE_DATE_EPOCH.

    Likewise, I don't really expect people to try to reproduce builds using a future date for SOURCE_DATE_EPOCH. On the other hand, the reproducible builds spec doesn't forbid it AFAICT.

    But... neither of those mitigations seem "clean" to me, for the reasons stated above.

    There is something that would solve all these issues, though. From reading the importlib code (I haven't actually tried smoketesting actual imports), it appears that Python 2 accepts any bytecode that is dated at or later than the timestamp of its source .py, while Python 3 requires the timestamps to perfectly match. This seems bizarre to behave differently, especially as until @bmwiedemann mentioned it on the GitHub PR I blindly assumed that Python would not care if your bytecode is somehow dated later than your sources. If the user is playing monkey games with mismatched source and byte code, while backdating the source code to *trick* the interpreter into loading it... let them? They can break their stuff if they want to!

    On looking through the commit logs, it seems that Python 3 used to do the same, until 61b1425 refactored the general vicinity and modified this behavior without warning. In a commit that seems to be designed to do something else entirely. This really should have been two separate commits, and modifying the import code to more strictly check the timestamp should have come with an explanatory justification. Because I cannot think of a good reason for this behavior, and the commit isn't giving me an opportunity to understand either. As it is, I am completely confused, and have no idea whether this was even supposed to be deliberate.
    In hindsight it is certainly preventing nice solutions to supporting SOURCE_DATE_EPOCH.

    @brettcannon
    Member

    As Eli's comments are coming off as negative to/at me, I feel like I have
    to defend myself here. If you look at the commit, there were actually two
    places where the timestamp was checked; one did an equality comparison and
    one did a >= comparison. It's quite possible the semantics accidentally
    changed as part of the refactoring due to the check being done in different
    places and a different one was copied, although no one has even noticed
    until now.

    If there is a desire to change the semantics of how timestamps are checked,
    then that should be done in a separate issue, as at this point we have lived
    with the current semantics for several releases -- all releases of Python 3
    still receiving security updates -- so it's past being a bug and is now
    the semantics in Python 3.


    @bmwiedemann
    Mannequin Author

    bmwiedemann mannequin commented Jan 15, 2018

    I think, there is no single nice and clean solution with time-based .pyc files, but to get a whole distribution to build reproducibly, there are two other ways:

    1. if the SOURCE_DATE_EPOCH environment variable is set,
      make hash-based .pyc files the default.

    2. instead of storing .py mtime in the .pyc header, use the .pyc's filesystem mtime value - also making it more available to users.
      Not sure if this would have side-effects or cause regressions.

    on the side-issue: IMHO checking exact mtimes is the right thing to do, because sometimes users will copy back old .py files and expect mismatching .pyc files to not be used.

    @brettcannon
    Member

    Bernhard's idea of SOURCE_DATE_EPOCH being an implicit envvar to forcibly switch on hash-based .pyc files in py_compile is intriguing. I assume this would force the check_source bit to be set? Or since SOURCE_DATE_EPOCH should only be used in build scenarios would you want UNCHECKED_HASH?

    As the core dev who seems the most engaged and willing to commit this, I'm willing to make the final decision on this and commit the final PR. I see the options of getting this into 3.7 as the following:

    1. SOURCE_DATE_EPOCH acts as an environment variable flag to forcibly generate hash-based .pyc files with the check_source bit set in py_compile and compileall
    2. SOURCE_DATE_EPOCH is used to specifically set the timestamp in .pyc files in py_compile and compileall

    That's it. No clamping, no changing how timestamp-based .pyc files are invalidated, no touching source files, etc.

    If this is going to make it into Python 3.7 then a decision must be made by Friday, Jan 19, so have your opinions on those two options in before then (and in the case of the hash-based solution, would you expect CHECKED_HASH or UNCHECKED_HASH?). At that point I will make a decision and Bernhard can either update his PR or I can create a new one forked from his (I leave that up to Bernhard based on the decision I'll make on/by Friday).
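For readers unfamiliar with the modes being weighed here, the sketch below compiles a throwaway file under each `py_compile.PycInvalidationMode` and reads the flags word from the header. It assumes the PEP 552 layout, in which bit 0 marks a hash-based pyc and bit 1 is the check_source bit; `header_flags` is an illustrative helper name.

```python
import os
import py_compile
import struct
import tempfile
from py_compile import PycInvalidationMode


def header_flags(mode):
    """Compile a tiny module with the given mode and return its header flags."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "mod.py")
        with open(src, "w") as f:
            f.write("x = 1\n")
        cfile = py_compile.compile(
            src, cfile=src + "c", doraise=True, invalidation_mode=mode
        )
        with open(cfile, "rb") as f:
            header = f.read(8)
    # Bytes 4..8 of the header hold the little-endian flags word.
    return struct.unpack("<I", header[4:8])[0]


flags = {mode.name: header_flags(mode) for mode in PycInvalidationMode}
for name, value in flags.items():
    print(f"{name}: {value:#04b}")
```

With CHECKED_HASH the importer re-hashes the source on import; with UNCHECKED_HASH it trusts the pyc, which is why the choice matters for packaged files that will never be edited in place.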

    @brettcannon brettcannon self-assigned this Jan 15, 2018
    @warsaw
    Member

    warsaw commented Jan 15, 2018

    On Jan 15, 2018, at 11:31, Brett Cannon <report@bugs.python.org> wrote:

    1. SOURCE_DATE_EPOCH acts as an environment variable flag to forcibly generate hash-based .pyc files with the check_source bit set in py_compile and compileall
    2. SOURCE_DATE_EPOCH is used to specifically set the timestamp in .pyc files in py_compile and compileall

    I’d suggest that if SDE is set to an integer, that is used as the timestamp. If it’s set to a special symbol (e.g. ‘hash’) then the hash is used. I’m not volunteering to write the code though. :)

    @brettcannon
    Member

    Since Barry chose an option that wasn't listed, I'm planning on accepting Bernhard's #5200 at some point next week barring any new, unique objections.

    @brettcannon
    Member

    New changeset ccbe581 by Brett Cannon (Bernhard M. Wiedemann) in branch 'master':
    bpo-29708: Setting SOURCE_DATE_EPOCH forces hash-based .pyc files (GH-5200)
    ccbe581

    @brettcannon
    Member

    Just merged Bernhard's PR which forces hash-based .pyc files. Thanks to everyone who constructively helped reach this point.
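A quick way to see the merged behavior is to compile without an explicit `invalidation_mode` and check the header flags with and without the variable set. This is a sketch, assuming Python 3.7.2 or later, where (per the py_compile documentation) SOURCE_DATE_EPOCH determines the default mode rather than overriding an explicit argument; `default_flags` is an illustrative helper.

```python
import os
import py_compile
import struct
import tempfile


def default_flags(source_date_epoch):
    """Compile with the default invalidation mode; return the header flags."""
    saved = os.environ.pop("SOURCE_DATE_EPOCH", None)
    try:
        if source_date_epoch is not None:
            os.environ["SOURCE_DATE_EPOCH"] = str(source_date_epoch)
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "mod.py")
            with open(src, "w") as f:
                f.write("x = 1\n")
            cfile = py_compile.compile(src, cfile=src + "c", doraise=True)
            with open(cfile, "rb") as f:
                header = f.read(8)
        return struct.unpack("<I", header[4:8])[0]
    finally:
        # Restore the caller's environment exactly as it was.
        os.environ.pop("SOURCE_DATE_EPOCH", None)
        if saved is not None:
            os.environ["SOURCE_DATE_EPOCH"] = saved


without_sde = default_flags(None)
with_sde = default_flags(1_500_000_000)
print(without_sde, with_sde)
```

A flags value of 0 indicates a timestamp-based pyc; a value with bits 0 and 1 set indicates a checked hash-based pyc.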

    @brettcannon
    Member

    New changeset cab0b2b by Brett Cannon in branch 'master':
    bpo-29708: Add What's New entries for SOURCE_DATE_EPOCH and py_compile (GH-5306)
    cab0b2b

    @commodo
    Mannequin

    commodo mannequin commented Jan 25, 2018

    Hey,

    Sorry, if I'm a bit late to the party with this.
    The road to reproducible builds has a few more steps.

    The way I validate whether Python is reproducible is with this link:
    https://tests.reproducible-builds.org/lede/lede_ar71xx.html

    There is a need to also patch getbuildinfo.c to make Python reproducible.

    I have opened a PR for this : #5313

    I've waited for the periodic build to trigger on that reproducible page.
    In OpenWrt, the packages to look for [those affected by this getbuildinfo.c patch] are python-base & python3-base.

    There are still some python3 packages that need patching.
    Seems that python3-asyncio, pydoc, and some other pyc files need investigation.
    I'll check.
    Maybe this isn't an issue in 3.7.

    Alex

    @brettcannon brettcannon reopened this Jan 25, 2018
    @brettcannon brettcannon removed their assignment Jan 30, 2018
    @bmwiedemann
    Mannequin Author

    bmwiedemann mannequin commented Jan 31, 2018

    Any chance we can get the (somewhat related) patch for https://bugs.python.org/issue30693 also merged?

    @WillThompson
    Mannequin

    WillThompson mannequin commented Mar 6, 2018

    For what it's worth, in Endless OS we still saw slight variations between builds in the .pyc files, even with all the source files' mtimes set to the epoch (ie. equivalent to setting & supporting SOURCE_DATE_EPOCH, I believe). Looking at the contents of the file suggested it was just reordering of class fields; indeed, we only saw this on Python versions where hash randomization is enabled by default, and disabling hash randomization made the output reproducible.
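The hash randomization effect mentioned here can be reproduced outside of any build system. The sketch below launches child interpreters with a chosen PYTHONHASHSEED and prints the iteration order of a set of strings; the snippet and helper name are illustrative. Seed 0 disables randomization, so two runs with it should always agree, while different non-zero seeds will often (though not necessarily) produce different orders.

```python
import os
import subprocess
import sys

SNIPPET = "print(list({'alpha', 'beta', 'gamma', 'delta', 'epsilon'}))"


def set_order(seed):
    """Run a child interpreter with the given hash seed; return the set order."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


fixed_a = set_order(0)
fixed_b = set_order(0)
print(fixed_a == fixed_b)            # stable when randomization is disabled
print(set_order(1), set_order(2))    # may differ between seeds
```

Any set or frozenset whose iteration order leaks into the marshalled bytecode inherits this variability unless the seed is pinned for the build.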

    @commodo
    Mannequin

    commodo mannequin commented Mar 6, 2018

    Yeah, I also see it with 3.6.4.
    I wanted to try 3.7 to see if it's fixed by chance.

    Otherwise I may have to start digging deep into compilation logic.

    Looking here:
    https://tests.reproducible-builds.org/lede/lede_ar71xx.html

    More specifically here:
    https://tests.reproducible-builds.org/lede/dbd/packages/mips_24kc/packages/python3-asyncio_3.6.4-5_mips_24kc.ipk.html
    it looks like 2 byte-codes are inverted

    build1: 00007f80: 0100 003e 0200 0000 72b6 0000 0072 b500  ...>....r....r..
    build2: 00007f80: 0100 003e 0200 0000 72b5 0000 0072 b600  ...>....r....r..

    72b6 and 72b5 like to swap positions sometimes.

    @methane
    Member

    methane commented Mar 7, 2018

    00007f80: 0100 003e 0200 0000 72b6 0000 0072 b500  ...>....r....r..
    vs
    00007f80: 0100 003e 0200 0000 72b5 0000 0072 b600  ...>....r....r..

    3e 02 00 00 00 is frozenset(size=2)
    72 b6/b5 00 00 00 is reference to b5 or b6

    So it seems the set order changed (or the order in which the items of the set appear changed).
    Did you set PYTHONHASHSEED?

    Anyway, I think Python 3.7 can't guarantee "reproducible" compile because marshal uses reference count.
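The byte-level reading above can be checked against the marshal module itself. This is a small sketch, assuming CPython's marshal format, in which a frozenset is written as the type byte `>` (0x3e) followed by a 4-byte little-endian size, and the high bit of a type byte (0x80, called FLAG_REF in marshal.c) may be set when the object can be referenced again later.

```python
import marshal

FLAG_REF = 0x80  # high bit of the type byte; masked off before comparing

data = marshal.dumps(frozenset({"a", "b"}))

type_code = data[0] & ~FLAG_REF & 0xFF      # strip the reference flag
size = int.from_bytes(data[1:5], "little")  # number of items that follow

print(hex(type_code), size)
```

The items are written after the size field in whatever order the writer iterates them, which is why a change in set ordering shows up as two swapped references in the hex dump.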

    @methane methane added 3.10 only security fixes and removed 3.9 only security fixes labels Dec 31, 2020
    @vstinner
    Member

    note the optimized .pyc is deterministic. As far as I know only __debug__ is set to False, or is there something else different?

    Hum, maybe there is a misunderstanding on the PEP-552 purpose.

    I understood that the main point of the PEP-552 is to compare hash(<source code>), rather than checking the .py and .pyc file modification time.

    It doesn't magically make the PYC file content fully reproducible. Correct me if I misunderstood PEP-552 as well.

    @benjaminp
    Contributor

    PEP-552 was a necessary but not sufficient step on the road towards fully deterministic pycs. The PEP says: "(Note there are other problems [1] [2] we do not address here that can make pycs non-deterministic.)" where [1] and [2] are basically the issues Inada-san has linked.

    @zooba
    Member

    zooba commented Feb 3, 2021

    This doesn't seem to necessarily impact distutils, so I'm leaving it open despite PEP-632.

    @zooba zooba removed the stdlib Python modules in the Lib dir label Feb 3, 2021
    @obfusk
    Mannequin

    obfusk mannequin commented Apr 22, 2021

    Hi! I've been working on reproducible builds for python-for-android [1,2,3].

    Current issues with .pyc files are:

    • .pyc files differ depending on whether Python was compiled w/ liblzma-dev installed or not;
    • many .pyc files include build paths;
    • some .pyc files include paths to system utilities, like /bin/mkdir or /usr/bin/install, which can differ between systems (e.g. on Debian w/ merged /usr).

    [1] kivy/python-for-android#2390
    [2] https://lists.reproducible-builds.org/pipermail/rb-general/2021-January/002132.html
    [3] https://lists.reproducible-builds.org/pipermail/rb-general/2021-March/002207.html
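The build-path issue in the list above follows from how code objects are serialized: the file name passed to the compiler is stored in `co_filename` and marshalled along with the bytecode. A minimal sketch, using made-up build directories:

```python
import marshal

source = "def greet():\n    return 'hi'\n"

# Same source text, compiled as if it lived in two different build trees.
build_a = compile(source, "/build/worker-1/pkg/mod.py", "exec")
build_b = compile(source, "/build/worker-2/pkg/mod.py", "exec")

dump_a = marshal.dumps(build_a)
dump_b = marshal.dumps(build_b)

print(build_a.co_filename)
print(b"/build/worker-1/pkg/mod.py" in dump_a)
print(dump_a == dump_b)
```

Because the path is part of the serialized data, two builds in different directories produce different pyc payloads even when every timestamp and hash seed is pinned.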

    @AraHaan
    Contributor

    AraHaan commented May 28, 2022

    For me, I find that manually rewriting my py files in C using the C python api made my builds reproducible (I think) as well as faster.

    But luckily C is not the only option now, with .NET becoming better at it, I have made hacks to use .NET Core in Python for cross platform python extensions that do not need to be rebuilt based on the OS 🥳 (however this is slow because the resulting binary is not crossgen'd for the OS's specific cpu).

    Another good thing with .NET is that then I could use version checking of python before calling newer python C api functions that might not exist in the version they are using 🎉.

    Another cool thing I find with the standalone Windows version is that the Python standard library is pyc-compiled and placed in a zip file. It would be nice if all distributions switched to that, so that the "compile the standard library" option in the installers could be removed, as the files would already be compiled and in a zip file. That could speed up Python on all of its distributions (and I think it could help make them reproducible).

    @zhuofeng6

    zhuofeng6 commented Jun 1, 2022

    #93317 The bug I encountered was not caused by the timestamp; I have tools that control timestamps to keep them consistent.

    If I turn off multithreaded compilation, it doesn't appear again. I think it's caused by these characters; the strings are "{", "{{", "}", "}"


    There may be DA/FA differences in some pyc files in the Python standard library. If two Pythons with differences in the standard library pyc files are used to compile the same py file containing the "{" string, the generated pyc will also have DA/FA differences. This is because some pyc files in the standard library are automatically loaded when the python -m compileall xxx command is executed. When such a pyc file is loaded, TYPE_SHORT_ASCII_INTERNED causes the corresponding string to be interned, and marshal.c can see this, so when a new pyc is produced, "{" will also be written as the TYPE_SHORT_ASCII_INTERNED type.
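One plausible reading of "DA/FA" is that these are the two marshal type bytes involved, in hexadecimal. The sketch below works through the arithmetic, assuming the constants from CPython's Python/marshal.c (`TYPE_SHORT_ASCII` is `'z'`, `TYPE_SHORT_ASCII_INTERNED` is `'Z'`, `FLAG_REF` is 0x80); the interpretation of the comment is an assumption, not something the thread confirms.

```python
FLAG_REF = 0x80                       # set when the object may be referenced later
TYPE_SHORT_ASCII = ord("z")           # short ASCII string, not interned
TYPE_SHORT_ASCII_INTERNED = ord("Z")  # short ASCII string, interned

interned_with_ref = TYPE_SHORT_ASCII_INTERNED | FLAG_REF
plain_with_ref = TYPE_SHORT_ASCII | FLAG_REF

print(f"{interned_with_ref:02X} vs {plain_with_ref:02X}")
```

Under that reading, a single byte flips between the two values depending on whether the string happened to be interned when the pyc was written.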

    @methane
    Member

    methane commented Jun 1, 2022

    Thank you for reporting. It is another source of non-deterministic pyc files, as you said.

    Single-codepoint strings (and the empty string) are shared across the interpreter.
    "{" is not a name, so compile() doesn't intern it. But since it is a singleton, the same instance is shared from many places.

    The simplest solution is interning all singletons in _PyRuntimeState_Init.

    @zhuofeng6

    Thank you for reporting. It is another source of non-deterministic pyc files, as you said.

    Single-codepoint strings (and the empty string) are shared across the interpreter. "{" is not a name, so compile() doesn't intern it. But since it is a singleton, the same instance is shared from many places.

    The simplest solution is interning all singletons in _PyRuntimeState_Init.

    Do you have any plans to fix this bug in the near future?

    @methane
    Member

    methane commented Jun 6, 2022

    Do you have any plans to fix this bug in the near future?

    I will fix it before 3.12 reaches beta if no one else fixes it by then.

    @AraHaan
    Contributor

    AraHaan commented Jun 6, 2022

    I wish 3.12 was 4.0 so that way the 2to3 would get nuked. But most likely 4.0 will land in 2050.

    @vstinner
    Member

    vstinner commented Jun 6, 2022

    I wish 3.12 was 4.0 so that way the 2to3 would get nuked. But most likely 4.0 will land in 2050.

    2to3 has been deprecated since Python 3.11, so it can be removed in Python 3.13 (or later): #84540. The problem is the parser implementation in lib2to3, which doesn't support the Python 3.10 grammar.

    @AraHaan
    Contributor

    AraHaan commented Jun 12, 2022

    What if the entire Python standard library were rewritten in C? I just thought of this, and I think it would help eliminate this issue as well as make the interpreter faster, because then it could be further optimized.

    As for where to place the resulting code, why not place it in the pythonX.Y.dll file (or the equivalent file on non-Windows platforms) to avoid any extra files (pyd files) as well?

    Alternatively, Python could build Cython into its compile() function under Python to compile scripts into pyd's. However, the generated C code would have to be deleted so as not to waste a ton of disk space.

    @ericvsmith
    Member

    What if the entire Python standard library were rewritten in C?

    This is such an enormous task that it will never happen. Plus it would make CPython harder to maintain, and make it harder to share anything in the stdlib with other implementations.

    @arhadthedev
    Member

    What if the entire Python standard library were rewritten in C?

    Lots of pointer management boilerplate, plus Python operators will inflate into lengthy function calls.

    @arhadthedev
    Member

    Alternatively, Python could build Cython into its compile() function under Python to compile scripts into pyd's. However, the generated C code would have to be deleted so as not to waste a ton of disk space.

    Using Cython as an optional feature enabled in configure and PCBuild looks promising.

    However, we need a person who is ready to add the lines into configure.ac, Makefile.pre.in, PCBuild/get_externals.bat, relevant Visual Studio projects, and possibly submit necessary patches to Cython to support such a use case.

    @zhuofeng6

    If compiled with PGO twice, it will also produce inconsistencies, and there are many different places.

    What's the reason for this?

    @methane
    Member

    methane commented Jun 17, 2022

    If compiled with PGO twice, it will also produce inconsistencies, and there are many different places.

    I don't understand what you are saying.
    Please provide concrete and complete steps to reproduce.

    @mcepl
    Contributor

    mcepl commented Aug 1, 2022

    If compiled with PGO twice, it will also produce inconsistencies, and there are many different places.

    I don't understand what you are saying. Please provide concrete and complete steps to reproduce.

    I don’t understand why #93317 didn’t get linked here.

    @merwok
    Member

    merwok commented Aug 1, 2022

    It was, but the UI is not very obvious: #73894 (reference)

    @compete2cooperate

    I am a novice at Python compilation and am trying to get deterministic pyc files. As the simplest use case, I cloned a git repo with a sample .py file into two places (src1 and src2) inside the same build environment (a Docker container) and compiled both of them with invalidation_mode=CHECKED_HASH and SOURCE_DATE_EPOCH set. But I can see that the hashes of the corresponding .pyc files are different. (The source files in src1 and src2 have the same hash but different file attributes, notably mtime.)

    How can I compile the source files in src1 and src2 so that the corresponding pyc files have the same hash? I am using Python 3.10.12. My apologies if this is not the right place to ask.

    @mcepl
    Contributor

    mcepl commented Feb 26, 2024

    On Mon Feb 26, 2024 at 5:58 PM CET, compete2cooperate wrote:

    How can I compile the source files in src1 and src2 so that the corresponding pyc files have the same hash? I am using Python 3.10.12. My apologies if this is not the right place to ask.

    Look at patches we, openSUSE, have in our packages. Particularly relevant are I believe distutils-reproducible-compile.patch, gh-78214-marshal_stabilize_FLAG_REF.patch, and bpo-37596-make-set-marshalling.patch. Perhaps also python-3.3.0b1-fix_date_time_compiler.patch (I am not sure whether that one is not actually obsolete).
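Independently of those patches, one difference between two checkouts that can be removed with the stock py_compile API is the source path recorded in the code object. The sketch below is an illustration under assumptions, not a guaranteed fix: it compiles the same source in two temporary directories standing in for src1 and src2, passing the same `dfile` value (the name `pkg/mod.py` is made up) and using CHECKED_HASH so the header does not depend on mtime. Other sources of variation discussed in this thread may still apply.

```python
import marshal
import os
import py_compile
import tempfile

SOURCE = "def answer():\n    return 42\n"


def compile_checkout(checkout_dir, logical_path="pkg/mod.py"):
    """Compile one checkout's copy of mod.py and return the raw .pyc bytes."""
    src = os.path.join(checkout_dir, "mod.py")
    with open(src, "w") as f:
        f.write(SOURCE)
    cfile = py_compile.compile(
        src,
        cfile=src + "c",
        dfile=logical_path,  # recorded as co_filename instead of the real path
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
    )
    with open(cfile, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as src1, tempfile.TemporaryDirectory() as src2:
    pyc1 = compile_checkout(src1)
    pyc2 = compile_checkout(src2)

# The first 16 bytes are the header: magic, flags, 8-byte source hash.
header1, header2 = pyc1[:16], pyc2[:16]
filename1 = marshal.loads(pyc1[16:]).co_filename
filename2 = marshal.loads(pyc2[16:]).co_filename
print(header1 == header2, filename1, filename2)
```

Comparing the two files byte for byte (for example with diffoscope, as used elsewhere in this thread) then shows whether anything beyond the header and path still differs.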
