Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproducible pyc: FLAG_REF is not stable. #78274

Closed
methane opened this issue Jul 11, 2018 · 19 comments
Closed

Reproducible pyc: FLAG_REF is not stable. #78274

methane opened this issue Jul 11, 2018 · 19 comments
Labels
3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@methane
Copy link
Member

methane commented Jul 11, 2018

BPO 34093
Nosy @vstinner, @benjaminp, @methane, @ericsnowcurrently, @serhiy-storchaka, @ctismer, @yan12125
PRs
  • gh-78214: marshal: Stabilize FLAG_REF usage #8226
  • bpo-34093: Stablize FLAG_REF usage (two-pass version) #8293
  • gh-78274: Produce consistent output from marshal.dumps(). #28379
  • Files
  • bm_marshal.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2018-07-11.10:36:33.741>
    labels = ['interpreter-core', 'type-bug', '3.11']
    title = 'Reproducible pyc: FLAG_REF is not stable.'
    updated_at = <Date 2021-09-20.15:21:12.255>
    user = 'https://github.com/methane'

    bugs.python.org fields:

    activity = <Date 2021-09-20.15:21:12.255>
    actor = 'eric.snow'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2018-07-11.10:36:33.741>
    creator = 'methane'
    dependencies = []
    files = ['47683']
    hgrepos = []
    issue_num = 34093
    keywords = ['patch']
    message_count = 14.0
    messages = ['321435', '321448', '321521', '321523', '321524', '321527', '321528', '321529', '321611', '321622', '347970', '401979', '402236', '402244']
    nosy_count = 7.0
    nosy_names = ['vstinner', 'benjamin.peterson', 'methane', 'eric.snow', 'serhiy.storchaka', 'Christian.Tismer', 'yan12125']
    pr_nums = ['8226', '8293', '28379']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue34093'
    versions = ['Python 3.11']

    @methane methane added extension-modules C modules in the Modules dir 3.8 only security fixes labels Jul 11, 2018
    @methane
    Copy link
    Member Author

    methane commented Jul 11, 2018

    PR-8226 makes marshal two-pass. It may have small overhead.

    In case of compiling module, marshal performance is negligible.
    But how in other cases? Should this change optional?

    And should we backport this to Python 3.7?
    Or should distributors cherrypick this?

    @methane
    Copy link
    Member Author

    methane commented Jul 11, 2018

    marshal: Mean +- std dev: [master] 123 us +- 7 us -> [patched] 173 us +- 2 us: 1.41x slower (+41%)
    compile+marshal: Mean +- std dev: [master] 5.28 ms +- 0.02 ms -> [patched] 5.47 ms +- 0.34 ms: 1.04x slower (+4%)

    @serhiy-storchaka
    Copy link
    Member

    Look also at alternate patches for bpo-20416. Some of them can solve this problem for simple types. If they have better performance, using them for simple types could save a time. But this will complicate a code, and I'm not sure it is worth.

    @vstinner
    Copy link
    Member

    According to Serhiy Storchaka, currently marshal.dumps() writes frozenset in arbitrary order, and so frozenset serialization is not reproducible:
    https://mail.python.org/pipermail/python-dev/2018-July/154604.html

    @vstinner
    Copy link
    Member

    What is the time spent in marshal.dumps() at Python startup when Python has to create all .pyc files? For example "./python -c pass" in the master branch with no external dependency? My question is if the PR makes Python startup 5% slower or less than 1% slower.

    @methane
    Copy link
    Member Author

    methane commented Jul 12, 2018

    STINNER Victor <vstinner@redhat.com> added the comment:

    According to Serhiy Storchaka, currently marshal.dumps() writes frozenset in arbitrary order, and so frozenset serialization is not reproducible:
    https://mail.python.org/pipermail/python-dev/2018-July/154604.html

    PYTHONHASHSEED can be used to stable frozenset order.

    On the other hand, refcnt based approach is more unstable.
    Even when x is y, dumps(x) == dumps(y) is not guaranteed.

    @methane
    Copy link
    Member Author

    methane commented Jul 12, 2018

    STINNER Victor <vstinner@redhat.com> added the comment:

    What is the time spent in marshal.dumps() at Python startup when Python has to create all .pyc files? For example "./python -c pass" in the master branch with no external dependency? My question is if the PR makes Python startup 5% slower or less than 1% slower.

    When startup, Python does more than compile()+marshal.dumps().
    And as I wrote above, it makes compile()+marshal.dumps() only 4% slower.
    So startup must not be slower than 4%.

    Additionally, it happens only once if pyc can be writable.
    (I don't know if marshal.dumps() is called when open(cache_path, 'wb') failed)

    @vstinner
    Copy link
    Member

    So startup must not be slower than 4%.

    I know. But Python does more than compile()+dumps() at the first run. I'm curious if it is feasible to measure this cost. But it may be hard to get reliable benchmarks, since I expect that the difference will be very small, and I know very well that measuring Python startup is hard since it depends a lot of on the filesystem which is hard to measure.

    @ctismer
    Copy link
    Contributor

    ctismer commented Jul 13, 2018

    Why must this become slower?

    To my knowledge, many projects prefer marshal over pickle
    for suitable simple objects because it is
    so very fast. I would not throw that away:

    Would it not be easy to add a named optional keyword
    argument, like "stable=True"?

    @methane
    Copy link
    Member Author

    methane commented Jul 13, 2018

    Would it not be easy to add a named optional keyword
    argument, like "stable=True"?

    My pull request did it.

    But for now, I get hint on ML and overwrote my PR with another way: Use FLAG_REF for all interned strings.

    @vstinner
    Copy link
    Member

    According to Serhiy Storchaka, currently marshal.dumps() writes frozenset in arbitrary order, and so frozenset serialization is not reproducible: https://mail.python.org/pipermail/python-dev/2018-July/154604.html

    I created bpo-37596 "Reproducible pyc: frozenset is not serialized in a deterministic order" to track this issue.

    @ericsnowcurrently
    Copy link
    Member

    FYI, I unknowingly created a duplicate of this issue a few days ago, bpo-45186, and created a PR for it: #28379. Interestingly, while I did that PR independently, it has a lot in common with Inada-san's second PR.

    My interest here is in how frozen modules can be affected by this problem, particularly between debug and non-debug builds. See bpo-45020, where I'm working on freezing all the stdlib modules imported during startup.

    @ericsnowcurrently ericsnowcurrently added interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.11 only security fixes type-bug An unexpected behavior, bug, or error and removed extension-modules C modules in the Modules dir 3.8 only security fixes labels Sep 16, 2021
    @ericsnowcurrently
    Copy link
    Member

    It turns out that I don't need this after all (once I merged #72578 and bpo-45188 was resolved). That impacts how much time I have to spend on this, so I might not be able to pursue this further. That said, I think it is worth doing and the PR I have up mostly does everything we need here. So I'll see if I can follow this through. :)

    @ericsnowcurrently
    Copy link
    Member

    FWIW, I found a faster solution than calling w_object() twice.

    Currently the logic for w_ref() (used for each "complex" object) looks like this:

    • if ob_ref == 1
      • do not apply FLAG_REF
      • marshal normally
    • else if seen for the first time
      • apply FLAG_REF
      • marshal normally
    • otherwise
      • emit TYPE_REF
      • emit the ref index of the first instance

    The faster solution looks like this:

    • if seen for the first time
      • do not apply FLAG_REF
      • marshal normally
      • record the index of the type byte in the output stream
    • else if seen for a second time
      • apply FLAG_REF to the byte at the earlier-recorded position
      • emit TYPE_REF
      • emit the ref index of the first instance
    • otherwise
      • emit TYPE_REF
      • emit the ref index of the first instance

    While this is faster, there are two downsides: extra memory usage and it isn't practical when writing to a file. However, I don't think either is a significant problem. For the former, it can be mostly mitigated by using the negative values in WFILE.hashtable to store the type byte position. For the latter, "marshal.dump()" is already a light wrapper around "marshal.dump()" and for PyMarshal_WriteObjectToFile() we simply stick with the current unstable approach (or change it to do what "marshal.dump()" does).

    FYI, I mostly have that implemented in a branch, but am not sure when I'll get back to it.

    @encukou
    Copy link
    Member

    encukou commented Mar 19, 2024

    Since this is no longer needed by CPython, I recommend closing the issue (and PR).
    Full disclosure: I helped with a third-party project that implements the second pass: https://github.com/fedora-python/marshalparser
    IMO, having a second pass on PyPI is fine.

    I'll close next week if there are no objections.

    @vstinner
    Copy link
    Member

    Since this is no longer needed by CPython, I recommend closing the issue (and PR).

    Do you mean that the initial issue was fixed? PYC are now reproducible?

    @encukou
    Copy link
    Member

    encukou commented Mar 19, 2024

    No. I mean that they don't need to be reproducible. See Guido's comment: #28379 (comment)

    @vstinner
    Copy link
    Member

    Ok, but this issue is about making PYC files reproducible to reproducible builds.

    @methane: What do you think?

    @methane
    Copy link
    Member Author

    methane commented Mar 19, 2024

    Wheel doesn't contain pyc files. So Python ecosystem around PyPI doesn't need reproducible pyc.

    Full disclosure: I helped with a third-party project that implements the second pass: https://github.com/fedora-python/marshalparser
    IMO, having a second pass on PyPI is fine.

    I think its fine too.

    @encukou encukou closed this as not planned Won't fix, can't repro, duplicate, stale Mar 26, 2024
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants