Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unreproducible bytecode: set order depends on random seed for compiled bytecode #88016

Closed
FFY00 opened this issue Apr 14, 2021 · 5 comments
Closed
Labels
type-bug An unexpected behavior, bug, or error

Comments

@FFY00
Copy link
Member

FFY00 commented Apr 14, 2021

BPO 43850
Nosy @tiran, @benjaminp, @markshannon, @1st1, @FFY00
PRs
  • bpo-43850: write sets reproducibly in marshalled code #25411
  • Superseder
  • bpo-29708: support reproducible Python builds
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-04-15.00:33:16.349>
    created_at = <Date 2021-04-14.22:31:26.691>
    labels = ['type-bug']
    title = 'unreproducible bytecode: set order depends on random seed for compiled bytecode'
    updated_at = <Date 2021-04-15.00:33:16.348>
    user = 'https://github.com/FFY00'

    bugs.python.org fields:

    activity = <Date 2021-04-15.00:33:16.348>
    actor = 'benjamin.peterson'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-04-15.00:33:16.349>
    closer = 'benjamin.peterson'
    components = []
    creation = <Date 2021-04-14.22:31:26.691>
    creator = 'FFY00'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 43850
    keywords = ['patch']
    message_count = 5.0
    messages = ['391104', '391111', '391112', '391113', '391114']
    nosy_count = 5.0
    nosy_names = ['christian.heimes', 'benjamin.peterson', 'Mark.Shannon', 'yselivanov', 'FFY00']
    pr_nums = ['25411']
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '29708'
    type = 'behavior'
    url = 'https://bugs.python.org/issue43850'
    versions = []

    @FFY00
    Copy link
    Member Author

    FFY00 commented Apr 14, 2021

    Currently, the order of set or frozenset elements when saved to bytecode is dependent on the random seed. This breaks reproducibility.

    Example fail from an Arch Linux package: https://reproducible.archlinux.org/api/v0/builds/88454/diffoscope

    Let's take an example file, test_compile.py

    s = {
        'aaa',
        'bbb',
        'ccc',
        'ddd',
        'eee',
    }
    $ PYTHONHASHSEED=0 python -m compileall --invalidation-mode checked-hash test_compile.py
    $ mv __pycache__ __pycache__1
    $ PYTHONHASHSEED=1 python -m compileall --invalidation-mode checked-hash test_compile.py
    
    $ diff __pycache__/test_compile.cpython-39.pyc __pycache__1/test_compile.cpython-39.pyc
    Binary files __pycache__/test_compile.cpython-39.pyc and __pycache__1/test_compile.cpython-39.pyc differ
    
    $ diff <(xxd __pycache__/test_compile.cpython-39.pyc) <(xxd __pycache__1/test_compile.cpython-39.pyc)
    5,6c5,6
    < 00000040: 005a 0362 6262 5a03 6464 645a 0361 6161  .Z.bbbZ.dddZ.aaa
    < 00000050: 5a03 6363 635a 0365 6565 4e29 01da 0173  Z.cccZ.eeeN)...s

    00000040: 005a 0361 6161 5a03 6363 635a 0364 6464 .Z.aaaZ.cccZ.ddd
    00000050: 5a03 6565 655a 0362 6262 4e29 01da 0173 Z.eeeZ.bbbN)...s

    I believe the issue is in the marshall module. Particularly, this line[1]. My simple fix was to create a list from the set, sort it, and iterate over it instead.

    [1]

    while (_PySet_NextEntry(v, &pos, &value, &hash)) {

    @FFY00 FFY00 added type-bug An unexpected behavior, bug, or error labels Apr 14, 2021
    @FFY00
    Copy link
    Member Author

    FFY00 commented Apr 14, 2021

    I just realized my fix is wrong because list.sort does not handle different types. Similarly to other reproducibility fixes, how does skipping the item randomization when SOURCE_DATE_EPOCH is set sound?

    @FFY00
    Copy link
    Member Author

    FFY00 commented Apr 14, 2021

    Nevermind, AFAIK that depends on the hash seed, correct? So, the most viable option to me would be a sorting algorithm that could take type into account. Would that be an acceptable solution?

    @FFY00
    Copy link
    Member Author

    FFY00 commented Apr 15, 2021

    Sorry for the spam, I am trying to figure out the best option here, which is hard to do by myself.

    IMO it would be reasonable to create set objects with elements in the order they appear in the code, instead of based on the hash. I am not really sure where is the code responsible for this, and if there are any limitations preventing this from being implemented.

    So, my question are: Would you consider this reasonable? Is there anything I am missing?
    If there are no issues, could someone point me to the target code?

    @benjaminp
    Copy link
    Contributor

    Let's keep any discussion on the preëxisting issue for this.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants