classification
Title: Reproducible pyc: FLAG_REF is not stable.
Type:
Stage:
Components: Extension Modules
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Christian.Tismer, benjamin.peterson, inada.naoki, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2018-07-11 10:36 by inada.naoki, last changed 2018-07-13 15:52 by inada.naoki.

Files
File name Uploaded Description
bm_marshal.py inada.naoki, 2018-07-11 13:07
Pull Requests
URL Status Linked
PR 8226 open inada.naoki, 2018-07-11 10:37
Messages (10)
msg321435 - Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-07-11 10:40
PR-8226 makes marshal two-pass.  It may add a small overhead.

When compiling a module, the marshal overhead is negligible.
But what about other cases?  Should this change be optional?

And should we backport this to Python 3.7?
Or should distributors cherry-pick this?
msg321448 - Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-07-11 13:07
marshal: Mean +- std dev: [master] 123 us +- 7 us -> [patched] 173 us +- 2 us: 1.41x slower (+41%)
compile+marshal: Mean +- std dev: [master] 5.28 ms +- 0.02 ms -> [patched] 5.47 ms +- 0.34 ms: 1.04x slower (+4%)
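For readers who want to take measurements of this kind themselves, here is a minimal sketch that uses only the standard library `timeit` module. It is not the attached bm_marshal.py, whose contents are not shown here; the sample source and the `bench` helper are made up for illustration, and absolute timings will differ from machine to machine. The last two lines recompute the slowdown ratios from the figures reported above.

```python
import marshal
import timeit

# Hypothetical sample module source; any reasonably sized source would do.
SOURCE = "\n".join(
    f"def func_{i}(a, b=None):\n    return (a, b, 'name_{i}', {i})"
    for i in range(50)
)


def bench(func, number=200, repeat=5):
    """Return the best per-call time in seconds for a zero-argument callable."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


code = compile(SOURCE, "<sample>", "exec")

# Time marshalling alone, then compiling and marshalling together.
marshal_only = bench(lambda: marshal.dumps(code))
compile_and_marshal = bench(
    lambda: marshal.dumps(compile(SOURCE, "<sample>", "exec"))
)

print(f"marshal:         {marshal_only * 1e6:.1f} us")
print(f"compile+marshal: {compile_and_marshal * 1e6:.1f} us")

# Slowdown ratios recomputed from the reported means.
print(f"{173 / 123:.2f}x")    # marshal: 1.41x
print(f"{5.47 / 5.28:.2f}x")  # compile+marshal: 1.04x
```

Running the same script against a master build and a patched build would give a comparison similar in spirit to the one above, although a dedicated benchmarking tool is likely to give more stable numbers.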
msg321521 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-07-12 06:17
See also the alternative patches for issue20416. Some of them can solve this problem for simple types. If they perform better, using them for simple types could save time. But this would complicate the code, and I'm not sure it is worth it.
msg321523 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-12 08:00
According to Serhiy Storchaka, currently marshal.dumps() writes frozenset in arbitrary order, and so frozenset serialization is not reproducible:
https://mail.python.org/pipermail/python-dev/2018-July/154604.html
msg321524 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-12 08:02
What is the time spent in marshal.dumps() at Python startup when Python has to create all .pyc files? For example "./python -c pass" in the master branch with no external dependency? My question is if the PR makes Python startup 5% slower or less than 1% slower.
msg321527 - Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-07-12 08:31
> STINNER Victor <vstinner@redhat.com> added the comment:
>
> According to Serhiy Storchaka, currently marshal.dumps() writes frozenset in arbitrary order, and so frozenset serialization is not reproducible:
> https://mail.python.org/pipermail/python-dev/2018-July/154604.html

PYTHONHASHSEED can be used to stabilize the frozenset order.
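As a rough illustration of the point about PYTHONHASHSEED, the sketch below marshals the same frozenset in two fresh interpreters started with the same seed and compares the resulting bytes. The `CHILD` snippet and the `dump_in_child` helper are hypothetical names chosen for this example. Whether different seeds produce different bytes depends on the CPython version, so the sketch only compares runs that share a seed.

```python
import os
import subprocess
import sys

# Child program: marshal a frozenset of strings and print the bytes as hex.
CHILD = (
    "import marshal; "
    "print(marshal.dumps(frozenset(['spam', 'eggs', 'ham', 'bacon'])).hex())"
)


def dump_in_child(seed):
    """Run CHILD in a fresh interpreter with PYTHONHASHSEED set to `seed`."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


first = dump_in_child(0)
second = dump_in_child(0)

# With the hash seed pinned, both runs are expected to agree.
print(first == second)
```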

On the other hand, the refcnt-based approach is more unstable.
Even when x is y, dumps(x) == dumps(y) is not guaranteed.
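To see where FLAG_REF shows up in the output, the following sketch splits the first byte of a marshal stream into its type code and the FLAG_REF bit (0x80 in CPython's Python/marshal.c). The helper names `top_level_type` and `describe` are made up for this example. The sketch deliberately makes no claim about whether the flag is set for a given object, since that depends on the reference count at the time of the call and on the CPython version.

```python
import marshal

FLAG_REF = 0x80  # high bit of the type byte in the marshal format


def top_level_type(data):
    """Split the first byte of a marshal stream into (type code, FLAG_REF set?)."""
    first = data[0]
    return chr(first & 0x7F), bool(first & FLAG_REF)


def describe(obj):
    """Marshal `obj` and report the type code and flag of its top-level object."""
    return top_level_type(marshal.dumps(obj))


value = 123456
print(describe(value))        # type code 'i'; the flag varies by interpreter
print(describe("some text"))  # type code and flag both vary by interpreter

# Two dumps of the same object may or may not be byte-identical.
print(marshal.dumps(value) == marshal.dumps(123456))
```

Comparing the flag for the same object in different contexts (for example, with and without extra references held elsewhere) is one way to observe the instability described above.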
msg321528 - Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-07-12 08:34
> STINNER Victor <vstinner@redhat.com> added the comment:
>
> What is the time spent in marshal.dumps() at Python startup when Python has to create all .pyc files? For example "./python -c pass" in the master branch with no external dependency? My question is if the PR makes Python startup 5% slower or less than 1% slower.

At startup, Python does more than compile()+marshal.dumps().
And as I wrote above, it makes compile()+marshal.dumps() only 4% slower.
So startup must not be slower than 4%.

Additionally, it happens only once if the pyc file can be written.
(I don't know whether marshal.dumps() is called when open(cache_path, 'wb') fails.)
msg321529 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-12 08:41
> So startup must not be slower than 4%.

I know. But Python does more than compile()+dumps() at the first run. I'm curious whether it is feasible to measure this cost. It may be hard to get reliable benchmarks, though, since I expect the difference to be very small, and I know very well that measuring Python startup is hard since it depends a lot on the filesystem, which is itself hard to measure.
msg321611 - Author: Christian Tismer (Christian.Tismer) * (Python committer) Date: 2018-07-13 13:52
Why must this become slower?

To my knowledge, many projects prefer marshal over pickle
for suitable simple objects because it is
so very fast. I would not throw that away:

Would it not be easy to add a named optional keyword
argument, like "stable=True"?
msg321622 - Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-07-13 15:52
> Would it not be easy to add a named optional keyword
> argument, like "stable=True"?

My pull request did that.

But since then I got a hint on the mailing list and overwrote my PR with another approach: use FLAG_REF for all interned strings.
History
Date User Action Args
2018-07-13 15:52:08 inada.naoki set messages: + msg321622
2018-07-13 13:52:50 Christian.Tismer set nosy: + Christian.Tismer; messages: + msg321611
2018-07-12 08:41:22 vstinner set messages: + msg321529
2018-07-12 08:34:42 inada.naoki set messages: + msg321528
2018-07-12 08:31:29 inada.naoki set messages: + msg321527
2018-07-12 08:02:49 vstinner set messages: + msg321524
2018-07-12 08:00:40 vstinner set nosy: + vstinner; messages: + msg321523
2018-07-12 06:17:08 serhiy.storchaka set messages: + msg321521
2018-07-12 05:29:02 inada.naoki set nosy: + benjamin.peterson, serhiy.storchaka
2018-07-11 13:07:26 inada.naoki set files: + bm_marshal.py; messages: + msg321448
2018-07-11 10:40:08 inada.naoki set messages: + msg321435; stage: patch review ->
2018-07-11 10:37:58 inada.naoki set keywords: + patch; stage: patch review; pull_requests: + pull_request7779
2018-07-11 10:37:20 inada.naoki link issue34033 dependencies
2018-07-11 10:36:33 inada.naoki create