Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speed up marshal.loads() #63418

Closed
pitrou opened this issue Oct 10, 2013 · 40 comments
Closed

speed up marshal.loads() #63418

pitrou opened this issue Oct 10, 2013 · 40 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@pitrou
Copy link
Member

pitrou commented Oct 10, 2013

BPO 19219
Nosy @warsaw, @brettcannon, @jcea, @ronaldoussoren, @pitrou, @kristjanvalur, @vstinner, @tiran, @serhiy-storchaka
Files
  • marshal_opts4.patch
  • marshal_opts5.patch
  • marshal_opts5a.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2013-10-12.20:31:16.574>
    created_at = <Date 2013-10-10.19:49:11.990>
    labels = ['interpreter-core', 'performance']
    title = 'speed up marshal.loads()'
    updated_at = <Date 2013-10-15.10:31:09.573>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2013-10-15.10:31:09.573>
    actor = 'jcea'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-10-12.20:31:16.574>
    closer = 'pitrou'
    components = ['Interpreter Core']
    creation = <Date 2013-10-10.19:49:11.990>
    creator = 'pitrou'
    dependencies = []
    files = ['32037', '32046', '32049']
    hgrepos = []
    issue_num = 19219
    keywords = ['patch']
    message_count = 40.0
    messages = ['199408', '199425', '199451', '199454', '199455', '199456', '199457', '199459', '199462', '199463', '199464', '199465', '199467', '199468', '199469', '199470', '199471', '199472', '199473', '199474', '199475', '199476', '199477', '199478', '199479', '199480', '199481', '199482', '199483', '199484', '199485', '199486', '199495', '199504', '199505', '199508', '199617', '199618', '199645', '199702']
    nosy_count = 10.0
    nosy_names = ['barry', 'brett.cannon', 'jcea', 'ronaldoussoren', 'pitrou', 'kristjan.jonsson', 'vstinner', 'christian.heimes', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue19219'
    versions = ['Python 3.4']

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 10, 2013

    This patch contains assorted improvements for unmarshalling pyc files. It will also make them ~10% smaller.

    $ ./python -m timeit -s "import marshal; d=marshal.dumps(tuple((i, str(i)) for i in range(1000)))" "marshal.loads(d)"

    -> 3.4 unpatched: 232 usec per loop
    -> 3.4 patched: 96.3 usec per loop
    -> 2.7 (for reference): 76.5 usec per loop

    @pitrou pitrou added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Oct 10, 2013
    @vstinner
    Copy link
    Member

    Why adding ASCII strings, whereas you can add Latin1 (UCS1, U+0000-U+00FF) strings?

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 11, 2013

    Why adding ASCII strings, whereas you can add Latin1 (UCS1, U+0000-U+00FF)
    strings?

    Reasons:

    • most strings in pyc files are pure ASCII (it's like 99% in the stdlib)
    • unmarshalling ASCII strings is faster: you can pass 127 to PyUnicode_New without scanning for non-ASCII chars

    The aim here is to optimize the common cases. There is no reason to further complicate the code for rare cases.

    @vstinner
    Copy link
    Member

    unmarshalling ASCII strings is faster: you can pass 127 to PyUnicode_New without scanning for non-ASCII chars

    Oh, I forgot this pain of the PEP-393. Don't tell me more, it's enough :-)

    @kristjanvalur
    Copy link
    Mannequin

    kristjanvalur mannequin commented Oct 11, 2013

    I imagine that the test for ASCII is cheaper. It corresponds to the new compact internal unicode representation (one byte characters).

    This looks fine. Can you quantify where the speedup comes from? Reading the code, I see we now maintain a small internal buffer in the file object, rather than using stack allocation at the call sites. It is unclear to me how this saves memory, since the amount of memory copying should be the same. Could it be that the speedup is all due to the native 8 bit support for unicode?

    Have you looked at providing a special opcode for a few other magic numbers?(We have that in our own custom marshal format)
    Some very common values are:
    empty tuple
    0
    1

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 11, 2013

    This looks fine. Can you quantify where the speedup comes from?

    From all changes, but mainly the ASCII special-casing and the new
    buffering.

    Reading the code, I see we now maintain a small internal buffer in
    the file object, rather than using stack allocation at the call
    sites. It is unclear to me how this saves memory, since the amount
    of memory copying should be the same.

    No, memory copying is suppressed in many cases.

    Have you looked at providing a special opcode for a few other magic
    numbers?(We have that in our own custom marshal format)
    Some very common values are:
    empty tuple
    0
    1

    It shouldn't be useful since marshal memoizes them anyway: only the
    first appearance in a pyc file would benefit.

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 11, 2013

    > Reading the code, I see we now maintain a small internal buffer in
    > the file object, rather than using stack allocation at the call
    > sites. It is unclear to me how this saves memory, since the amount
    > of memory copying should be the same.

    No, memory copying is suppressed in many cases.

    To clarify: the import logic uses marshal.loads(), not marshal.load().
    So the aim is really to speed up unmarshalling from memory.

    @kristjanvalur
    Copy link
    Mannequin

    kristjanvalur mannequin commented Oct 11, 2013

    Right, in this case, memory copying is avoided.

    Regarding the memoing of 0, empty tuple, etc:
    Special opcode may still benefit because it takes only one byte. So, you save four bytes for each such case. I think it might be worth investigating.

    @serhiy-storchaka
    Copy link
    Member

    • unmarshalling ASCII strings is faster: you can pass 127 to PyUnicode_New without scanning for non-ASCII chars

    You should ensure that loaded bytes are ASCII-only. Otherwise broken or malicious marshalled data will compromise you program. Decoding UTF-8 is so fast as decoding ASCII (with checks) and is almost so fast as memcpy.

    As for output, we could use cached UTF-8 representation of string (always exists for ASCII only strings) before calling PyUnicode_AsUTF8String().

    I'm good with buffering and codes for short strings and tuples (I have not examined a code closely yet), but special casing ASCII looks not so good to me.

    @vstinner
    Copy link
    Member

    "You should ensure that loaded bytes are ASCII-only. Otherwise broken or malicious marshalled data will compromise you program."

    This is not new, see the red warning in marshal doc:

    """
    Warning

    The marshal module is not intended to be secure against erroneous or maliciously constructed data. Never unmarshal data received from an untrusted or unauthenticated source.
    """

    @kristjanvalur
    Copy link
    Mannequin

    kristjanvalur mannequin commented Oct 11, 2013

    We have to make two distinctions here:

    1. Loading data and then running it. This is a bad idea if your data is not trusted. This is what is meant by "marshal" being unsafe.
    2. Loading data and then not running it. This is perfectly fine, because marshal has _no side effects_ when loading. Only actually _running_ untrusted data is what you should be careful about. In fact, using 'marshal' as a cheap and fast pickler for builtin types is actually a good idea because it has no side effects like invoking code. (and I think the comment you refer to should be revised to make this clear)

    So, will simply load ASCII data that is, in fact, not ASCII data, destabilize your program in any way? Or even crash it? If that is true, then we have a problem.

    @vstinner
    Copy link
    Member

    "As for output, we could use cached UTF-8 representation of string (always exists for ASCII only strings) before calling PyUnicode_AsUTF8String()."

    PyUnicode_AsEncodedString(v, "utf8", "surrogatepass") is expensive. I proposed an optimization for the pickle module, Antoine finished the work: see issue bpo-15596. It's exactly what you suggest: reuse PyUnicode_AsUTF8String().

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 11, 2013

    In fact, using 'marshal' as a cheap and fast pickler for builtin types
    is actually a good idea because it has no side effects like invoking
    code.

    It's an unsupported use case. The marshal docs are quite clear:

    """Therefore, the Python maintainers reserve the right to modify
    the marshal format in backward incompatible ways should the need
    arise. If you’re serializing and de-serializing Python objects,
    use the pickle module instead [...]"""

    So, it's a "good idea" as long as you're willing to deal with the
    consequences :-)

    @vstinner
    Copy link
    Member

    marshal and pickle are unsafe, even without the patch attached to the issue. If you consider that it is an issue that should be fixed, please open a new issue. Antoine's patch doesn't make the module less secure, since it was already not secure :)

    Loading untrusted data and executing untrusted code is not supported by Python. Many things should be fixed to support such use case, not only the marshal module. I'm interested by the topic (I wrote the pysandbox project, which is first try), but please discuss it elsewhere.

    @vstinner
    Copy link
    Member

    please discuss it elsewhere.

    Hum, I'm not sure that this word exist, I mean: somethere else :-)

    @kristjanvalur
    Copy link
    Mannequin

    kristjanvalur mannequin commented Oct 11, 2013

    "Therefore, the Python maintainers reserve the right to modify
    the marshal format in backward incompatible ways"

    sure, don't expect such things to survive version changes. (Actually, they have been hitherto, and my version "3" I actually changed, to be so, the initial draft being unable to read version 2 data)
    But for sending stuff over the wire, caching on disk, etc, its perfectly safe and super fast. And what is more, the python developers (that's you and me) are super careful to never _crash_ the interpreter, no matter how broken any input data is.

    @kristjanvalur
    Copy link
    Mannequin

    kristjanvalur mannequin commented Oct 11, 2013

    "Loading untrusted data ... is not supported by Python."
    This is a pretty bold claim. Is this, indeed, a fact? (and yes, "elsewhere" is a word)

    @kristjanvalur
    Copy link
    Mannequin

    kristjanvalur mannequin commented Oct 11, 2013

    Anyway, whether or not Pyhon guarantees this and that wrt. "untrusted" is beside the point and offtopic, as Victor poitns out.

    However: We are, and have always been, careful to fail gracefully if we detect data corruption. Never should the flipping of a bit on a file on the disk cause our program to crash. It is fine when reading a corrupt file that an int(1) turns to int(2), or that we get an UnmarshalError when reading it, or that a "hello world" string turns to "hello $orld". What is not good is if the reading of the corrupt string causes the interpreter to crash.

    My knowledge of the new unicode internals is limited at best. If you don't think, Antoine, that putting non-7-bit data into the supposedly 7 bit ascii unicode data can cause an actual crash, but at worst a corrupt string, then I'm quite happy, personally :)

    @serhiy-storchaka
    Copy link
    Member

    The marshal module is not intended to be secure against erroneous or maliciously constructed data. Never unmarshal data received from an untrusted or unauthenticated source.

    Then we can simplify the marshal module by dropping all error handling: f.read() returned not bytes, read() returned too much data, EOF read where not expected, recursion limit exceeded, long/string/unicode/tuple/list/set size out of range, unnormalized long data, digit out of range in long, index list too large, invalid reference, unknown type code, NULL object in marshal data for set, UTF8 decoding errors, string to float converting errors, etc, etc. Sorry for sarcasm.

    It's exactly what you suggest: reuse PyUnicode_AsUTF8String().

    Actually _PyUnicode_UTF8(). PyUnicode_AsUTF8String() creates UTF8 cache if it is not exists and this can be not desired. We could use this optimization in many other places, in particular in PyUnicode_AsUTF8String() itself.

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 11, 2013

    Sorry for sarcasm.

    Well, indeed, the sarcasm is undeserved here, if the interpreter cannot
    crash because of the change.

    > It's exactly what you suggest: reuse PyUnicode_AsUTF8String().

    Actually _PyUnicode_UTF8(). PyUnicode_AsUTF8String() creates UTF8
    cache if it is not exists and this can be not desired. We could use
    this optimization in many other places, in particular in
    PyUnicode_AsUTF8String() itself.

    I don't understand how _PyUnicode_UTF8() can be used for *unmarshalling*.

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 11, 2013

    That said, I'll try out the patch with _PyUnicode_FromUCS1 instead of _PyUnicode_FromASCII, to see if it affects performance.

    @kristjanvalur
    Copy link
    Mannequin

    kristjanvalur mannequin commented Oct 11, 2013

    Just for the record, I want to say that this is great stuff, Antoine! It's great when this sort of stuff gets some attention.

    @serhiy-storchaka
    Copy link
    Member

    I don't understand how _PyUnicode_UTF8() can be used for *unmarshalling*.

    I say about marshalling.

    That said, I'll try out the patch with _PyUnicode_FromUCS1 instead of _PyUnicode_FromASCII, to see if it affects performance.

    Could you try out the patch with PyUnicode_DecodeUTF8()? This will save you two opcodes and perhaps several lines of code.

    @kristjanvalur
    Copy link
    Mannequin

    kristjanvalur mannequin commented Oct 11, 2013

    "This will save you two opcodes and perhaps several lines of code. "
    Just bear in mind that without other changes, version 4 needs to be backwards compatible with version 3.
    I ran into this when developing version 3. The reason is that while the marshal format includes the version information in its header, it isn't actually verified on loading. IIRC. You specify the expected format to the function, or something like that. So, if you don't do this, you get errors when loading previously generated .pyc files.
    Of course, we are free to fix that problem as well :)

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 11, 2013

    > That said, I'll try out the patch with _PyUnicode_FromUCS1 instead
    > of _PyUnicode_FromASCII, to see if it affects performance.

    Could you try out the patch with PyUnicode_DecodeUTF8()? This will
    save you two opcodes and perhaps several lines of code.

    That would be a change in behaviour, since currently "surrogatepass"
    is the error handler.

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 11, 2013

    I ran into this when developing version 3. The reason is that while
    the marshal format includes the version information in its header,
    it isn't actually verified on loading. IIRC. You specify the
    expected format to the function, or something like that. So, if you
    don't do this, you get errors when loading previously generated .pyc
    files.

    Exactly (I also tried this :-)). The problem is the version number
    is *outside* of the marshal format: e.g. it's in the pyc file header.
    When freezing a module (see frozen.c), the version number isn't included.
    So our freedom is quite limited here: we have to support legacy frozen
    modules.

    @kristjanvalur
    Copy link
    Mannequin

    kristjanvalur mannequin commented Oct 11, 2013

    How about adding a version opcode? This is a backwards compatible change, and allows us to reject unsupported versions in the future, as long as they are not very old unsupported versions.

    @serhiy-storchaka
    Copy link
    Member

    "This will save you two opcodes and perhaps several lines of code. "
    Just bear in mind that without other changes, version 4 needs to be backwards compatible with version 3.

    I meant two of new proposed opcodes: TYPE_ASCII and TYPE_ASCII_INTERNED.

    @serhiy-storchaka
    Copy link
    Member

    That would be a change in behaviour, since currently "surrogatepass"
    is the error handler.

    I don't propose any change in behaviour.

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 11, 2013

    I meant two of new proposed opcodes: TYPE_ASCII and
    TYPE_ASCII_INTERNED.

    You cannot change the meaning of TYPE_UNICODE (it uses "surrogatepass").
    Therefore, you have to use new opcodes with other semantics.

    Besides, opcodes are cheap.

    @serhiy-storchaka
    Copy link
    Member

    You cannot change the meaning of TYPE_UNICODE (it uses "surrogatepass").
    Therefore, you have to use new opcodes with other semantics.

    You don't need new semantic. Use old semantic. The set of ASCII encoded strings is a subset of valid UTF8 encoded strings, which is a subset of UTF8 encoded with "surrogatepass" strings.

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 11, 2013

    > You cannot change the meaning of TYPE_UNICODE (it uses
    > "surrogatepass").
    > Therefore, you have to use new opcodes with other semantics.

    You don't need new semantic. Use old semantic. The set of ASCII
    encoded strings is a subset of valid UTF8 encoded strings, which is
    a subset of UTF8 encoded with "surrogatepass" strings.

    Sorry, I don't understand you.

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 11, 2013

    Updated patch using PyUnicode_FromKindAndData for ASCII strings, to ensure that corrupt marshal bytecode doesn't provide corrupt unicode objects. Performance is within 2% of the previous patch.

    (however, a quick test suggests that PyUnicode_DecodeUTF8 is quite slower)

    @serhiy-storchaka
    Copy link
    Member

    Let a code say instead me.

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 11, 2013

    Let a code say instead me.

    I don't understand your patch. The macros you define aren't used
    anywhere.
    Also, your patch keeps the slow call to PyUnicode_DecodeUTF8.

    @vstinner
    Copy link
    Member

    (however, a quick test suggests that PyUnicode_DecodeUTF8 is quite slower)

    It's surprising that PyUnicode_DecodeUTF8() is quite slower than _PyUnicode_FromUCS1(). _PyUnicode_FromUCS1() calls ucs1lib_find_max_char() and then memcpy(). PyUnicode_DecodeUTF8() first tries ascii_decode() which is very similar than ucs1lib_find_max_char().

    The difference is maybe that _PyUnicode_FromUCS1() copies all bytes at once (memcpy()), whereas ascii_decode() copies bytes while if the string is ASCII or not.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 12, 2013

    New changeset 4059e871e74e by Antoine Pitrou in branch 'default':
    Issue bpo-19219: Speed up marshal.loads(), and make pyc files slightly (5% to 10%) smaller.
    http://hg.python.org/cpython/rev/4059e871e74e

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 12, 2013

    I've now committed the latest patch (marshal_opts5.patch).

    @pitrou pitrou closed this as completed Oct 12, 2013
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 13, 2013

    New changeset 2a2b339b6b59 by Christian Heimes in branch 'default':
    Issue bpo-19219: retval may be used uninitialized value
    http://hg.python.org/cpython/rev/2a2b339b6b59

    @kristjanvalur
    Copy link
    Mannequin

    kristjanvalur mannequin commented Oct 13, 2013

    In this case, we can remove a bunch of 'retval = NULL' from the code.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants