Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support object instancing and recursion in marshal #60679

Closed
kristjanvalur mannequin opened this issue Nov 15, 2012 · 47 comments
Closed

Support object instancing and recursion in marshal #60679

kristjanvalur mannequin opened this issue Nov 15, 2012 · 47 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@kristjanvalur
Copy link
Mannequin

kristjanvalur mannequin commented Nov 15, 2012

BPO 16475
Nosy @loewis, @gpshead, @amauryfa, @mdickinson, @pitrou, @kristjanvalur, @tiran, @benjaminp, @ezio-melotti, @serhiy-storchaka
Files
  • marshalinstance.patch
  • marshalinstance.patch
  • marshalinstance.patch
  • marshalinstance.patch
  • marshalinstance.patch
  • marshalstat.py: Quick and dirty marshal statistics tool
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2014-04-04.11:47:25.028>
    created_at = <Date 2012-11-15.08:54:38.446>
    labels = ['interpreter-core', 'type-feature']
    title = 'Support object instancing and recursion in marshal'
    updated_at = <Date 2014-04-04.13:25:37.673>
    user = 'https://github.com/kristjanvalur'

    bugs.python.org fields:

    activity = <Date 2014-04-04.13:25:37.673>
    actor = 'kristjan.jonsson'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-04-04.11:47:25.028>
    closer = 'kristjan.jonsson'
    components = ['Interpreter Core']
    creation = <Date 2012-11-15.08:54:38.446>
    creator = 'kristjan.jonsson'
    dependencies = []
    files = ['27987', '27988', '28047', '28048', '28052', '28055']
    hgrepos = []
    issue_num = 16475
    keywords = ['patch']
    message_count = 47.0
    messages = ['175603', '175606', '175632', '175672', '175690', '175695', '175799', '175823', '175955', '175959', '175962', '175963', '175964', '175981', '175982', '175983', '175987', '175988', '175989', '175998', '176003', '176005', '176008', '176011', '176013', '176014', '176015', '176016', '176017', '176018', '176057', '184727', '184738', '184739', '184740', '184754', '184780', '184785', '184809', '184868', '184876', '185142', '185143', '185156', '185193', '185198', '185274']
    nosy_count = 11.0
    nosy_names = ['loewis', 'gregory.p.smith', 'amaury.forgeotdarc', 'mark.dickinson', 'pitrou', 'kristjan.jonsson', 'christian.heimes', 'benjamin.peterson', 'ezio.melotti', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue16475'
    versions = ['Python 3.4']

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 15, 2012

    The format used by the marshal module does not support instancing. This precludes certain data optimizations, such as sharing string constants, common tuples, even common code objects. Since the marshal format is used to write compiled code, this makes it impossible to do data optimization on code prior to writing it out.

    This patch adds proper instancing support for all the supported types and increases the default version to three.

    (A separate defect/regression is that interned strings are no longer preserved as was implemented in version 1 of the original 2.x branch. This also negates any interning done at compile time. That is a separate defect.)

    @kristjanvalur kristjanvalur mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Nov 15, 2012
    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 15, 2012

    Second patch which adds the missing internment support for strings, including unittests.

    @pitrou
    Copy link
    Member

    pitrou commented Nov 15, 2012

    marshal is only supposed to be used to serialize code objects, not arbitrary user data. Why don't you use pickle?

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 16, 2012

    This change is specifically aimed at code objects.
    As it is, it is impossible to produce code objects that share common data (e.g. filename strings, common tuples, name strings, etc) that don't unserialize to separate objects.

    Also, separately but related (see second patch) the effort spent in interning names when compiling, is lost when code objects are loaded back from disk.

    This change is based on work done at CCP to reduce the size of compiled code in memory. Simple preprocessing of code objects prior to writing them to disk can result in important memory savings.

    @pitrou
    Copy link
    Member

    pitrou commented Nov 16, 2012

    This change is specifically aimed at code objects.
    As it is, it is impossible to produce code objects that share common
    data (e.g. filename strings, common tuples, name strings, etc) that
    don't unserialize to separate objects.

    Shouldn't strings be interned when the code object is unmarshalled?

    This change is based on work done at CCP to reduce the size of
    compiled code in memory. Simple preprocessing of code objects prior
    to writing them to disk can result in important memory savings.

    How important?

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 16, 2012

    Basically, reuse of strings (and preservation of their internment status) fell by the wayside somewhere in the 3.x transition. Strings have been reused, and interned strings re-interned, since protocol version 1 in 2.x. This patch adds that feature back, and uses that mechanism to reuse not only strings, but also any other multiply-referenced object.

    It is not desirable to simply intern all strings that are read from marshaled data. Only selected strings are interned by python during compilation and we want to keep it that way. Also, 2.x reuses not only interned strings but other strings as well.

    Generalizing reuse of strings to other objects is trivial, and a logical step forward. This allows optimizations to be made on code objects where common data are identified and instanced, and those code objects to be saved and reloaded with that instancing intact.

    But even without such code-object optimization, the changes are significant:
    The sizes of the marshaled code object of lib/test/test_marshal drops from 24093 bytes in version 2 to 17841 bytes with version 3, without any additional massaging of the module code object.

    @pitrou
    Copy link
    Member

    pitrou commented Nov 17, 2012

    I agree that restoring the string interning behaviour would be a good thing.
    However, I don't agree that supporting recursive objects and instantiation is useful. marshal is specialized for code objects, and you shouldn't find any recursive constants there.

    As for the size of pyc files, who cares? Memory footprint may be useful to shrink (especially for cache efficiency reasons), but I don't see why we should try to reduce the size of on-disk bytecode. And if we do, it would probably be simpler to zlib-compress them.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Nov 17, 2012

    When I added interning support to marshal, I specifically cared about the size of pyc. I find it sad that this support was thrown out, so I support restoring it.

    I'm also skeptical about general sharing, and would like to see some specific numbers pointing out the gain of such a mechanism (compared to a version that merely preserves interned strings).

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 19, 2012

    If you have string sharing, adding support for general sharing falls automatically out without any effort. There is no reason _not_ to support it, in other words.
    Marshal may be primarily used for .pyc files but it is not the only usage. It is a very fast and powerful serializer for data that is not subject to the overhead or safety concerns of the general pickle protocol. This is illustrated by the following code (2.7):

    case TYPE_CODE:
            if (PyEval_GetRestricted()) {
                PyErr_SetString(PyExc_RuntimeError,
                    "cannot unmarshal code objects in "
                    "restricted execution mode");
    Obviously, this shows that marshal is still expected to work and be useful even if not for pickling code objects.

    It is good to know that you care about the size of the .pyc files, Martin. But we should bear in mind that this size difference is directly reflected in the memory use of the loaded data. A reduction by 25% of the .pyc size is roughly equivalent to a 25% memory use reduction by the loaded code object.

    I haven't produced data about the savings of general object reuse because it relies on my "recode" code optimizer module which is still work in progress. However, I will do some tests and let you know. Suffice to say that it is enormously frustrating to re-generate code objects with an optimization tool, sharing common or identical sub-objects and so on, and then finding that the marshal module undoes all of that.

    I'll report back with additional figures.

    @serhiy-storchaka
    Copy link
    Member

    There is no many sense to use references for TYPE_INT whose representation size not greater then a reference representation size. I doubt about references to mutable objects.

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 19, 2012

    Ok, I did some tests with my recode module. The following are the sizes of the marshal data:

    test2To3 ... 24748 24748 212430 212430
    test3To3 ... 18420 17848 178969 174806
    test4To3 ... 18425 18411 178969 178550

    The columns:
    a) test_marshal.py without transform
    b) test_marshal.py with recode.intern() (folding common objects)
    c) and d): decimal.py module (the largest one in lib)

    The lines:

    1. Version 2 of the protocol.
    2. Version 3 of the protocol (object instancing and the works)
    3. Version 4, an dummy version that only instances strings)

    As expected, there is no difference between version 3 and 4 unless I employ the recode module to fold common subobjects. This brings an additional saving of some 3% bringing the total reduction up to 28% and
    18% respectively.

    Note that the transform is a simple recursive folding of objects. common argument lists, such as (self) are subject to this. No renaming of local variables or other stripping is performed.
    So, although the "recode" module is work in progress, and not the subject of this "defect", its use shows how it is important to be able to support proper instancing in serialization protocols.

    Implementation note: The trick of using a bit flag on the type to indicate a slot reservation in the instance list is one that has been in use in CCP´s own "Marshal" format, a proprietary serialization format based on marshal back in 2002 (adding many more special opcodes and other stuff)

    Serhiy: There is no reason _not_ to reuse INT objects if we are doing it for other immutables to. As you note, the size of the data is the same. This will ensure that integers that are not cached can be folded into the same object, e.g. the value 123, if used in two functions, can be the same int object.

    I should also point out that the marshal protocol takes care to be able to serialize lists, sets and frozensets correctly, the latter being added in version 2.4. This despite the fact that code objects don't make use of these.

    @serhiy-storchaka
    Copy link
    Member

    The following are the sizes of the marshal data:

    Can you please measure the time of unmarshalling? It would be interesting. If you can count the statistics about marshalled types (what percent of shared and non shared integers, strings, etc), it would also be very interesting.

    There is no reason _not_ to reuse INT objects if we are doing it for other immutables to.

    There is at least one reason. This increases size of the refs table.

    @pitrou
    Copy link
    Member

    pitrou commented Nov 19, 2012

    I should also point out that the marshal protocol takes care to be
    able to serialize lists, sets and frozensets correctly, the latter
    being added in version 2.4. This despite the fact that code objects
    don't make use of these.

    Code objects do use frozensets:

    >>> def f(x):
    ...     return x in {1,2,3,4,5,6}
    ... 
    >>> dis.dis(f)
      2           0 LOAD_FAST                0 (x) 
                  3 LOAD_CONST               7 (frozenset({1, 2, 3, 4, 5, 6})) 
                  6 COMPARE_OP               6 (in) 
                  9 RETURN_VALUE         

    I don't think marshal supports any type that isn't (or hasn't been)
    used in code objects.

    Obviously, this shows that marshal is still expected to work and be
    useful even if not for pickling code objects.

    The module officially intended for general-purpose serialization is
    pickle; if you use marshal for such a purpose, you're on your own.
    If you think pickle is not good enough, your improvements are welcome.

    As expected, there is no difference between version 3 and 4 unless
    I employ the recode module to fold common subobjects. This brings
    an additional saving of some 3% bringing the total reduction up to
    28% and 18% respectively.

    3% doesn't sound like a worthwhile improvement at all.

    The trick of using a bit flag on the type to indicate a slot
    reservation in the instance list is one that has been in use in
    CCP´s own "Marshal" format, a proprietary serialization format
    based on marshal back in 2002 (adding many more special opcodes
    and other stuff)

    Why don't you release your "proprietary marshal" on pypi? You would
    get feedback and a sense of whether people are interested in your
    approach.

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 20, 2012

    Antoine, I understand that _you_ may not see any need for object references in marshal streams. Also, I am not going to try to convince you it is a good idea, since I have long figured out that you are against any contributions from me on some sort of principle.

    However, even if you cannot agree that it is a good idea, can you explain to me how it is a BAD idea? How can expanding object references to strings to all objects, using the same mechanism, be bad? How can it be bad to make the marshal format more complete with minimal effort? Keep in mind that this change removes a number of warnings and caveats present both in the documentation and the in-line comments.

    @mdickinson
    Copy link
    Member

    I have long figured out that you are against any contributions from
    me on some sort of principle.

    I suspect that Antoine's principles have very little to do with *who* the contributions originate from, and much more to do with the content of those contributions.

    We've all got the same overarching goal of improving and maintaining the quality of Python. Please can we not make this personal?

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 20, 2012

    Serhiy, to answer your questions:

    Can you please measure the time of unmarshalling?
    Sure. Normally, I wouldn't consider time to be important here because for code objects, unmarshaling is done only once per run However, testing it is simple:
    I added timeit(number=100) for unmarshaling of the different version code.
    0.09614 0.09605 0.582672 0.553599
    0.05590 0.05628 0.293370 0.285517
    0.05703 0.05754 0.294601 0.292070
    The lines are version 2, 3, and 4(*) respectively. The columns are test_marshal.py, recode.intern(test_marshal.py), decimal.py and recode.intern(decimal.py)

    As you see, loading time is almost halfed with version 3 and 4 compared to 2. Version 3 is also slightly faster than 4
    (*)verion 4 is a "special" version that only instances strings.

    If you can count the statistics about marshalled types
    (what percent of shared and non shared integers, strings, etc),
    it would also be very interesting.
    That's more tricky, at least on a type=by-type basis, but I could do a global object count. Later.

    There is at least one reason. This increases size of the refs table.
    I checked this, by printing out the size of the instance list when loads() was done:
    457 1571
    297 1163
    429 1539
    The columns are test_marshal.py and decimal.py
    the lines are version 3, version 4( only strings ) and special version5 which is like 3 but omits ints.
    As you see, the ints correspond to roughly 6% and 2% of the instances respectively. The bulk of the list is taken up by strings (65% and 74%)

    This shows that adding instancing of all other types on top of the strings does not typically expand the instance list more than 50%

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 20, 2012

    New patch, incorporating suggested fixes from review.

    @serhiy-storchaka
    Copy link
    Member

    Thank you, Kristján, for the statistics. It makes your proposition more attractive.

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 20, 2012

    New patch with changes.

    @serhiy-storchaka
    Copy link
    Member

    Personally I don't like the use of macros inn this code. I think that without them the code would be clearer.

    If anyone is interested, here are the statistics for all the standard modules (Lib/__pycache__/*.pyc).

    UNICODE 105248 61%
    TUPLE 33193 19%
    STRING 13527 7.8%
    INT 7373 4.3%
    CODE 6438 3.7%
    NONE 5291 3.1%
    TRUE 474 0.27%
    FALSE 401 0.23%
    BINARY_FLOAT 340 0.2%
    LONG 60 0.035%
    FROZENSET 20 0.012%

    Strings (unicode and bytes), tuples, short ints, code objects and None in sum are 99% of all objects. Mutable collections, complex numbers, Ellipsis and StopIteration are not used at all.

    If size of compiled modules is a problem, we can get about 10% by using more compact representation for sizes (1- or 2-bytes). This requires additional codes for strings and collections (at least for unicode strings and tuples).

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 20, 2012

    Changes as suggested by Serhiy

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 20, 2012

    The size of the .pyc files is secondary. The size that is important is the memory footprint of loaded code objects, which can be done by stripping and folding code objects.
    This works springs out of work for embedding python on the PS3 console where every byte counts.

    @pitrou
    Copy link
    Member

    pitrou commented Nov 20, 2012

    However, even if you cannot agree that it is a good idea, can you
    explain to me how it is a BAD idea? How can expanding object
    references to strings to all objects, using the same mechanism, be
    bad?

    It is a bad idea because features have to be supported in the long-term,
    which means more maintenance effort. So, basically, this is the same
    reason we don't accept every feature request + patch that gets posted
    to the tracker.

    And, again, I think the string interning part is a good thing.

    @pitrou
    Copy link
    Member

    pitrou commented Nov 20, 2012

    By the way, please follow PEP-8 in test_marshal.py.

    @serhiy-storchaka
    Copy link
    Member

    Here is the statistics for all pyc-files (not only in Lib/__pycache__). This includes encoding tables and tests. I count also memory usage for some types (for tuples shared size is estimated upper limit).

    type count % size shared %

    UNICODE 622812 58% 26105085 14885090 57%
    TUPLE 224214 21% 8184848 3498300 43%
    STRING 90992 8.4% 6931342 832224 12%
    INT 52087 4.8% 715400 58666 8.2%
    CODE 42147 3.9% 2865996 0 0%
    NONE 39777 3.7%
    BINARY_FLOAT 3120 0.29%
    TRUE 2363 0.22%
    FALSE 1976 0.18%
    LONG 1012 0.094%
    ELLIPSIS 528 0.049%
    BINARY_COMPLEX 465 0.043%
    FROZENSET 24 0.0022%

    Total 1081517 100% 44802671 19274280 ~43%

    Therefore there is a sense to share unicode objects, tuples, and may be bytes objects. Most integers (in range -5..257) already interned. None of code objects can be shared (because code object contains almost unique first line number). Floats, complexes and frozensets unlikely save much of memory.

    @pitrou
    Copy link
    Member

    pitrou commented Nov 20, 2012

    Here is the statistics for all pyc-files (not only in
    Lib/__pycache__). This includes encoding tables and tests. I count
    also memory usage for some types (for tuples shared size is estimated
    upper limit).

    Did you examine the sharing per file or among all files?

    @serhiy-storchaka
    Copy link
    Member

    Per file. With total sharing:

    UNICODE 622812 58% 26105085 18262484 70%
    TUPLE 224214 21% 8184848 4007404 49%
    STRING 90992 8.4% 6931342 1361618 20%
    INT 52087 4.8% 715400 95666 13%
    CODE 42147 3.9% 2865996 0 0%

    @serhiy-storchaka
    Copy link
    Member

    Total size of all *.pyc files is 22 MB.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Nov 20, 2012

    Am 20.11.12 17:32, schrieb Kristján Valur Jónsson:

    The size of the .pyc files is secondary.

    This really depends on whom you ask. When I did the string interning
    support, the primary concern *was* for the size of the pyc files, and
    there *was* a real project where it mattered (namely, code size in
    a frozen application). It may be secondary to *you* and *now*.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Nov 20, 2012

    Am 20.11.12 18:02, schrieb Antoine Pitrou:

    It is a bad idea because features have to be supported in the long-term,
    which means more maintenance effort. So, basically, this is the same
    reason we don't accept every feature request + patch that gets posted
    to the tracker.

    For marshal, this actually is of less concern - we are free to change it
    whenever we please (and people actually did change it when they
    pleased).

    Of course, there still must be a demonstrated gain, and that must be
    significant enough to justify the size of the change (in diffstat).

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 21, 2012

    Code objects can indeed be shared.
    One thing that the "recode" module does, or allows you to do, is to strip file and line number information from code objects. This will theoretically allow them to be collapsed.

    Martin, I agree the .pyc size matters. You are right, priorities vary. I am mainly focused on memory use, while others may be looking at disk use. Disk use can of course be reduced by using tools like zip. And code objects can be re-optimized at load time too using special importers. But it is nice to be able to achieve both objectives by enabling the marshal format to preserve those optimizations that are performed on it prior to saving it.

    I'm currently working on the recode module. When its done, I'll report back and share it with you so that you can toy around with it.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 20, 2013

    New changeset 01372117a5b4 by Kristján Valur Jónsson in branch 'default':
    Issue bpo-16475: Support object instancing, recursion and interned strings
    http://hg.python.org/cpython/rev/01372117a5b4

    @benjaminp
    Copy link
    Contributor

    I don't understand some of this code. Why does r_ref_reserve take a first parameter which it just returns on success without using?

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Mar 20, 2013

    I thought I had explained this already, but can't find it, so here is the explanation:
    It matches the pattern of r_ref(), which semantically is a combination of r_ref_register() and r_ref_insert().

    It is a convenence calling pattern because these functions will either:
    a) pass a NULL operand through,
    b) succeed and pass the non-NULL operatn through
    c) fail, and release the operand and return NULL.
    In all cases, it is sufficient to look at the value of the operand after this transformation to know if everything was allright. This simplifies all the calling sites where these things are used, e.g. by the R_REF() macro.

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Mar 20, 2013

    I should add comments explaining this to the file.

    @ezio-melotti
    Copy link
    Member

    I'm getting two failures after this:
    ======================================================================
    ERROR: testRaising (test.test_exceptions.ExceptionTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/wolf/dev/py/py3k/Lib/test/test_exceptions.py", line 51, in testRaising
        marshal.loads(b'')
    ValueError: bad marshal data (unknown type code)

    and

    ======================================================================
    ERROR: test_no_marshal (test.test_importlib.source.test_file_loader.SourceLoaderBadBytecodeTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/wolf/dev/py/py3k/Lib/test/test_importlib/source/util.py", line 23, in wrapper
        to_return = fxn(*args, **kwargs)
      File "/home/wolf/dev/py/py3k/Lib/test/test_importlib/source/test_file_loader.py", line 364, in test_no_marshal
        self._test_no_marshal()
      File "/home/wolf/dev/py/py3k/Lib/test/test_importlib/source/test_file_loader.py", line 265, in _test_no_marshal
        self.import_(file_path, '_temp')
      File "/home/wolf/dev/py/py3k/Lib/test/test_importlib/source/test_file_loader.py", line 194, in import_
        module = loader.load_module(module_name)
      File "<frozen importlib._bootstrap>", line 572, in _check_name_wrapper
      File "<frozen importlib._bootstrap>", line 1032, in load_module
      File "<frozen importlib._bootstrap>", line 1013, in load_module
      File "<frozen importlib._bootstrap>", line 548, in module_for_loader_wrapper
      File "<frozen importlib._bootstrap>", line 869, in _load_module
      File "<frozen importlib._bootstrap>", line 990, in get_code
      File "<frozen importlib._bootstrap>", line 668, in _compile_bytecode
    ValueError: bad marshal data (unknown type code)

    ======================================================================
    ERROR: test_no_marshal (test.test_importlib.source.test_file_loader.SourcelessLoaderBadBytecodeTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/wolf/dev/py/py3k/Lib/test/test_importlib/source/test_file_loader.py", line 470, in test_no_marshal
        self._test_no_marshal(del_source=True)
      File "/home/wolf/dev/py/py3k/Lib/test/test_importlib/source/test_file_loader.py", line 265, in _test_no_marshal
        self.import_(file_path, '_temp')
      File "/home/wolf/dev/py/py3k/Lib/test/test_importlib/source/test_file_loader.py", line 194, in import_
        module = loader.load_module(module_name)
      File "<frozen importlib._bootstrap>", line 1099, in load_module
      File "<frozen importlib._bootstrap>", line 548, in module_for_loader_wrapper
      File "<frozen importlib._bootstrap>", line 869, in _load_module
      File "<frozen importlib._bootstrap>", line 1105, in get_code
      File "<frozen importlib._bootstrap>", line 668, in _compile_bytecode
    ValueError: bad marshal data (unknown type code)

    @pitrou
    Copy link
    Member

    pitrou commented Mar 20, 2013

    Indeed, would be nice to fix the test failures.

    Besides, it would be extra nice if you could run the test suite *before* pushing your changes. Otherwise you're wasting everyone else's time.

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Mar 20, 2013

    This should not have happened and it was indeed all tested. I'll investigate why these errors are happening.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 20, 2013

    New changeset f4c21179690b by Kristján Valur Jónsson in branch 'default':
    Issue bpo-16475: Simplify the interface to r_ref_allocate and improve comments.
    http://hg.python.org/cpython/rev/f4c21179690b

    New changeset 42bf74b90626 by Kristján Valur Jónsson in branch 'default':
    Issue bpo-16475 : Correctly handle the EOF when reading marshal streams.
    http://hg.python.org/cpython/rev/42bf74b90626

    @ezio-melotti
    Copy link
    Member

    FTR the two failures I saw earlier are now gone.

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Mar 21, 2013

    Yes, they were fixed with #42bf74b90626 which also added unittests in test_marshal.py to make sure invalid EOFs are always caught.

    @amauryfa
    Copy link
    Member

    Sorry, what does "instancing" mean?
    And does this change bring interesting features?
    And is there an impact on regular .pyc files?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 24, 2013

    Sorry, what does "instancing" mean?

    He means "keeping track of instance identities", so that objects
    that were shared before marshal continue to be shared after loading.

    And does this change bring interesting features?

    "interesting" to whom?

    And is there an impact on regular .pyc files?

    Definitely. It may now contain 't' (interned) codes again, and it may contain 'r' (reference) codes. The size of the pyc files may decrease.

    @amauryfa
    Copy link
    Member

    The size of the pyc files may decrease

    This is very good news! Indeed, I noticed decimal.cpython-34.pyc going from 212k to 178k. 17% less!
    This is worth an entry in whatsnew/3.4.rst IMO.

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Mar 25, 2013

    Thanks, Martin, for clarifying this.
    Glad you are seeing improvements Amaury. In fact, large modules can get additional improvements by identifying identical generated constructs from the compiler, like tuples and strings. The compiler isn't very good at this to begin with. I should probably work some more on my "recode" module and make it public.

    I am unsure about how whatsnew is handled these days. Is it incrementally updated or managed by someone?

    @ezio-melotti
    Copy link
    Member

    I am unsure about how whatsnew is handled these days.
    Is it incrementally updated or managed by someone?

    It's better to add at least a stub to the whatsnew, even if someone will eventually go through it before the release.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 26, 2013

    New changeset 84e73ace3d7e by Kristjan Valur Jonsson in branch 'default':
    Issue bpo-16475: Add a whatsnew entry for 3.4
    http://hg.python.org/cpython/rev/84e73ace3d7e

    @kristjanvalur kristjanvalur mannequin closed this as completed Apr 4, 2014
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants