Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support unknown formats in memoryview comparisons #59778

Closed
skrah mannequin opened this issue Aug 7, 2012 · 63 comments
Closed

Support unknown formats in memoryview comparisons #59778

skrah mannequin opened this issue Aug 7, 2012 · 63 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-bug An unexpected behavior, bug, or error

Comments

@skrah
Copy link
Mannequin

skrah mannequin commented Aug 7, 2012

BPO 15573
Nosy @loewis, @birkenfeld, @mdickinson, @ncoghlan, @pitrou, @vstinner, @tiran, @skrah, @meadori
Files
  • format.c
  • issue15573.diff
  • abstractcmp.diff
  • issue15573-struct.diff: full struct module support for comparisons
  • issue15573-struct-2.diff
  • memoryobject.c.gcov
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/skrah'
    closed_at = <Date 2012-08-29.17:09:27.233>
    created_at = <Date 2012-08-07.11:55:30.289>
    labels = ['interpreter-core', 'type-bug', 'release-blocker']
    title = 'Support unknown formats in memoryview comparisons'
    updated_at = <Date 2012-11-02.17:07:02.101>
    user = 'https://github.com/skrah'

    bugs.python.org fields:

    activity = <Date 2012-11-02.17:07:02.101>
    actor = 'python-dev'
    assignee = 'skrah'
    closed = True
    closed_date = <Date 2012-08-29.17:09:27.233>
    closer = 'skrah'
    components = ['Interpreter Core']
    creation = <Date 2012-08-07.11:55:30.289>
    creator = 'skrah'
    dependencies = []
    files = ['26720', '26725', '26766', '26800', '26831', '26992']
    hgrepos = []
    issue_num = 15573
    keywords = ['patch']
    message_count = 63.0
    messages = ['167618', '167621', '167655', '167687', '167838', '167844', '167855', '167856', '167860', '167862', '167884', '167892', '167899', '167902', '167908', '167911', '167912', '167930', '167946', '167981', '167982', '167984', '167987', '167990', '167991', '167993', '167994', '168000', '168002', '168003', '168004', '168006', '168010', '168028', '168029', '168031', '168032', '168188', '168202', '168257', '168301', '168440', '168559', '168560', '168564', '169023', '169085', '169106', '169107', '169108', '169109', '169110', '169111', '169112', '169113', '169114', '169115', '169347', '169355', '169388', '169394', '169399', '174541']
    nosy_count = 11.0
    nosy_names = ['loewis', 'georg.brandl', 'mark.dickinson', 'ncoghlan', 'pitrou', 'vstinner', 'christian.heimes', 'Arfrever', 'skrah', 'meador.inge', 'python-dev']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue15573'
    versions = ['Python 3.3']

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 7, 2012

    Continuing the discussion from bpo-13072. I hit a snag here:

    Determining in full generality whether two format strings describe
    identical items is pretty complicated, see also bpo-3132.

    I'm attaching a best effort fmtcmp() function that should do the
    following:

    • recognize byte order specifiers at the start of the string.

    • recognize if an explicitly specified byte order happens to
      match the native byte order.

    It won't catch:

    • byte order specifiers anywhere in the string.

    • C types that happen to be identical ('I', 'L' on a 32-bit
      platform). I'm also not sure if that is desirable in the
      first place.

    • ???

    So fmtcmp() will return false negatives (not equal), but should be
    correct for *most* format strings that are actually in use.

    Mark, Meador: You did a lot of work on the struct module and of
    course on issue bpo-3132. Does this look like a reasonable compromise?
    Did I miss obvious cases (see attachment)?

    @skrah skrah mannequin added the release-blocker label Aug 7, 2012
    @skrah skrah mannequin self-assigned this Aug 7, 2012
    @skrah skrah mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Aug 7, 2012
    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Aug 7, 2012

    I confess I was thinking of an even simpler "format strings must be identical" fallback, but agree your way is better, since it reproduces the 3.2 behaviour in many more cases where ignoring the format string actually did the right thing.

    The struct docs for the byte order specifier specifically say "the first character of the format string can be used to indicate the byte order, size and alignment of the packed data", so treating format strings that include byte order markers elsewhere in the string as differing from each other if those markers are in different locations sounds fine to me.

    @meadori
    Copy link
    Member

    meadori commented Aug 8, 2012

    I agree that the general case is complicated. It will get even more complicated if the full of PEP-3118 gets implemented since it turns into a tree comparison. In general, I think you will probably have to compute some canonical form and then compare the canonical forms.

    Here are a few more cases that don't work out in the attached algorithm:

    1. Repeat characters - '2c' == 'cc'
    2. Whitespace - 'h h' == 'hh'

    Also, currently the byte order specifiers are always at the beginning of the string. We discussed in bpo-3132 scoping them per the nested structures, but decided to drop that unless somebody barks about it since it is fairly complicated without a clear benefit. So, I wouldn't worry about them being scattered through the string.

    This seems like sort of a slippery slope. I need to think about it more, but my first impression is that coming up with some way to compare format strings is going to be nasty.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 8, 2012

    Right, byte order specifiers are always at the beginning of the string.
    That is at least something. I wonder if we should tighten PEP-3118 to
    demand a canonical form of format strings, such as (probably incomplete):

    • Whitespace is disallowed.

    • Except for 's', no zero count may be given.

    • A count of 1 (redundant) is disallowed.

    • Repeats must be specified in terms of count + single char.

    That still leaves the '=I' != '=L' problem. Why are there two
    specifiers describing uint32_t?

    Anyway, as Meador says, this can get tricky and I don't think this
    can be resolved before beta-2. I'm attaching a patch that should
    behave well for the restricted canonical form at least.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 9, 2012

    Can someone please explain why this is a release blocker? What is the specific regression from 3.2 that this deals with?

    @vstinner
    Copy link
    Member

    vstinner commented Aug 9, 2012

    What is the specific regression from 3.2 that this deals with?

    I don't know if it must be called a regression, but at least the behaviour is different in Python 3.2 and 3.3. For example, an Unicode array is no more equal to its memoryview:

    Python 3.3.0b1 (default:aaa68dce117e, Aug  9 2012, 22:45:00)                                                                                                   
    [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux                                                                                                                
    Type "help", "copyright", "credits" or "license" for more information.                                                                                         
    >>> import array                                                                                                                                               
    >>> a=array.array('u', 'abc')
    >>> v=memoryview(a)                                                                                                                                            
    >>> a == v                                                                                                                                                     
    False                                                                                                                                                          
    
    ned$ python3
    Python 3.2.3 (default, Jun  8 2012, 05:40:07)                                                                                                                  
    [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux2                                                                                                               
    Type "help", "copyright", "credits" or "license" for more information.                                                                                         
    >>> import array                                                                                                                                               
    >>> a=array.array('u', 'abc')
    >>> v=memoryview(a)                                                                                                                                            
    >>> a == v                                                                                                                                                     
    True

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 10, 2012

    haypo: thanks for stating the issue.

    ISTM that this classifies as an "obscure" bug: you have to use memoryviews, and you need to compare them for equality, and the comparison needs to be "non-trivial", where "trivial" is defined
    by "both are 1D byte arrays".

    While this is a bug, I think it still can be fixed in a bug fix
    release of 3.3, so un-blocking.

    I also think that as a first step, a specification needs to be
    drafted defining when exactly a memory view should compare equal
    with some other object. I can easily provide a specification that
    makes the current implementation "correct".

    @loewis loewis mannequin removed the release-blocker label Aug 10, 2012
    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 10, 2012

    I can easily provide a specification that makes the current implementation "correct"

    Yes, the current specification is: memoryview only attempts to
    compare arrays with known (single character native) formats and
    returns "not equal" otherwise.

    The problem is that for backwards compatibility memoryview accepts
    arrays with arbitrary format strings. In operations like tolist()
    it's possible to raise NotImplemented, but for equality comparisons
    that's not possible.

    Note that in 3.2 memoryview would return "equal" for arrays that
    simply aren't equal, if those arrays happen to have the same bit
    pattern.

    One way to deal with this is to demand a strict canonical form
    of format strings for PEP-3118, see msg167687.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 10, 2012

    Note that in 3.2 memoryview would return "equal" for arrays that
    simply aren't equal, if those arrays happen to have the same bit
    pattern.

    This is exactly my point. If a "memoryview" has the same "memory"
    as another, why are they then not rightfully considered equal?
    IOW, the 3.2 behavior looks fine to me.

    You apparently have a vision that equality should mean something
    different for memoryviews - please explicitly state what that
    view is. A mathematical definition ("two memoryviews A and B
    are equal iff ...") would be appreciated.

    One way to deal with this is to demand a strict canonical form
    of format strings for PEP-3118, see msg167687.

    You are talking about the solution already - I still don't know
    what the problem is exactly (not that I *need* to understand
    the problem, but at a minimum, the documentation should state
    what the intended behavior is - better if people would also
    agree that the proposed behavior is "reasonable").

    For 3.3, I see two approaches: either move backwards to the
    3.2 behavior and defer this change to 3.4 - this would make
    it release-critical indeed. Or move forward to a yet-to-be
    specified equality operation which as of now may not be
    implemented correctly, treating any improvement to it
    as a bug fix.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 10, 2012

    PEP-3118 specifies strongly typed multi-dimensional arrays. The existing
    code in the 3.2 memoryview as well as numerous comments by Travis Oliphant
    make it clear that these capabilities were intended from the start for
    memoryview as well.

    Perhaps the name "memoryview" is a misnomer, since the boundaries between
    memoryview and NumPy's ndarray become blurry. In fact, the small
    implementation of an ndarray in Modules/_testbuffer.c is *also* a memoryview
    in some sense, since it can grab a buffer from an exporter and expose it in
    the same manner as memoryview.

    So what I implemented is certainly not only *my* vision. The PEP essentially
    describes NumPy arrays with an interchange format to convert between NumPy
    and PIL arrays.

    It is perhaps unfortunate that the protocol was named "buffer" protocol,
    since it is actually an advanced "array" protocol.

    NumPy arrays don't care about the raw memory. It is always the logical array
    that matters. For example, Fortran and C arrays have different bit patterns
    in memory but compare equal, a fact that the 3.2 implementation completely
    misses.

    Arrays v and w are equal iff format and shape are equal and for all valid
    indices allowed by shape

    memcmp((char *)PyBuffer_GetPointer(v, indices),
    (char *)PyBuffer_GetPointer(w, indices),
    itemsize) == 0.

    Equal formats of course imply equal itemsize.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 10, 2012

    Can you please elaborate on the specification:

    1. what does it mean that the formats of v and w are equal?

    2. Victor's clarification about this issue isn't about comparing two arrays, but an array with a string object. So: when is an array equal to some other (non-array) object?

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 10, 2012

    1. what does it mean that the formats of v and w are equal?

    I'm using array and Py_buffer interchangeably since a Py_buffer struct
    actually describes a multi-dimensional array. v and w are Py_buffer
    structs.

    So v.format must equal w.format, where format is a format string in
    struct module syntax. The topic of this issue is to determine under
    what circumstances two strings in struct module syntax are considered
    equal.

    1. Victor's clarification about this issue isn't about comparing
      two arrays, but an array with a string object. So: when is an
      array equal to some other (non-array) object?
    >>> a=array.array('u', 'abc')
    >>> v=memoryview(a)
    >>> a == v
    False

    memoryview can compare against any object with a getbufferproc, in this
    case array.array. memoryview_richcompare() calls PyObject_GetBuffer(other)
    and proceeds to compare its own internal Py_buffer v against the obtained
    Py_buffer w.

    In the case of v.format == w.format the fix for unknown formats is trivial:
    Just allow the comparison using v.itemsize == w.itemsize.

    However, the struct module format string syntax has multiple representations
    for the exact same formats, which makes a general fmtcmp() function tricky
    to write.

    Hence my proposal to demand a strict canonical form for PEP-3118 format
    strings, which would be a proper subset of struct module format strings.

    Example: "1Q 1h 1h 0c" must be written as "Q2h"

    The attached patch should largely implement this proposal. A canonical form
    is perhaps not such a severe restriction, since format strings should usually
    come from the exporting object. E.g. NumPy must translate its own internal
    format to struct module syntax anyway.

    Another option is to commit the patch that misses "1Q 1h 1h 0c" == "Q2h"
    now and aim for a completely general fmtcmp() function later.

    IMO any general fmtcmp() function should also be reasonably fast.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 10, 2012

    So v.format must equal w.format, where format is a format string in
    struct module syntax. The topic of this issue is to determine under
    what circumstances two strings in struct module syntax are considered
    equal.

    And that is exactly my question: We don't need a patch implementing
    it (yet), but a specification of what is to be implemented first.

    I know when two strings are equal (regardless of their syntax):
    if they have the same length, and contain the same characters in
    the same order. Apparently, you have a different notion of "equal"
    for strings in mind, please be explicitly what that notion is.

    memoryview can compare against any object with a getbufferproc, in this
    case array.array. memoryview_richcompare() calls PyObject_GetBuffer(other)
    and proceeds to compare its own internal Py_buffer v against the obtained
    Py_buffer w.

    Can this be expressed on Python level as well? I.e. is it correct
    to say: an array/buffer/memoryview A is equal to an object O iff
    A is equal to memoryview(O)? Or could it be that these two equivalences
    might reasonably differ?

    Hence my proposal to demand a strict canonical form for PEP-3118 format
    strings, which would be a proper subset of struct module format strings.

    Can you kindly phrase this as a specification? Who is demanding what
    from whom?

    Proposal: two format strings are equal if their canonical forms
    are equal strings. The canonical form C of a string S is created by ???

    However, it appears that you may have something different in mind
    where things are rejected/fail to work if the canonical form isn't
    originally provided by somebody (whom?)

    So another Proposal: two format strings are equal iff they are
    in both in canonical form and are equal strings.

    This would imply that a format string which is not in canonical
    form is not equal to any other strings, not even to itself, so
    this may still not be what you want. But I can't guess what it
    is that you want.

    @vstinner
    Copy link
    Member

    Can't we start with something simple (for ptyhon 3.3?), and elaborate
    later? In my specific example, both object have the same format string and
    the same content. So i expect that they are equal.
    Le 10 août 2012 19:47, "Martin v. Löwis" <report@bugs.python.org> a écrit :

    Martin v. Löwis added the comment:

    > So v.format must equal w.format, where format is a format string in
    > struct module syntax. The topic of this issue is to determine under
    > what circumstances two strings in struct module syntax are considered
    > equal.

    And that is exactly my question: We don't need a patch implementing
    it (yet), but a specification of what is to be implemented first.

    I know when two strings are equal (regardless of their syntax):
    if they have the same length, and contain the same characters in
    the same order. Apparently, you have a different notion of "equal"
    for strings in mind, please be explicitly what that notion is.

    > memoryview can compare against any object with a getbufferproc, in this
    > case array.array. memoryview_richcompare() calls
    PyObject_GetBuffer(other)
    > and proceeds to compare its own internal Py_buffer v against the obtained
    > Py_buffer w.

    Can this be expressed on Python level as well? I.e. is it correct
    to say: an array/buffer/memoryview A is equal to an object O iff
    A is equal to memoryview(O)? Or could it be that these two equivalences
    might reasonably differ?

    > Hence my proposal to demand a strict canonical form for PEP-3118 format
    > strings, which would be a proper subset of struct module format strings.

    Can you kindly phrase this as a specification? Who is demanding what
    from whom?

    Proposal: two format strings are equal if their canonical forms
    are equal strings. The canonical form C of a string S is created by ???

    However, it appears that you may have something different in mind
    where things are rejected/fail to work if the canonical form isn't
    originally provided by somebody (whom?)

    So another Proposal: two format strings are equal iff they are
    in both in canonical form and are equal strings.

    This would imply that a format string which is not in canonical
    form is not equal to any other strings, not even to itself, so
    this may still not be what you want. But I can't guess what it
    is that you want.

    ----------


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue15573\>


    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 10, 2012

    Can't we start with something simple (for ptyhon 3.3?), and elaborate
    later?

    Sure: someone would have to make a proposal what exactly that is.
    IMO, the 3.2 definition *was* simple, but apparently it was considered
    too simple. So either Stefan gets to define his view of equality, or
    somebody else needs to make a (specific) counter-proposal.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 10, 2012

    The ideal specification is:

    1. Arrays v and w are equal iff format and shape are equal and for all valid
      indices allowed by shape

      memcmp((char *)PyBuffer_GetPointer(v, indices),
      (char *)PyBuffer_GetPointer(w, indices),
      itemsize) == 0.

    2. Two format strings s and t are equal if canonical(s) == canonical(t).

    End ideal specification.

    Purely to *facilitate* the implementation of a format comparison function,
    I suggested:

    1. An exporter must initialize the format field of a Py_buffer structure
      with canonical(s).

    If *all* exporters obey 3), a format comparison function can simply
    call strcmp(s, t) (after sorting out the byte order specifier).

    Specifically, if x and y are equal, then:

    a) x == memoryview(x) == memoryview(y) == y

    If x and y are equal and exporter x does *not* obey 3), but exporter y does,
    then:

    b) x == memoryview(x) != memoryview(y) == y

    Under rule 3) this would be the fault of exporter x.

    For Python 3.3 it is also possible to state only 1) and 2), with a caveat in
    the documentation that case b) might occur until the format comparison function
    in memoryview implements the reductions to canonical forms.

    The problem is that reductions to canonical forms might get very complicated
    if bpo-3132 is implemented.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 10, 2012

    Martin v. Loewis <report@bugs.python.org> wrote:

    Sure: someone would have to make a proposal what exactly that is.
    IMO, the 3.2 definition *was* simple, but apparently it was considered
    too simple.

    It was simply broken in multiple ways. Example:

    >>> from numpy import *
    >>> x = array([1,2,3,4,5], dtype='B')
    >>> y = array([5,4,3,2,1], dtype='B')
    >>> z = y[::-1]
    >>> 
    >>> x == z
    array([ True,  True,  True,  True,  True], dtype=bool)
    >>> memoryview(x) == memoryview(z)
    False
    Segmentation fault

    I'm not even talking about the segfault here. Note that x == z, but
    memoryview(x) != memoryview(z), because the logical structure is
    not taken into account.

    Likewise, one could construct cases where one array contains a float
    NaN and the other an integer that happens to have the same bit pattern.

    The arrays would not be equal, but their memoryviews would be equal.

    So either Stefan gets to define his view of equality, or
    somebody else needs to make a (specific) counter-proposal.

    The view is defined by the PEP that clearly models NumPy. I'm curious what
    counter-proposal will work with NumPy and PIL.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 11, 2012

    I find Stefan's proposed equality confusing. Why is it based on memcmp? Either it compares memory (i.e. internal representations), or it compares abstract values. If it compares abstract values, then it shouldn't be a requirement that the format strings are equal in any sense. Instead, the resulting values should be equal. So I propose this definition:

    v == w iff v.shape() == w.shape() and v.tolist() == w.tolist()
    if either operation fails with an exception, the objects are not equal

    Of course, the implementation doesn't need to literally call tolist; instead, behaving as-if it had been called is fine. However, as time
    is running out, I would actually propose this to be the implementation
    in 3.3.

    In addition, I would propose to support the 'u' and 'w' codes in tolist, to resolve what Victor says the actual issue is.

    I'm -1 on a definition that involves equivalence of format strings.

    @ncoghlan
    Copy link
    Contributor

    Stefan's proposed definition is correct. Shapes define array types, format
    strings define entry types, then the actual memory contents define the
    value.

    It's not "Stefan's definition of equivalence", it's what a statically typed
    array *means*.

    The 3.2 way is completely broken, as it considers arrays containing
    completely different data as equal if the memory layout happens to be the
    same by coincidence.

    3.3 is currently also broken, as it considers arrays that *do* contain the
    same values to be different.

    Stefan's patch fixes that by checking the shape and format first, and
    *then* checking the value. It's exactly the same as doing an instance check
    in an __eq__ method.

    The requirement for a canonical format is for sanity's sake: the general
    equivalence classes are too hard to figure out.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 11, 2012

    Nick: I still disagree. Would you agree that array.array constitutes a "statically typed array"? Yet

    py> array.array('b',b'foo') == array.array('B',b'foo')
    True
    py> array.array('i',[1,2,3]) == array.array('L', [1,2,3])
    True

    So the array object (rightfully) performs comparison on abstract values, not on memory representation. In Python, a statically typed array still conceptually contains abstract values, not memory blocks (this is also what Stefan asserts for NumPy in msg167862). The static typing only restricts the values you can store in the container, and defines the internal representation on the C level (plus it typically implies a value storage, instead of a reference storage).

    With your and Stefan's proposed semantics, we would get the weird case that for two array.arrays a and b, it might happen that

    a == b and memoryview(a) != memoryview(b)

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 11, 2012

    "it might *still* happen", I should say, since this problem is exactly what Victor says this issue intends to solve (for comparison of two 'u' arrays).

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 11, 2012

    So we have two competing proposals:

    1. Py_buffers are strongly typed arrays in the ML sense (e.g. array of float,
      array of int).

    This is probably what I'd prefer on the C level, imagine a function like
    function like PyBuffer_Compare(v, w).

    Backwards compatibility problem for users who were thinking in terms of
    value comparisons:

    >>> x = array.array('b', [127])
    >>> y = array.array('B', [127])
    >>> x == y
    True
    >>> memoryview(x) == memoryview(y)
    False
    1. Compare by value, like NumPy arrays do:
    >>> x = numpy.array([1, 2, 3], dtype='i')
    >>> y = numpy.array([1, 2, 3], dtype='f')
    >>> x == y
    array([True,  True,  True], dtype=bool)

    I concede that this is probably what users want to see on the Python level.

    Backwards compatibility problem for users who were using complicated
    structs:

    >>> from _testbuffer import *
    >>> x = ndarray([(1,1), (2,2), (3,3)], shape=[3], format='hQ')
    >>> x == memoryview(x)
    False

    Reason: While _testbuffer.ndarray already implements tolist() in full
    generality, memoryview does not:

    >>> x.tolist()
    [(1, 1), (2, 2), (3, 3)]
    
    >>> memoryview(x).tolist()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NotImplementedError: memoryview: unsupported format hQ

    So while I'm beginning to like Martin's proposal, the implementation is
    certainly trickier and will always be quite slow for complicated format
    strings.

    It would be possible to keep a fast path for the primitive C types
    and use the code from _testbuffer.tolist() for the slow cases.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 11, 2012

    Here is a patch doing the comparison on abstract values if the formats are different.

    @ncoghlan
    Copy link
    Contributor

    OK, I think I finally understand what Martin is getting at from a semantic point of view, and I think I can better explain the background of the issue and why Stefan's proposed solution is both necessary and correct.

    The ideal definition of equivalence for memory view objects would actually be:

    memoryview(x) == memoryview(y)

    if (and only if)

    memoryview(x).tolist() == memoryview(y).tolist()

    Now, in practice, this approach cannot be implemented, because there are too many format definitions (whether valid or invalid) that memoryview doesn't understand (and perhaps will never understand) and because it would be completely infeasible on large arrays with complex format definitions.

    Thus, we are forced to accept a *constraint* on memoryview's definition of equality: individual values are always compared via raw memory comparison, thus values stored using different *sizes* or *layouts* in memory will always compare as unequal, even if they would compare as equal in Python

    This is an *acceptable* constraint as, in practice, you don't perform mixed format arithmetic and it's not a problem if there's no automatic coercion between sizes and layouts.

    The Python 3.2 memoryview effectively uses memcmp() directly treating everything as a 1D array of bytes data, completely ignoring both shape *and* format data. Thus:

    >>> ab = array('b', [1, 2, 3])
    >>> ai = array('i', [1, 2, 3])
    >>> aL = array('L', [1, 2, 3])
    >>> ab == ai
    True
    >>> ab == ai == aL
    True
    >>> memoryview(ab) == memoryview(ai)
    False
    >>> memoryview(ab) == memoryview(aL)
    False
    >>> memoryview(ai) == memoryview(aL)
    False

    This approach leads to some major false positives, such as a floating point value comparing equal to an integer that happens to share the same binary representation:

    >>> af = array('f', [1.1])
    >>> ai = array('i', [1066192077])
    >>> af == ai
    False
    >>> memoryview(af) == memoryview(ai)
    True

    The changes in 3.3 are aimed primarily at *eliminating those false positives* by taking into account the shape of the array and the format of the contained values. It is *not* about changing the fundamental constraint that memoryview operates at the level of raw memory, rather than Python objects, and thus cares about memory layout details that are irrelevant after passing through the Python abstraction layer.

    This contrasts with the more limited scope of the array.array module, which *does* take into account the Python level abstractions. Thus, there will always be a discrepancy between the two definitions of equality, as memoryview cares about memory layout details, where array.array does not.

    The problem at the moment is that Python 3.3 currently has *spurious* false negatives that aren't caused by that fundamental constraint that comparisons must occur based directly on memory contents. Instead, they're being caused by memoryview returning False for any equality comparison for a format it doesn't understand. That's unacceptable, and is what Stefan's patch is intended to fix.

    @ncoghlan
    Copy link
    Contributor

    Short version:
    3.3 implemented new constraints on memoryview equality comparisons to eliminate cases that were incorrectly returning True when they should have returned False.

    However, the format constraints are currently too restrictive, so some memoryview comparisons that correctly returned True in 3.2 are now also returning False. This patch fixes *those* comparisons to once again return True.

    Moving back to deferred blocked status, since the spurious false negatives are a regression from 3.2. It's fine for beta2 to ship with this problem, but it needs to be fixed for rc1.

    @ncoghlan
    Copy link
    Contributor

    Yeah, as soon as I got your email I realised I'd only been searching for usage *in the patch* rather than the full file. Oops :(

    More comments in the review.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 15, 2012

    Here is a new patch that hopefully addresses all comments.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 17, 2012

    There is still one corner case involving NaNs: Released memoryviews always
    compare equal. I took that over from the 3.2 implementation.

    >>> import array
    >>> a = array.array('d', [float('nan')])
    >>> m = memoryview(a)
    >>> m == m
    False
    >>> m.release()
    >>> m == m
    True

    I guess we have to live with that, since it is of course impossible to access
    the values of a released view.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 19, 2012

    I think bpo-15573-struct-2.diff is ready to go and I'd rather commit
    sooner than later. Nick, can I interpret your last review comment
    as "go ahead"? :)

    @birkenfeld
    Copy link
    Member

    Small nit: when you put "versionchanged:: 3.3" there should be a hint of *what* changed exactly :)

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 19, 2012

    Right. I'm tracking all "versionchanged" issues in bpo-15724.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 24, 2012

    Is this still blocking the release? If so, it should be resolved within the next twelve hours, or else it may block the release until September.

    @birkenfeld
    Copy link
    Member

    I'm not very keen to hold up the release for long times again, especially for a patch of this size and lots of potential breakage.

    @vstinner
    Copy link
    Member

    Even if I would prefer to see a fully working new implementation of the
    buffer interface, I agree with Georg: it's too late for such huge change in
    Python 3.3. Can' we wait Python 3.4 to change the equality operator?

    It also took me three major releases to fix all Unicode issues (and it's
    probably not finished :-)).
    Le 24 août 2012 23:05, "Georg Brandl" <report@bugs.python.org> a écrit :

    Georg Brandl added the comment:

    I'm not very keen to hold up the release for long times again, especially
    for a patch of this size and lots of potential breakage.

    ----------


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue15573\>


    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 25, 2012

    The effect of the change is pretty minimal though: Previously
    "not equal" was returned for unknown formats, now the struct
    module figures it out.

    memoryobject.c has 100% coverage except for a couple of lines
    that are either impossible to reach or very hard to reach.

    The new code is among the most heavily exercised lines, since
    in test_buffer.py the verify() function always asserts equality.

    You can look at the attached memoryobject.c.gcov: *All* failure paths
    are taken since Python API functions are wrapped in macros that
    set PyExc_FailAPIError and return failure at predefined points
    the test suite.

    @ncoghlan
    Copy link
    Contributor

    With Stefan's major improvements to the test suite for 3.3, the risk is low and I *really* don't want to spend the life of the 3.3 series excusing the current comparison behaviour.

    By contrast, explaining the shift in definition as moving from raw memory comparisons to by value comparisons is relatively easy.

    @birkenfeld
    Copy link
    Member

    Well, I'm not against you committing it as long as you commit it *right now*. I'm about to cut the tag.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 25, 2012

    You can look at the attached memoryobject.c.gcov: *All* failure paths
    are taken since Python API functions are wrapped in macros that
    set PyExc_FailAPIError and return failure at predefined points
    the test suite.

    That should read: With a patch that wraps Python API functions ...

    Obviously the macros aren't in by default.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 25, 2012

    Great, give me 20 minutes or so.

    @ncoghlan
    Copy link
    Contributor

    I'm currently on IRC and double checking the patch locally.

    @ncoghlan
    Copy link
    Contributor

    Stefan, was there something specific you wanted to do before committing? Otherwise I can commit it as soon as this full test run finishes.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 25, 2012

    Except for Misc/NEWS the patch can be committed unchanged. Please
    go ahead!

    [The remaining doc changes are tracked elsewhere.]

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 25, 2012

    New changeset afa3dedfee18 by Nick Coghlan in branch 'default':
    Close bpo-15573: use value-based memoryview comparisons (patch by Stefan Krah)
    http://hg.python.org/cpython/rev/afa3dedfee18

    @python-dev python-dev mannequin closed this as completed Aug 25, 2012
    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 29, 2012

    We overlooked one thing. Since hashing is defined in terms of
    tobytes(), the equality-hash invariant is now broken:

    >>> from _testbuffer import ndarray
    >>> x = ndarray([1,2,3], shape=[3], format='f')
    >>> y = ndarray([1,2,3], shape=[3], format='B')
    >>> a = memoryview(x)
    >>> b = memoryview(y)
    >>> a == b
    True
    >>> hash(a) == hash(b)
    False
    >>> 

    This is one problem with the new equality definition. It puts
    "memoryview" firmly into array territory. I'm not saying that's
    wrong, I even think it was the intention of the PEP authors to
    have a zero copy "arrayview".

    Both array.array and numpy.array sidestep the issue by not being
    hashable.

    I don't really see a way around all this except doing slow
    element-wise hashing.

    @skrah skrah mannequin reopened this Aug 29, 2012
    @ncoghlan
    Copy link
    Contributor

    Perhaps the hash could just be based on a subset of the items? For example, hash the format, shape and the first and last items?

    Yes, such an approach means you're more likely to get collisions if you do start hashing these, but that seems better than making hashing itself excessively slow.

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 29, 2012

    I'm trying to think of an optimization for the native types. It should
    be possible to cast any native type element to unsigned char and use the
    truncated value for the bytes hash.

    Well, for double probably it's required to go from double -> int64_t ->
    unsigned char, otherwise the first cast is technically undefined for
    negative values, though it works with gcc.

    For non-native types and compound types, Nick's suggestion of
    hashing the shape and a couple of values seems to be the best
    solution.

    Should we do anything about this before 3.3.0?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 29, 2012

    1. I STRONGLY request to take hashing out of this issue. The only further action that this issue can have is complete reversal of the patch, starting over. Now that the issue has been resolved, a NEW issue arises, namely that hashing is inconsistent with equality. Please re-read the original issue: the objective of supporting unknown formats in memoryview comparisons is achieved. Remember that by reopening the issue, you are blocking the release (since it's still a release blocker).

    2. Any significant change to the 3.3.0 code base should, IMO, require an additional release candidate (since the release candidate really ought to be what is going to be released, hence the name, so NO changes between candidate and release). Do you really want to delay the release of Python 3.3 until October because of hashing of memoryview objects (which, AFAICT, only affects bytes objects in the standard library)?

    3. I don't think memoryview objects should be hashable at all. No matter what definition of hashing you chose, you likely won't manage to get it consistent with ==. Since a memoryview compares equal with its underlying object, it should also hash equal.

    4. (unrelated to hashing, related to the original issue) I think it will be possible to create non-transitive equalities, where A==memoryview(A)==memoryview(B)==B, yet not (A==B)

    5. I'm not sure whether the hash of a memoryview will surely be constant (as it should): isn't it possible to create a read-only buffer for a mutable memory block (not PyObject), so that any hashing considering the contents will result in changing hashes? Or, since the hash is cached, couldn't that cause the hash to deviate from ==?

    @skrah
    Copy link
    Mannequin Author

    skrah mannequin commented Aug 29, 2012

    I agree that this isn't a blocker due to the limited use of
    memoryview hashing. Tracking this in bpo-15814.

    @skrah skrah mannequin closed this as completed Aug 29, 2012
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 2, 2012

    New changeset 969069d464bc by Stefan Krah in branch '3.3':
    Issue bpo-15814: Use hash function that is compatible with the equality
    http://hg.python.org/cpython/rev/969069d464bc

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants