Support unknown formats in memoryview comparisons #59778

skrah · 2012-08-07T11:55:30Z

BPO	15573
Nosy	@loewis, @birkenfeld, @mdickinson, @ncoghlan, @pitrou, @vstinner, @tiran, @skrah, @meadori
Files	format.c issue15573.diff abstractcmp.diff issue15573-struct.diff: full struct module support for comparisons issue15573-struct-2.diff memoryobject.c.gcov

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/skrah'
closed_at = <Date 2012-08-29.17:09:27.233>
created_at = <Date 2012-08-07.11:55:30.289>
labels = ['interpreter-core', 'type-bug', 'release-blocker']
title = 'Support unknown formats in memoryview comparisons'
updated_at = <Date 2012-11-02.17:07:02.101>
user = 'https://github.com/skrah'

bugs.python.org fields:

activity = <Date 2012-11-02.17:07:02.101>
actor = 'python-dev'
assignee = 'skrah'
closed = True
closed_date = <Date 2012-08-29.17:09:27.233>
closer = 'skrah'
components = ['Interpreter Core']
creation = <Date 2012-08-07.11:55:30.289>
creator = 'skrah'
dependencies = []
files = ['26720', '26725', '26766', '26800', '26831', '26992']
hgrepos = []
issue_num = 15573
keywords = ['patch']
message_count = 63.0
messages = ['167618', '167621', '167655', '167687', '167838', '167844', '167855', '167856', '167860', '167862', '167884', '167892', '167899', '167902', '167908', '167911', '167912', '167930', '167946', '167981', '167982', '167984', '167987', '167990', '167991', '167993', '167994', '168000', '168002', '168003', '168004', '168006', '168010', '168028', '168029', '168031', '168032', '168188', '168202', '168257', '168301', '168440', '168559', '168560', '168564', '169023', '169085', '169106', '169107', '169108', '169109', '169110', '169111', '169112', '169113', '169114', '169115', '169347', '169355', '169388', '169394', '169399', '174541']
nosy_count = 11.0
nosy_names = ['loewis', 'georg.brandl', 'mark.dickinson', 'ncoghlan', 'pitrou', 'vstinner', 'christian.heimes', 'Arfrever', 'skrah', 'meador.inge', 'python-dev']
pr_nums = []
priority = 'release blocker'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue15573'
versions = ['Python 3.3']

skrah · 2012-08-07T11:55:28Z

Continuing the discussion from bpo-13072. I hit a snag here:

Determining in full generality whether two format strings describe
identical items is pretty complicated, see also bpo-3132.

I'm attaching a best effort fmtcmp() function that should do the
following:

recognize byte order specifiers at the start of the string.
recognize if an explicitly specified byte order happens to
match the native byte order.

It won't catch:

byte order specifiers anywhere in the string.
C types that happen to be identical ('I', 'L' on a 32-bit
platform). I'm also not sure if that is desirable in the
first place.
???

So fmtcmp() will return false negatives (not equal), but should be
correct for *most* format strings that are actually in use.

Mark, Meador: You did a lot of work on the struct module and of
course on issue bpo-3132. Does this look like a reasonable compromise?
Did I miss obvious cases (see attachment)?

ncoghlan · 2012-08-07T13:06:23Z

I confess I was thinking of an even simpler "format strings must be identical" fallback, but agree your way is better, since it reproduces the 3.2 behaviour in many more cases where ignoring the format string actually did the right thing.

The struct docs for the byte order specifier specifically say "the first character of the format string can be used to indicate the byte order, size and alignment of the packed data", so treating format strings that include byte order markers elsewhere in the string as differing from each other if those markers are in different locations sounds fine to me.

meadori · 2012-08-08T02:52:22Z

I agree that the general case is complicated. It will get even more complicated if the full of PEP-3118 gets implemented since it turns into a tree comparison. In general, I think you will probably have to compute some canonical form and then compare the canonical forms.

Here are a few more cases that don't work out in the attached algorithm:

Repeat characters - '2c' == 'cc'
Whitespace - 'h h' == 'hh'

Also, currently the byte order specifiers are always at the beginning of the string. We discussed in bpo-3132 scoping them per the nested structures, but decided to drop that unless somebody barks about it since it is fairly complicated without a clear benefit. So, I wouldn't worry about them being scattered through the string.

This seems like sort of a slippery slope. I need to think about it more, but my first impression is that coming up with some way to compare format strings is going to be nasty.

skrah · 2012-08-08T11:31:55Z

Right, byte order specifiers are always at the beginning of the string.
That is at least something. I wonder if we should tighten PEP-3118 to
demand a canonical form of format strings, such as (probably incomplete):

Whitespace is disallowed.
Except for 's', no zero count may be given.
A count of 1 (redundant) is disallowed.
Repeats must be specified in terms of count + single char.

That still leaves the '=I' != '=L' problem. Why are there two
specifiers describing uint32_t?

Anyway, as Meador says, this can get tricky and I don't think this
can be resolved before beta-2. I'm attaching a patch that should
behave well for the restricted canonical form at least.

loewis · 2012-08-09T21:47:04Z

Can someone please explain why this is a release blocker? What is the specific regression from 3.2 that this deals with?

vstinner · 2012-08-09T22:06:00Z

What is the specific regression from 3.2 that this deals with?

I don't know if it must be called a regression, but at least the behaviour is different in Python 3.2 and 3.3. For example, an Unicode array is no more equal to its memoryview:

Python 3.3.0b1 (default:aaa68dce117e, Aug  9 2012, 22:45:00)                                                                                                   
[GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux                                                                                                                
Type "help", "copyright", "credits" or "license" for more information.                                                                                         
>>> import array                                                                                                                                               
>>> a=array.array('u', 'abc')
>>> v=memoryview(a)                                                                                                                                            
>>> a == v                                                                                                                                                     
False                                                                                                                                                          

ned$ python3
Python 3.2.3 (default, Jun  8 2012, 05:40:07)                                                                                                                  
[GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux2                                                                                                               
Type "help", "copyright", "credits" or "license" for more information.                                                                                         
>>> import array                                                                                                                                               
>>> a=array.array('u', 'abc')
>>> v=memoryview(a)                                                                                                                                            
>>> a == v                                                                                                                                                     
True

loewis · 2012-08-10T06:29:49Z

haypo: thanks for stating the issue.

ISTM that this classifies as an "obscure" bug: you have to use memoryviews, and you need to compare them for equality, and the comparison needs to be "non-trivial", where "trivial" is defined
by "both are 1D byte arrays".

While this is a bug, I think it still can be fixed in a bug fix
release of 3.3, so un-blocking.

I also think that as a first step, a specification needs to be
drafted defining when exactly a memory view should compare equal
with some other object. I can easily provide a specification that
makes the current implementation "correct".

skrah · 2012-08-10T07:48:38Z

I can easily provide a specification that makes the current implementation "correct"

Yes, the current specification is: memoryview only attempts to
compare arrays with known (single character native) formats and
returns "not equal" otherwise.

The problem is that for backwards compatibility memoryview accepts
arrays with arbitrary format strings. In operations like tolist()
it's possible to raise NotImplemented, but for equality comparisons
that's not possible.

Note that in 3.2 memoryview would return "equal" for arrays that
simply aren't equal, if those arrays happen to have the same bit
pattern.

One way to deal with this is to demand a strict canonical form
of format strings for PEP-3118, see msg167687.

loewis · 2012-08-10T08:53:28Z

Note that in 3.2 memoryview would return "equal" for arrays that
simply aren't equal, if those arrays happen to have the same bit
pattern.

This is exactly my point. If a "memoryview" has the same "memory"
as another, why are they then not rightfully considered equal?
IOW, the 3.2 behavior looks fine to me.

You apparently have a vision that equality should mean something
different for memoryviews - please explicitly state what that
view is. A mathematical definition ("two memoryviews A and B
are equal iff ...") would be appreciated.

One way to deal with this is to demand a strict canonical form
of format strings for PEP-3118, see msg167687.

You are talking about the solution already - I still don't know
what the problem is exactly (not that I *need* to understand
the problem, but at a minimum, the documentation should state
what the intended behavior is - better if people would also
agree that the proposed behavior is "reasonable").

For 3.3, I see two approaches: either move backwards to the
3.2 behavior and defer this change to 3.4 - this would make
it release-critical indeed. Or move forward to a yet-to-be
specified equality operation which as of now may not be
implemented correctly, treating any improvement to it
as a bug fix.

skrah · 2012-08-10T09:53:15Z

PEP-3118 specifies strongly typed multi-dimensional arrays. The existing
code in the 3.2 memoryview as well as numerous comments by Travis Oliphant
make it clear that these capabilities were intended from the start for
memoryview as well.

Perhaps the name "memoryview" is a misnomer, since the boundaries between
memoryview and NumPy's ndarray become blurry. In fact, the small
implementation of an ndarray in Modules/_testbuffer.c is *also* a memoryview
in some sense, since it can grab a buffer from an exporter and expose it in
the same manner as memoryview.

So what I implemented is certainly not only *my* vision. The PEP essentially
describes NumPy arrays with an interchange format to convert between NumPy
and PIL arrays.

It is perhaps unfortunate that the protocol was named "buffer" protocol,
since it is actually an advanced "array" protocol.

NumPy arrays don't care about the raw memory. It is always the logical array
that matters. For example, Fortran and C arrays have different bit patterns
in memory but compare equal, a fact that the 3.2 implementation completely
misses.

Arrays v and w are equal iff format and shape are equal and for all valid
indices allowed by shape

memcmp((char *)PyBuffer_GetPointer(v, indices),
(char *)PyBuffer_GetPointer(w, indices),
itemsize) == 0.

Equal formats of course imply equal itemsize.

loewis · 2012-08-10T16:04:14Z

Can you please elaborate on the specification:

what does it mean that the formats of v and w are equal?
Victor's clarification about this issue isn't about comparing two arrays, but an array with a string object. So: when is an array equal to some other (non-array) object?

skrah · 2012-08-10T16:46:42Z

what does it mean that the formats of v and w are equal?

I'm using array and Py_buffer interchangeably since a Py_buffer struct
actually describes a multi-dimensional array. v and w are Py_buffer
structs.

So v.format must equal w.format, where format is a format string in
struct module syntax. The topic of this issue is to determine under
what circumstances two strings in struct module syntax are considered
equal.

Victor's clarification about this issue isn't about comparing
two arrays, but an array with a string object. So: when is an
array equal to some other (non-array) object?

>>> a=array.array('u', 'abc')
>>> v=memoryview(a)
>>> a == v
False

memoryview can compare against any object with a getbufferproc, in this
case array.array. memoryview_richcompare() calls PyObject_GetBuffer(other)
and proceeds to compare its own internal Py_buffer v against the obtained
Py_buffer w.

In the case of v.format == w.format the fix for unknown formats is trivial:
Just allow the comparison using v.itemsize == w.itemsize.

However, the struct module format string syntax has multiple representations
for the exact same formats, which makes a general fmtcmp() function tricky
to write.

Hence my proposal to demand a strict canonical form for PEP-3118 format
strings, which would be a proper subset of struct module format strings.

Example: "1Q 1h 1h 0c" must be written as "Q2h"

The attached patch should largely implement this proposal. A canonical form
is perhaps not such a severe restriction, since format strings should usually
come from the exporting object. E.g. NumPy must translate its own internal
format to struct module syntax anyway.

Another option is to commit the patch that misses "1Q 1h 1h 0c" == "Q2h"
now and aim for a completely general fmtcmp() function later.

IMO any general fmtcmp() function should also be reasonably fast.

loewis · 2012-08-10T17:47:09Z

So v.format must equal w.format, where format is a format string in
struct module syntax. The topic of this issue is to determine under
what circumstances two strings in struct module syntax are considered
equal.

And that is exactly my question: We don't need a patch implementing
it (yet), but a specification of what is to be implemented first.

I know when two strings are equal (regardless of their syntax):
if they have the same length, and contain the same characters in
the same order. Apparently, you have a different notion of "equal"
for strings in mind, please be explicitly what that notion is.

memoryview can compare against any object with a getbufferproc, in this
case array.array. memoryview_richcompare() calls PyObject_GetBuffer(other)
and proceeds to compare its own internal Py_buffer v against the obtained
Py_buffer w.

Can this be expressed on Python level as well? I.e. is it correct
to say: an array/buffer/memoryview A is equal to an object O iff
A is equal to memoryview(O)? Or could it be that these two equivalences
might reasonably differ?

Hence my proposal to demand a strict canonical form for PEP-3118 format
strings, which would be a proper subset of struct module format strings.

Can you kindly phrase this as a specification? Who is demanding what
from whom?

Proposal: two format strings are equal if their canonical forms
are equal strings. The canonical form C of a string S is created by ???

However, it appears that you may have something different in mind
where things are rejected/fail to work if the canonical form isn't
originally provided by somebody (whom?)

So another Proposal: two format strings are equal iff they are
in both in canonical form and are equal strings.

This would imply that a format string which is not in canonical
form is not equal to any other strings, not even to itself, so
this may still not be what you want. But I can't guess what it
is that you want.

vstinner · 2012-08-10T18:01:10Z

Can't we start with something simple (for ptyhon 3.3?), and elaborate
later? In my specific example, both object have the same format string and
the same content. So i expect that they are equal.
Le 10 août 2012 19:47, "Martin v. Löwis" <report@bugs.python.org> a écrit :

Martin v. Löwis added the comment:

> So v.format must equal w.format, where format is a format string in
> struct module syntax. The topic of this issue is to determine under
> what circumstances two strings in struct module syntax are considered
> equal.

And that is exactly my question: We don't need a patch implementing
it (yet), but a specification of what is to be implemented first.

I know when two strings are equal (regardless of their syntax):
if they have the same length, and contain the same characters in
the same order. Apparently, you have a different notion of "equal"
for strings in mind, please be explicitly what that notion is.

> memoryview can compare against any object with a getbufferproc, in this
> case array.array. memoryview_richcompare() calls
PyObject_GetBuffer(other)
> and proceeds to compare its own internal Py_buffer v against the obtained
> Py_buffer w.

Can this be expressed on Python level as well? I.e. is it correct
to say: an array/buffer/memoryview A is equal to an object O iff
A is equal to memoryview(O)? Or could it be that these two equivalences
might reasonably differ?

> Hence my proposal to demand a strict canonical form for PEP-3118 format
> strings, which would be a proper subset of struct module format strings.

Can you kindly phrase this as a specification? Who is demanding what
from whom?

Proposal: two format strings are equal if their canonical forms
are equal strings. The canonical form C of a string S is created by ???

However, it appears that you may have something different in mind
where things are rejected/fail to work if the canonical form isn't
originally provided by somebody (whom?)

So another Proposal: two format strings are equal iff they are
in both in canonical form and are equal strings.

This would imply that a format string which is not in canonical
form is not equal to any other strings, not even to itself, so
this may still not be what you want. But I can't guess what it
is that you want.

----------

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue15573\>

loewis · 2012-08-10T19:18:07Z

Can't we start with something simple (for ptyhon 3.3?), and elaborate
later?

Sure: someone would have to make a proposal what exactly that is.
IMO, the 3.2 definition *was* simple, but apparently it was considered
too simple. So either Stefan gets to define his view of equality, or
somebody else needs to make a (specific) counter-proposal.

skrah · 2012-08-10T20:05:48Z

The ideal specification is:

Arrays v and w are equal iff format and shape are equal and for all valid
indices allowed by shape

memcmp((char *)PyBuffer_GetPointer(v, indices),
(char *)PyBuffer_GetPointer(w, indices),
itemsize) == 0.
Two format strings s and t are equal if canonical(s) == canonical(t).

End ideal specification.

Purely to *facilitate* the implementation of a format comparison function,
I suggested:

An exporter must initialize the format field of a Py_buffer structure
with canonical(s).

If *all* exporters obey 3), a format comparison function can simply
call strcmp(s, t) (after sorting out the byte order specifier).

Specifically, if x and y are equal, then:

a) x == memoryview(x) == memoryview(y) == y

If x and y are equal and exporter x does *not* obey 3), but exporter y does,
then:

b) x == memoryview(x) != memoryview(y) == y

Under rule 3) this would be the fault of exporter x.

For Python 3.3 it is also possible to state only 1) and 2), with a caveat in
the documentation that case b) might occur until the format comparison function
in memoryview implements the reductions to canonical forms.

The problem is that reductions to canonical forms might get very complicated
if bpo-3132 is implemented.

skrah · 2012-08-10T20:53:41Z

Martin v. Loewis <report@bugs.python.org> wrote:

Sure: someone would have to make a proposal what exactly that is.
IMO, the 3.2 definition *was* simple, but apparently it was considered
too simple.

It was simply broken in multiple ways. Example:

>>> from numpy import *
>>> x = array([1,2,3,4,5], dtype='B')
>>> y = array([5,4,3,2,1], dtype='B')
>>> z = y[::-1]
>>> 
>>> x == z
array([ True,  True,  True,  True,  True], dtype=bool)
>>> memoryview(x) == memoryview(z)
False
Segmentation fault

I'm not even talking about the segfault here. Note that x == z, but
memoryview(x) != memoryview(z), because the logical structure is
not taken into account.

Likewise, one could construct cases where one array contains a float
NaN and the other an integer that happens to have the same bit pattern.

The arrays would not be equal, but their memoryviews would be equal.

So either Stefan gets to define his view of equality, or
somebody else needs to make a (specific) counter-proposal.

The view is defined by the PEP that clearly models NumPy. I'm curious what
counter-proposal will work with NumPy and PIL.

loewis · 2012-08-11T04:15:58Z

I find Stefan's proposed equality confusing. Why is it based on memcmp? Either it compares memory (i.e. internal representations), or it compares abstract values. If it compares abstract values, then it shouldn't be a requirement that the format strings are equal in any sense. Instead, the resulting values should be equal. So I propose this definition:

v == w iff v.shape() == w.shape() and v.tolist() == w.tolist()
if either operation fails with an exception, the objects are not equal

Of course, the implementation doesn't need to literally call tolist; instead, behaving as-if it had been called is fine. However, as time
is running out, I would actually propose this to be the implementation
in 3.3.

In addition, I would propose to support the 'u' and 'w' codes in tolist, to resolve what Victor says the actual issue is.

I'm -1 on a definition that involves equivalence of format strings.

ncoghlan · 2012-08-11T09:42:46Z

Stefan's proposed definition is correct. Shapes define array types, format
strings define entry types, then the actual memory contents define the
value.

It's not "Stefan's definition of equivalence", it's what a statically typed
array *means*.

The 3.2 way is completely broken, as it considers arrays containing
completely different data as equal if the memory layout happens to be the
same by coincidence.

3.3 is currently also broken, as it considers arrays that *do* contain the
same values to be different.

Stefan's patch fixes that by checking the shape and format first, and
*then* checking the value. It's exactly the same as doing an instance check
in an __eq__ method.

The requirement for a canonical format is for sanity's sake: the general
equivalence classes are too hard to figure out.

loewis · 2012-08-11T17:31:03Z

Nick: I still disagree. Would you agree that array.array constitutes a "statically typed array"? Yet

py> array.array('b',b'foo') == array.array('B',b'foo')
True
py> array.array('i',[1,2,3]) == array.array('L', [1,2,3])
True

So the array object (rightfully) performs comparison on abstract values, not on memory representation. In Python, a statically typed array still conceptually contains abstract values, not memory blocks (this is also what Stefan asserts for NumPy in msg167862). The static typing only restricts the values you can store in the container, and defines the internal representation on the C level (plus it typically implies a value storage, instead of a reference storage).

With your and Stefan's proposed semantics, we would get the weird case that for two array.arrays a and b, it might happen that

a == b and memoryview(a) != memoryview(b)

loewis · 2012-08-11T17:41:55Z

"it might *still* happen", I should say, since this problem is exactly what Victor says this issue intends to solve (for comparison of two 'u' arrays).

skrah · 2012-08-11T18:13:06Z

So we have two competing proposals:

Py_buffers are strongly typed arrays in the ML sense (e.g. array of float,
array of int).

This is probably what I'd prefer on the C level, imagine a function like
function like PyBuffer_Compare(v, w).

Backwards compatibility problem for users who were thinking in terms of
value comparisons:

>>> x = array.array('b', [127])
>>> y = array.array('B', [127])
>>> x == y
True
>>> memoryview(x) == memoryview(y)
False

Compare by value, like NumPy arrays do:

>>> x = numpy.array([1, 2, 3], dtype='i')
>>> y = numpy.array([1, 2, 3], dtype='f')
>>> x == y
array([True,  True,  True], dtype=bool)

I concede that this is probably what users want to see on the Python level.

Backwards compatibility problem for users who were using complicated
structs:

>>> from _testbuffer import *
>>> x = ndarray([(1,1), (2,2), (3,3)], shape=[3], format='hQ')
>>> x == memoryview(x)
False

Reason: While _testbuffer.ndarray already implements tolist() in full
generality, memoryview does not:

>>> x.tolist()
[(1, 1), (2, 2), (3, 3)]

>>> memoryview(x).tolist()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: memoryview: unsupported format hQ

So while I'm beginning to like Martin's proposal, the implementation is
certainly trickier and will always be quite slow for complicated format
strings.

It would be possible to keep a fast path for the primitive C types
and use the code from _testbuffer.tolist() for the slow cases.

loewis · 2012-08-11T18:22:36Z

Here is a patch doing the comparison on abstract values if the formats are different.

ncoghlan · 2012-08-11T18:58:06Z

OK, I think I finally understand what Martin is getting at from a semantic point of view, and I think I can better explain the background of the issue and why Stefan's proposed solution is both necessary and correct.

The ideal definition of equivalence for memory view objects would actually be:

memoryview(x) == memoryview(y)

if (and only if)

memoryview(x).tolist() == memoryview(y).tolist()

Now, in practice, this approach cannot be implemented, because there are too many format definitions (whether valid or invalid) that memoryview doesn't understand (and perhaps will never understand) and because it would be completely infeasible on large arrays with complex format definitions.

Thus, we are forced to accept a *constraint* on memoryview's definition of equality: individual values are always compared via raw memory comparison, thus values stored using different *sizes* or *layouts* in memory will always compare as unequal, even if they would compare as equal in Python

This is an *acceptable* constraint as, in practice, you don't perform mixed format arithmetic and it's not a problem if there's no automatic coercion between sizes and layouts.

The Python 3.2 memoryview effectively uses memcmp() directly treating everything as a 1D array of bytes data, completely ignoring both shape *and* format data. Thus:

>>> ab = array('b', [1, 2, 3])
>>> ai = array('i', [1, 2, 3])
>>> aL = array('L', [1, 2, 3])
>>> ab == ai
True
>>> ab == ai == aL
True
>>> memoryview(ab) == memoryview(ai)
False
>>> memoryview(ab) == memoryview(aL)
False
>>> memoryview(ai) == memoryview(aL)
False

This approach leads to some major false positives, such as a floating point value comparing equal to an integer that happens to share the same binary representation:

>>> af = array('f', [1.1])
>>> ai = array('i', [1066192077])
>>> af == ai
False
>>> memoryview(af) == memoryview(ai)
True

The changes in 3.3 are aimed primarily at *eliminating those false positives* by taking into account the shape of the array and the format of the contained values. It is *not* about changing the fundamental constraint that memoryview operates at the level of raw memory, rather than Python objects, and thus cares about memory layout details that are irrelevant after passing through the Python abstraction layer.

This contrasts with the more limited scope of the array.array module, which *does* take into account the Python level abstractions. Thus, there will always be a discrepancy between the two definitions of equality, as memoryview cares about memory layout details, where array.array does not.

The problem at the moment is that Python 3.3 currently has *spurious* false negatives that aren't caused by that fundamental constraint that comparisons must occur based directly on memory contents. Instead, they're being caused by memoryview returning False for any equality comparison for a format it doesn't understand. That's unacceptable, and is what Stefan's patch is intended to fix.

ncoghlan · 2012-08-11T19:03:28Z

Short version:
3.3 implemented new constraints on memoryview equality comparisons to eliminate cases that were incorrectly returning True when they should have returned False.

However, the format constraints are currently too restrictive, so some memoryview comparisons that correctly returned True in 3.2 are now also returning False. This patch fixes *those* comparisons to once again return True.

Moving back to deferred blocked status, since the spurious false negatives are a regression from 3.2. It's fine for beta2 to ship with this problem, but it needs to be fixed for rc1.

ncoghlan · 2012-08-15T03:41:21Z

Yeah, as soon as I got your email I realised I'd only been searching for usage *in the patch* rather than the full file. Oops :(

More comments in the review.

skrah · 2012-08-15T14:37:17Z

Here is a new patch that hopefully addresses all comments.

skrah · 2012-08-17T11:48:30Z

There is still one corner case involving NaNs: Released memoryviews always
compare equal. I took that over from the 3.2 implementation.

>>> import array
>>> a = array.array('d', [float('nan')])
>>> m = memoryview(a)
>>> m == m
False
>>> m.release()
>>> m == m
True

I guess we have to live with that, since it is of course impossible to access
the values of a released view.

skrah · 2012-08-19T11:30:02Z

I think bpo-15573-struct-2.diff is ready to go and I'd rather commit
sooner than later. Nick, can I interpret your last review comment
as "go ahead"? :)

birkenfeld · 2012-08-19T11:36:06Z

Small nit: when you put "versionchanged:: 3.3" there should be a hint of *what* changed exactly :)

skrah · 2012-08-19T11:56:46Z

Right. I'm tracking all "versionchanged" issues in bpo-15724.

loewis · 2012-08-24T14:22:47Z

Is this still blocking the release? If so, it should be resolved within the next twelve hours, or else it may block the release until September.

birkenfeld · 2012-08-24T21:05:36Z

I'm not very keen to hold up the release for long times again, especially for a patch of this size and lots of potential breakage.

vstinner · 2012-08-25T07:09:44Z

Even if I would prefer to see a fully working new implementation of the
buffer interface, I agree with Georg: it's too late for such huge change in
Python 3.3. Can' we wait Python 3.4 to change the equality operator?

It also took me three major releases to fix all Unicode issues (and it's
probably not finished :-)).
Le 24 août 2012 23:05, "Georg Brandl" <report@bugs.python.org> a écrit :

Georg Brandl added the comment:

I'm not very keen to hold up the release for long times again, especially
for a patch of this size and lots of potential breakage.

----------

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue15573\>

skrah · 2012-08-25T07:28:51Z

The effect of the change is pretty minimal though: Previously
"not equal" was returned for unknown formats, now the struct
module figures it out.

memoryobject.c has 100% coverage except for a couple of lines
that are either impossible to reach or very hard to reach.

The new code is among the most heavily exercised lines, since
in test_buffer.py the verify() function always asserts equality.

You can look at the attached memoryobject.c.gcov: *All* failure paths
are taken since Python API functions are wrapped in macros that
set PyExc_FailAPIError and return failure at predefined points
the test suite.

ncoghlan · 2012-08-25T07:39:42Z

With Stefan's major improvements to the test suite for 3.3, the risk is low and I *really* don't want to spend the life of the 3.3 series excusing the current comparison behaviour.

By contrast, explaining the shift in definition as moving from raw memory comparisons to by value comparisons is relatively easy.

birkenfeld · 2012-08-25T07:41:50Z

Well, I'm not against you committing it as long as you commit it *right now*. I'm about to cut the tag.

skrah · 2012-08-25T07:44:15Z

You can look at the attached memoryobject.c.gcov: *All* failure paths
are taken since Python API functions are wrapped in macros that
set PyExc_FailAPIError and return failure at predefined points
the test suite.

That should read: With a patch that wraps Python API functions ...

Obviously the macros aren't in by default.

skrah · 2012-08-25T07:45:39Z

Great, give me 20 minutes or so.

ncoghlan · 2012-08-25T07:47:22Z

I'm currently on IRC and double checking the patch locally.

ncoghlan · 2012-08-25T07:49:18Z

Stefan, was there something specific you wanted to do before committing? Otherwise I can commit it as soon as this full test run finishes.

skrah · 2012-08-25T07:56:30Z

Except for Misc/NEWS the patch can be committed unchanged. Please
go ahead!

[The remaining doc changes are tracked elsewhere.]

python-dev · 2012-08-25T08:00:01Z

New changeset afa3dedfee18 by Nick Coghlan in branch 'default':
Close bpo-15573: use value-based memoryview comparisons (patch by Stefan Krah)
http://hg.python.org/cpython/rev/afa3dedfee18

skrah · 2012-08-29T07:36:19Z

We overlooked one thing. Since hashing is defined in terms of
tobytes(), the equality-hash invariant is now broken:

>>> from _testbuffer import ndarray
>>> x = ndarray([1,2,3], shape=[3], format='f')
>>> y = ndarray([1,2,3], shape=[3], format='B')
>>> a = memoryview(x)
>>> b = memoryview(y)
>>> a == b
True
>>> hash(a) == hash(b)
False
>>>

This is one problem with the new equality definition. It puts
"memoryview" firmly into array territory. I'm not saying that's
wrong, I even think it was the intention of the PEP authors to
have a zero copy "arrayview".

Both array.array and numpy.array sidestep the issue by not being
hashable.

I don't really see a way around all this except doing slow
element-wise hashing.

ncoghlan · 2012-08-29T09:51:27Z

Perhaps the hash could just be based on a subset of the items? For example, hash the format, shape and the first and last items?

Yes, such an approach means you're more likely to get collisions if you do start hashing these, but that seems better than making hashing itself excessively slow.

skrah · 2012-08-29T15:23:06Z

I'm trying to think of an optimization for the native types. It should
be possible to cast any native type element to unsigned char and use the
truncated value for the bytes hash.

Well, for double probably it's required to go from double -> int64_t ->
unsigned char, otherwise the first cast is technically undefined for
negative values, though it works with gcc.

For non-native types and compound types, Nick's suggestion of
hashing the shape and a couple of values seems to be the best
solution.

Should we do anything about this before 3.3.0?

loewis · 2012-08-29T16:39:50Z

I STRONGLY request to take hashing out of this issue. The only further action that this issue can have is complete reversal of the patch, starting over. Now that the issue has been resolved, a NEW issue arises, namely that hashing is inconsistent with equality. Please re-read the original issue: the objective of supporting unknown formats in memoryview comparisons is achieved. Remember that by reopening the issue, you are blocking the release (since it's still a release blocker).
Any significant change to the 3.3.0 code base should, IMO, require an additional release candidate (since the release candidate really ought to be what is going to be released, hence the name, so NO changes between candidate and release). Do you really want to delay the release of Python 3.3 until October because of hashing of memoryview objects (which, AFAICT, only affects bytes objects in the standard library)?
I don't think memoryview objects should be hashable at all. No matter what definition of hashing you chose, you likely won't manage to get it consistent with ==. Since a memoryview compares equal with its underlying object, it should also hash equal.
(unrelated to hashing, related to the original issue) I think it will be possible to create non-transitive equalities, where A==memoryview(A)==memoryview(B)==B, yet not (A==B)
I'm not sure whether the hash of a memoryview will surely be constant (as it should): isn't it possible to create a read-only buffer for a mutable memory block (not PyObject), so that any hashing considering the contents will result in changing hashes? Or, since the hash is cached, couldn't that cause the hash to deviate from ==?

skrah · 2012-08-29T17:09:27Z

I agree that this isn't a blocker due to the limited use of
memoryview hashing. Tracking this in bpo-15814.

python-dev · 2012-11-02T17:07:02Z

New changeset 969069d464bc by Stefan Krah in branch '3.3':
Issue bpo-15814: Use hash function that is compatible with the equality
http://hg.python.org/cpython/rev/969069d464bc

skrah mannequin added the release-blocker label Aug 7, 2012

skrah mannequin self-assigned this Aug 7, 2012

skrah mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Aug 7, 2012

loewis mannequin removed the release-blocker label Aug 10, 2012

birkenfeld added release-blocker and removed deferred-blocker labels Aug 19, 2012

python-dev mannequin closed this as completed Aug 25, 2012

skrah mannequin reopened this Aug 29, 2012

skrah mannequin closed this as completed Aug 29, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022

Support unknown formats in memoryview comparisons #59778

Support unknown formats in memoryview comparisons #59778

Comments

skrah mannequin commented Aug 7, 2012

skrah mannequin commented Aug 7, 2012

ncoghlan commented Aug 7, 2012

meadori commented Aug 8, 2012

skrah mannequin commented Aug 8, 2012

loewis mannequin commented Aug 9, 2012

vstinner commented Aug 9, 2012

loewis mannequin commented Aug 10, 2012

skrah mannequin commented Aug 10, 2012

loewis mannequin commented Aug 10, 2012

skrah mannequin commented Aug 10, 2012

loewis mannequin commented Aug 10, 2012

skrah mannequin commented Aug 10, 2012

loewis mannequin commented Aug 10, 2012

vstinner commented Aug 10, 2012

loewis mannequin commented Aug 10, 2012

skrah mannequin commented Aug 10, 2012

skrah mannequin commented Aug 10, 2012

loewis mannequin commented Aug 11, 2012

ncoghlan commented Aug 11, 2012

loewis mannequin commented Aug 11, 2012

loewis mannequin commented Aug 11, 2012

skrah mannequin commented Aug 11, 2012

loewis mannequin commented Aug 11, 2012

ncoghlan commented Aug 11, 2012

ncoghlan commented Aug 11, 2012

ncoghlan commented Aug 15, 2012

skrah mannequin commented Aug 15, 2012

skrah mannequin commented Aug 17, 2012

skrah mannequin commented Aug 19, 2012

birkenfeld commented Aug 19, 2012

skrah mannequin commented Aug 19, 2012

loewis mannequin commented Aug 24, 2012

birkenfeld commented Aug 24, 2012

vstinner commented Aug 25, 2012

skrah mannequin commented Aug 25, 2012

ncoghlan commented Aug 25, 2012

birkenfeld commented Aug 25, 2012

skrah mannequin commented Aug 25, 2012

skrah mannequin commented Aug 25, 2012

ncoghlan commented Aug 25, 2012

ncoghlan commented Aug 25, 2012

skrah mannequin commented Aug 25, 2012

python-dev mannequin commented Aug 25, 2012

skrah mannequin commented Aug 29, 2012

ncoghlan commented Aug 29, 2012

skrah mannequin commented Aug 29, 2012

loewis mannequin commented Aug 29, 2012

skrah mannequin commented Aug 29, 2012

python-dev mannequin commented Nov 2, 2012