New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support unknown formats in memoryview comparisons #59778
Comments
Continuing the discussion from bpo-13072. I hit a snag here: Determining in full generality whether two format strings describe I'm attaching a best effort fmtcmp() function that should do the
It won't catch:
So fmtcmp() will return false negatives (not equal), but should be Mark, Meador: You did a lot of work on the struct module and of |
I confess I was thinking of an even simpler "format strings must be identical" fallback, but agree your way is better, since it reproduces the 3.2 behaviour in many more cases where ignoring the format string actually did the right thing. The struct docs for the byte order specifier specifically say "the first character of the format string can be used to indicate the byte order, size and alignment of the packed data", so treating format strings that include byte order markers elsewhere in the string as differing from each other if those markers are in different locations sounds fine to me. |
I agree that the general case is complicated. It will get even more complicated if the full of PEP-3118 gets implemented since it turns into a tree comparison. In general, I think you will probably have to compute some canonical form and then compare the canonical forms. Here are a few more cases that don't work out in the attached algorithm:
Also, currently the byte order specifiers are always at the beginning of the string. We discussed in bpo-3132 scoping them per the nested structures, but decided to drop that unless somebody barks about it since it is fairly complicated without a clear benefit. So, I wouldn't worry about them being scattered through the string. This seems like sort of a slippery slope. I need to think about it more, but my first impression is that coming up with some way to compare format strings is going to be nasty. |
Right, byte order specifiers are always at the beginning of the string.
That still leaves the '=I' != '=L' problem. Why are there two Anyway, as Meador says, this can get tricky and I don't think this |
Can someone please explain why this is a release blocker? What is the specific regression from 3.2 that this deals with? |
I don't know if it must be called a regression, but at least the behaviour is different in Python 3.2 and 3.3. For example, an Unicode array is no more equal to its memoryview: Python 3.3.0b1 (default:aaa68dce117e, Aug 9 2012, 22:45:00)
[GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import array
>>> a=array.array('u', 'abc')
>>> v=memoryview(a)
>>> a == v
False
ned$ python3
Python 3.2.3 (default, Jun 8 2012, 05:40:07)
[GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import array
>>> a=array.array('u', 'abc')
>>> v=memoryview(a)
>>> a == v
True |
haypo: thanks for stating the issue. ISTM that this classifies as an "obscure" bug: you have to use memoryviews, and you need to compare them for equality, and the comparison needs to be "non-trivial", where "trivial" is defined While this is a bug, I think it still can be fixed in a bug fix I also think that as a first step, a specification needs to be |
Yes, the current specification is: memoryview only attempts to The problem is that for backwards compatibility memoryview accepts Note that in 3.2 memoryview would return "equal" for arrays that One way to deal with this is to demand a strict canonical form |
This is exactly my point. If a "memoryview" has the same "memory" You apparently have a vision that equality should mean something
You are talking about the solution already - I still don't know For 3.3, I see two approaches: either move backwards to the |
PEP-3118 specifies strongly typed multi-dimensional arrays. The existing Perhaps the name "memoryview" is a misnomer, since the boundaries between So what I implemented is certainly not only *my* vision. The PEP essentially It is perhaps unfortunate that the protocol was named "buffer" protocol, NumPy arrays don't care about the raw memory. It is always the logical array Arrays v and w are equal iff format and shape are equal and for all valid memcmp((char *)PyBuffer_GetPointer(v, indices), Equal formats of course imply equal itemsize. |
Can you please elaborate on the specification:
|
I'm using array and Py_buffer interchangeably since a Py_buffer struct So v.format must equal w.format, where format is a format string in
>>> a=array.array('u', 'abc')
>>> v=memoryview(a)
>>> a == v
False memoryview can compare against any object with a getbufferproc, in this In the case of v.format == w.format the fix for unknown formats is trivial: However, the struct module format string syntax has multiple representations Hence my proposal to demand a strict canonical form for PEP-3118 format Example: "1Q 1h 1h 0c" must be written as "Q2h" The attached patch should largely implement this proposal. A canonical form Another option is to commit the patch that misses "1Q 1h 1h 0c" == "Q2h" IMO any general fmtcmp() function should also be reasonably fast. |
And that is exactly my question: We don't need a patch implementing I know when two strings are equal (regardless of their syntax):
Can this be expressed on Python level as well? I.e. is it correct
Can you kindly phrase this as a specification? Who is demanding what Proposal: two format strings are equal if their canonical forms However, it appears that you may have something different in mind So another Proposal: two format strings are equal iff they are This would imply that a format string which is not in canonical |
Can't we start with something simple (for ptyhon 3.3?), and elaborate
|
Sure: someone would have to make a proposal what exactly that is. |
The ideal specification is:
End ideal specification. Purely to *facilitate* the implementation of a format comparison function,
If *all* exporters obey 3), a format comparison function can simply Specifically, if x and y are equal, then: a) x == memoryview(x) == memoryview(y) == y If x and y are equal and exporter x does *not* obey 3), but exporter y does, b) x == memoryview(x) != memoryview(y) == y Under rule 3) this would be the fault of exporter x. For Python 3.3 it is also possible to state only 1) and 2), with a caveat in The problem is that reductions to canonical forms might get very complicated |
Martin v. Loewis <report@bugs.python.org> wrote:
It was simply broken in multiple ways. Example: >>> from numpy import *
>>> x = array([1,2,3,4,5], dtype='B')
>>> y = array([5,4,3,2,1], dtype='B')
>>> z = y[::-1]
>>>
>>> x == z
array([ True, True, True, True, True], dtype=bool)
>>> memoryview(x) == memoryview(z)
False
Segmentation fault I'm not even talking about the segfault here. Note that x == z, but Likewise, one could construct cases where one array contains a float The arrays would not be equal, but their memoryviews would be equal.
The view is defined by the PEP that clearly models NumPy. I'm curious what |
I find Stefan's proposed equality confusing. Why is it based on memcmp? Either it compares memory (i.e. internal representations), or it compares abstract values. If it compares abstract values, then it shouldn't be a requirement that the format strings are equal in any sense. Instead, the resulting values should be equal. So I propose this definition: v == w iff v.shape() == w.shape() and v.tolist() == w.tolist() Of course, the implementation doesn't need to literally call tolist; instead, behaving as-if it had been called is fine. However, as time In addition, I would propose to support the 'u' and 'w' codes in tolist, to resolve what Victor says the actual issue is. I'm -1 on a definition that involves equivalence of format strings. |
Stefan's proposed definition is correct. Shapes define array types, format It's not "Stefan's definition of equivalence", it's what a statically typed The 3.2 way is completely broken, as it considers arrays containing 3.3 is currently also broken, as it considers arrays that *do* contain the Stefan's patch fixes that by checking the shape and format first, and The requirement for a canonical format is for sanity's sake: the general |
Nick: I still disagree. Would you agree that array.array constitutes a "statically typed array"? Yet py> array.array('b',b'foo') == array.array('B',b'foo') So the array object (rightfully) performs comparison on abstract values, not on memory representation. In Python, a statically typed array still conceptually contains abstract values, not memory blocks (this is also what Stefan asserts for NumPy in msg167862). The static typing only restricts the values you can store in the container, and defines the internal representation on the C level (plus it typically implies a value storage, instead of a reference storage). With your and Stefan's proposed semantics, we would get the weird case that for two array.arrays a and b, it might happen that a == b and memoryview(a) != memoryview(b) |
"it might *still* happen", I should say, since this problem is exactly what Victor says this issue intends to solve (for comparison of two 'u' arrays). |
So we have two competing proposals:
This is probably what I'd prefer on the C level, imagine a function like Backwards compatibility problem for users who were thinking in terms of >>> x = array.array('b', [127])
>>> y = array.array('B', [127])
>>> x == y
True
>>> memoryview(x) == memoryview(y)
False
>>> x = numpy.array([1, 2, 3], dtype='i')
>>> y = numpy.array([1, 2, 3], dtype='f')
>>> x == y
array([True, True, True], dtype=bool) I concede that this is probably what users want to see on the Python level. Backwards compatibility problem for users who were using complicated >>> from _testbuffer import *
>>> x = ndarray([(1,1), (2,2), (3,3)], shape=[3], format='hQ')
>>> x == memoryview(x)
False Reason: While _testbuffer.ndarray already implements tolist() in full >>> x.tolist()
[(1, 1), (2, 2), (3, 3)]
>>> memoryview(x).tolist()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NotImplementedError: memoryview: unsupported format hQ So while I'm beginning to like Martin's proposal, the implementation is It would be possible to keep a fast path for the primitive C types |
Here is a patch doing the comparison on abstract values if the formats are different. |
OK, I think I finally understand what Martin is getting at from a semantic point of view, and I think I can better explain the background of the issue and why Stefan's proposed solution is both necessary and correct. The ideal definition of equivalence for memory view objects would actually be: memoryview(x) == memoryview(y) if (and only if) memoryview(x).tolist() == memoryview(y).tolist() Now, in practice, this approach cannot be implemented, because there are too many format definitions (whether valid or invalid) that memoryview doesn't understand (and perhaps will never understand) and because it would be completely infeasible on large arrays with complex format definitions. Thus, we are forced to accept a *constraint* on memoryview's definition of equality: individual values are always compared via raw memory comparison, thus values stored using different *sizes* or *layouts* in memory will always compare as unequal, even if they would compare as equal in Python This is an *acceptable* constraint as, in practice, you don't perform mixed format arithmetic and it's not a problem if there's no automatic coercion between sizes and layouts. The Python 3.2 memoryview effectively uses memcmp() directly treating everything as a 1D array of bytes data, completely ignoring both shape *and* format data. Thus: >>> ab = array('b', [1, 2, 3])
>>> ai = array('i', [1, 2, 3])
>>> aL = array('L', [1, 2, 3])
>>> ab == ai
True
>>> ab == ai == aL
True
>>> memoryview(ab) == memoryview(ai)
False
>>> memoryview(ab) == memoryview(aL)
False
>>> memoryview(ai) == memoryview(aL)
False This approach leads to some major false positives, such as a floating point value comparing equal to an integer that happens to share the same binary representation: >>> af = array('f', [1.1])
>>> ai = array('i', [1066192077])
>>> af == ai
False
>>> memoryview(af) == memoryview(ai)
True The changes in 3.3 are aimed primarily at *eliminating those false positives* by taking into account the shape of the array and the format of the contained values. It is *not* about changing the fundamental constraint that memoryview operates at the level of raw memory, rather than Python objects, and thus cares about memory layout details that are irrelevant after passing through the Python abstraction layer. This contrasts with the more limited scope of the array.array module, which *does* take into account the Python level abstractions. Thus, there will always be a discrepancy between the two definitions of equality, as memoryview cares about memory layout details, where array.array does not. The problem at the moment is that Python 3.3 currently has *spurious* false negatives that aren't caused by that fundamental constraint that comparisons must occur based directly on memory contents. Instead, they're being caused by memoryview returning False for any equality comparison for a format it doesn't understand. That's unacceptable, and is what Stefan's patch is intended to fix. |
Short version: However, the format constraints are currently too restrictive, so some memoryview comparisons that correctly returned True in 3.2 are now also returning False. This patch fixes *those* comparisons to once again return True. Moving back to deferred blocked status, since the spurious false negatives are a regression from 3.2. It's fine for beta2 to ship with this problem, but it needs to be fixed for rc1. |
Yeah, as soon as I got your email I realised I'd only been searching for usage *in the patch* rather than the full file. Oops :( More comments in the review. |
Here is a new patch that hopefully addresses all comments. |
There is still one corner case involving NaNs: Released memoryviews always >>> import array
>>> a = array.array('d', [float('nan')])
>>> m = memoryview(a)
>>> m == m
False
>>> m.release()
>>> m == m
True I guess we have to live with that, since it is of course impossible to access |
I think bpo-15573-struct-2.diff is ready to go and I'd rather commit |
Small nit: when you put "versionchanged:: 3.3" there should be a hint of *what* changed exactly :) |
Right. I'm tracking all "versionchanged" issues in bpo-15724. |
Is this still blocking the release? If so, it should be resolved within the next twelve hours, or else it may block the release until September. |
I'm not very keen to hold up the release for long times again, especially for a patch of this size and lots of potential breakage. |
Even if I would prefer to see a fully working new implementation of the It also took me three major releases to fix all Unicode issues (and it's
|
The effect of the change is pretty minimal though: Previously memoryobject.c has 100% coverage except for a couple of lines The new code is among the most heavily exercised lines, since You can look at the attached memoryobject.c.gcov: *All* failure paths |
With Stefan's major improvements to the test suite for 3.3, the risk is low and I *really* don't want to spend the life of the 3.3 series excusing the current comparison behaviour. By contrast, explaining the shift in definition as moving from raw memory comparisons to by value comparisons is relatively easy. |
Well, I'm not against you committing it as long as you commit it *right now*. I'm about to cut the tag. |
That should read: With a patch that wraps Python API functions ... Obviously the macros aren't in by default. |
Great, give me 20 minutes or so. |
I'm currently on IRC and double checking the patch locally. |
Stefan, was there something specific you wanted to do before committing? Otherwise I can commit it as soon as this full test run finishes. |
Except for Misc/NEWS the patch can be committed unchanged. Please [The remaining doc changes are tracked elsewhere.] |
New changeset afa3dedfee18 by Nick Coghlan in branch 'default': |
We overlooked one thing. Since hashing is defined in terms of >>> from _testbuffer import ndarray
>>> x = ndarray([1,2,3], shape=[3], format='f')
>>> y = ndarray([1,2,3], shape=[3], format='B')
>>> a = memoryview(x)
>>> b = memoryview(y)
>>> a == b
True
>>> hash(a) == hash(b)
False
>>> This is one problem with the new equality definition. It puts Both array.array and numpy.array sidestep the issue by not being I don't really see a way around all this except doing slow |
Perhaps the hash could just be based on a subset of the items? For example, hash the format, shape and the first and last items? Yes, such an approach means you're more likely to get collisions if you do start hashing these, but that seems better than making hashing itself excessively slow. |
I'm trying to think of an optimization for the native types. It should Well, for double probably it's required to go from double -> int64_t -> For non-native types and compound types, Nick's suggestion of Should we do anything about this before 3.3.0? |
|
I agree that this isn't a blocker due to the limited use of |
New changeset 969069d464bc by Stefan Krah in branch '3.3': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: