classification
Title: Inconsistent hash and comparison for code objects
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: akuchling Nosy List: Jeremy.Hylton, akuchling, brett.cannon, eltoder, eric.araujo, math_foo, ncoghlan, pitrou, python-dev, rhettinger
Priority: normal Keywords: patch

Created on 2011-05-03 02:11 by eltoder, last changed 2014-04-14 18:21 by akuchling. This issue is now closed.

Files
File name Uploaded Description Edit
comment11983.patch math_foo, 2014-04-14 17:40 review
Messages (14)
msg135015 - (view) Author: Eugene Toder (eltoder) * Date: 2011-05-03 02:11
A comment in the definition of PyCodeObject in Include/code.h says:

/* The rest doesn't count for hash or comparisons */

which, I think, makes a lot of sense. The implementation doesn't follow this comment, though. code_hash actually includes co_name and code_richcompare includes co_name and co_firstlineno. This makes hash and comparison inconsistent with each other and with the comment.
msg135321 - (view) Author: √Čric Araujo (eric.araujo) * (Python committer) Date: 2011-05-06 16:41
What fix would you propose?
msg135395 - (view) Author: Eugene Toder (eltoder) * Date: 2011-05-07 01:07
I would propose changing implementation to match the comment. At a minimum, remove co_firstlineno comparison. As the last resort, at least change the comment.
msg135541 - (view) Author: Eugene Toder (eltoder) * Date: 2011-05-08 21:13
It appears that

* co_name was added to hash and cmp in this check-in by Guido:
http://hg.python.org/cpython-fullhistory/diff/525b2358721e/Python/compile.c
I think the reason was to preserve function name when defining multiple functions with the same code in one function or module. (By default function gets it's name from code, but only one code object will be preserved, since all constants in a function are stored in a dict during compilation).

* co_firstlineno was added to cmp (but not hash) in this check-in by Brett:
http://hg.python.org/cpython-fullhistory/rev/8127a55a57cb
In an attempt to fix this bug: http://www.mail-archive.com/python-bugs-list@python.org/msg02440.html
It doesn't actually fix the bug and makes hash inconsistent with cmp. I'm not convinced that the bug is valid -- why would you care if identical lambdas share or not share the code?


Both changes seem come from a tension between code object's original intention to compare "functionally equivalent" codes equal and side-effects of constants de-duplication in a function (loss of function name, broken breakpoints and line numbers in a debugger).
I can think of 2 ways to address this. Change hash/cmp back to ignore co_name and co_firstlineno and then:

1) Never dedup codes, or only dedup codes with the same co_firstlineno (can compare co_name too, but I can't think of a way to create 2 named funcs on the same line). This is pretty much what the current code does.

2) Separate "debug information" (co_filename, co_name, co_firstlineno, co_lnotab) from code object into a separate object. Construct a function from both objects. Allow de-duplication of both. This will always preserve all information in a function, but allows to share code object between identical functions. This is a much more intrusive change, though, e.g. frame will need a reference to debug information.
msg135542 - (view) Author: Eugene Toder (eltoder) * Date: 2011-05-08 21:23
Btw, disabling dedup for codes won't be the first exception -- we already avoid coalescing -0.0 with 0.0 for float and complex, even though they compare equal.
msg180100 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-01-16 18:57
> It doesn't actually fix the bug and makes hash inconsistent with cmp.

The only constraint is that a == b imply hash(a) == hash(b). But the converse doesn't have to be true, i.e. if it perfectly possible to have hash(a) == hash(b) and a != b (pretty much by definition of a hash function, actually).

What Brett's change means is that the hash function isn't as fine-grained as it could be, but I'm not sure it's very important for code objects.

Of course, that Brett's commit doesn't fix the original bug (if confirmed) is a more annoying issue :-)
msg180101 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-01-16 18:58
Woops, I saw Brett unnosied himself, sorry for nosying him by mistake.
msg180106 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2013-01-16 20:20
Making as pending for someone to prove my fix is bad for the problem (since it passes an explicit test), else I will close it in the near-ish future.
msg180117 - (view) Author: Eugene Toder (eltoder) * Date: 2013-01-17 03:41
My comment will make more sense if you follow the links that I provided.

Brett's check-in (http://hg.python.org/cpython-fullhistory/rev/8127a55a57cb) says that it fixes bug #1190011 (http://www.mail-archive.com/python-bugs-list@python.org/msg02440.html). The bug claims that:

   >> def f(): return ((lambda x=1: x), (lambda x=2: x))
   >> f1, f2 = f()
   >> id(f1.func_code) != id(f2.func_code)

   The above does not hold true.  It should according to
   test_compile (reported in HEAD as bug #1048870).

Certainly comparing co_firstlineno can't fix this, as both lambdas are defined at the same line. There are some more comments by Nick Coghlan on that bug, but I can't find anything clearly stating what the problem is believed to be and what the resolution is.

I understand the reason for Brett's fix and why fixing 1190011 literally is not required. The issue was strange debugger/profiler/etc behavior with identical lambdas defined on _different_ lines. Since python tools typically have line resolution (e.g. debugger doesn't know column position of the current statement), merging lambdas defined on the same line is mostly invisible, and does not cause problems.

So to clarify my opening comment -- this is mostly a documentation issue. The only documentation for the behavior of code cmp is the comment in Include/code.h, which says:

   /* The rest doesn't count for hash or comparisons */

Both hash and compare break this in different ways. At a minimum the comment should be fixed. If the current behavior is considered right, this can be it.

My other point is that we can have code objects with different co_firstlineno compare equal and still not have them deduplicated by the compiler (thus avoid problems with debugger). This can be done similar to how float -0.0 is equal to 0.0, but still not deduplicated. Separating these concerns can allow both most logical compare behavior and good support for debugging.

Re hash legally returning same values for non-equal objects -- sure, only a==b implies hash(a)==hash(b), and not the other way round. But this doesn't mean that we should just return 42 from all hash functions. Not including a field in hash that you include compare needs a very good reason. If there's such a reason in this case, please let me know.
msg180159 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2013-01-17 22:38
I have no issue with the current behaviour, so it sounds like the source comment just needs updating. Care to suggest some wording, Eugene?
msg180249 - (view) Author: Eugene Toder (eltoder) * Date: 2013-01-19 18:35
If you add co_firstlineno into code_hash you can say something like

/* The rest isn't used in hash and comparisons, except
   co_name and co_firstlineno, which are preserved for
   tracebacks and debuggers. */

(Otherwise you'd need to explain why co_firstlineno is not in hash.)

BTW, since co_firstlineno is used including co_name is redundant -- there's no way to have 2 named code objects on the same line, AFAIK.
msg216132 - (view) Author: Caelyn McAulay (math_foo) Date: 2014-04-14 17:40
Here is a patch to add the requested documentation to code.h - I expanded it to specify (as per the conversation in this issue) that co_name is used in both hash and comparisons while co_firstlineno is used only in comparisons. 

I do not attempt explain the inconsistency; but let the comment reflect reality.
msg216151 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-04-14 18:20
New changeset 0e35cef1d984 by Andrew Kuchling in branch 'default':
#11983: update comment to describe which fields are used and why.
http://hg.python.org/cpython/rev/0e35cef1d984
msg216152 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2014-04-14 18:21
Thanks for the patch!
History
Date User Action Args
2014-04-14 18:21:26akuchlingsetstatus: open -> closed

assignee: akuchling
versions: + Python 3.5, - Python 3.3
nosy: + akuchling

messages: + msg216152
resolution: fixed
stage: resolved
2014-04-14 18:20:15python-devsetnosy: + python-dev
messages: + msg216151
2014-04-14 17:40:58math_foosetfiles: + comment11983.patch

nosy: + math_foo
messages: + msg216132

keywords: + patch
2013-01-19 18:35:27eltodersetmessages: + msg180249
2013-01-17 22:38:43brett.cannonsetmessages: + msg180159
2013-01-17 03:41:45eltodersetstatus: pending -> open

messages: + msg180117
2013-01-16 20:20:30brett.cannonsetstatus: open -> pending
nosy: + brett.cannon
messages: + msg180106

2013-01-16 18:58:01pitrousetnosy: - brett.cannon
messages: + msg180101
2013-01-16 18:57:27pitrousetnosy: + pitrou, brett.cannon
messages: + msg180100
2013-01-16 18:51:04brett.cannonsetnosy: - brett.cannon
2011-05-08 21:23:36eltodersetmessages: + msg135542
2011-05-08 21:13:12eltodersetnosy: + brett.cannon, ncoghlan, Jeremy.Hylton
messages: + msg135541
2011-05-07 01:07:18eltodersetmessages: + msg135395
2011-05-06 16:41:51eric.araujosetnosy: + eric.araujo
messages: + msg135321
2011-05-03 05:17:30rhettingersetnosy: + rhettinger
2011-05-03 02:11:21eltodercreate