Author dmalcolm
Recipients Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, eric.snow, fx5, georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, haypo, jcea, lemburg, loewis, mark.dickinson, merwok, neologix, pitrou, skrah, terry.reedy, tim.peters, v+python, zbysz
Date 2012-01-26.22:13:18
SpamBayes Score 1.66533e-16
Marked as misclassified No
Message-id <1327615964.2219.35.camel@surprise>
In-reply-to <CAFRnB2UAcLjHzZ7NJDnzHAU-Gw9urb5Oc5CF==WPKfQEd8q=Lw@mail.gmail.com>
Content
On Thu, 2012-01-26 at 21:04 +0000, Alex Gaynor wrote:
> Alex Gaynor <alex.gaynor@gmail.com> added the comment:
> 
> On Thu, Jan 26, 2012 at 4:00 PM, Martin v. Löwis <report@bugs.python.org>wrote:
> 
> >
> > Martin v. Löwis <martin@v.loewis.de> added the comment:
> >
> > I'd like to propose an entirely different approach: use AVL trees for
> > colliding strings, for dictionaries containing only unicode or byte strings.
> >
> > A prototype for this is in
> > http://hg.python.org/sandbox/loewis/branches#avl
> > It is not fully working yet, but I'm now confident that this is a feasible
> > approach.
> >
> > It has the following advantages over the alternatives:
> > - performance in case of collisions is O(log(N)), where N is the number of
> > colliding keys
> > - no new exceptions are raised, except for MemoryError if it runs out of
> > memory for allocating nodes in the tree
> > - the hash values do not change
> > - the dictionary order does not change as long as no keys collide on hash
> > values (which for all practical purposes should mean that the dictionary
> > order does not change in all places where it matters)
> >
> > ----------
> > nosy: +loewis
> >
> > _______________________________________
> > Python tracker <report@bugs.python.org>
> > <http://bugs.python.org/issue13703>
> > _______________________________________
> >
> 
> Martin,
> 
> What happens if, instead of putting strings in a dictionary directly, I
> have them wrapped in something.  For example, the classes Antoine and I
> pasted early.  These define hash and equal as being strings, but don't have
> an ordering.

[Obviously I'm not Martin, but his idea really interests me]

Looking at:
http://hg.python.org/sandbox/loewis/file/58be269aa0b1/Objects/dictobject.c#l517
as soon as any key insertion or lookup occurs where the key isn't
exactly one of the correct types, the dict flattens any AVL trees back
into the regular flat representation (and switches to lookdict for
ma_lookup), analogous to the existing ma_lookup transition on dicts.

From my reading of the code, if you have a dict purely of bytes/str,
collisions on a hash value lead to the PyDictEntry's me_key being set to
an AVL tree.  All users of the ma_lookup callback within dictobject.c
check to see if they're getting such PyDictEntry back.  If they are,
they call into the tree, which leads to TREE_FIND(), TREE_INSERT() and
TREE_DELETE() invocations as appropriate; ultimately, the AVL macros
call back to within node_cmp():
   PyObject_Compare(left->key, right->key)

[Martin, I'm sorry if I got this wrong]

So *if* I'm reading the code correctly, it might be possible to
generalize it from {str, bytes} to any set of types within which
ordering and equality checking of instances from any type is "sane",
which loosely, would seem to be: that we can reliably compare all
objects from any type within the set, so that we can use the comparisons
to perform a search to hone in on a pair of keys that compare as
"equal", without any chance of raising exceptions, or missing a valid
chance for two objects to be equal etc.

I suspect that you can't plug arbitrary user-defined types into it,
since there's no way to guarantee that ordering and comparison work in
the ways that the AVL lookup requires.

But I could be misreading Martin's code.  [thinking aloud, if a pair of
objects don't implement comparison at the PyObject_Compare level, is it
possible to instead simply compare the addresses of the objects?  I
don't think so, since you have a custom equality implementation in your
UDT, but maybe I've missed something?]

Going higher-level, I feel that there are plenty of attacks against
pure-str/bytes dicts, and having protection against them is worthwhile -
even if there's no direct way to use it to protect the use-case you
describe.

Hope this is helpful
Dave
History
Date User Action Args
2012-01-26 22:13:20dmalcolmsetrecipients: + dmalcolm, lemburg, gvanrossum, tim.peters, loewis, barry, georg.brandl, terry.reedy, gregory.p.smith, jcea, mark.dickinson, pitrou, haypo, christian.heimes, benjamin.peterson, merwok, grahamd, Arfrever, v+python, alex, zbysz, skrah, gz, neologix, Arach, Mark.Shannon, eric.snow, Zhiping.Deng, Huzaifa.Sidhpurwala, Jim.Jewett, PaulMcMillan, fx5
2012-01-26 22:13:19dmalcolmlinkissue13703 messages
2012-01-26 22:13:18dmalcolmcreate