classification
Title: The set implementation should special-case PyUnicode instead of PyString
Type: Stage:
Components: Versions: Python 3.0
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: christian.heimes Nosy List: christian.heimes, gvanrossum, rhettinger
Priority: normal Keywords:

Created on 2007-12-06 21:19 by gvanrossum, last changed 2008-01-06 22:29 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
py3k_optimize_set_unicode2.patch christian.heimes, 2007-12-09 20:51
py3k_optimize_set_unicode3.patch christian.heimes, 2007-12-10 01:05
Messages (14)
msg58256 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-12-06 21:34
Much of the code in setobject.c exactly parallels that in dictobject.c.
 Ideally, we should keep that parallelism by scanning all of the changes
to dictobject.c and applying substantially similar changes to
setobject.c (just the changes that touch the hash tables, not the API
changes).
msg58257 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-06 22:10
Would it make any sense at all to refactor the code so that code reuse
is automatic?
msg58259 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-12-07 01:30
Not really, the implementations are different enough that it would be
*really* hard to keep common code.  The two parallel each other in a way
that is visually easy to translate but hard to do through real
refactoring.  For the most part, both code bases have historically been
very stable, so double maintenance hasn't been much of an issue.  My
last post was meant to suggest the cleanest way to do the Py3.0
string-->unicode switch modeling the set changes after those in dicts.  

FWIW, the reason that I put effort in to keeping the code as parallel as
possible was that it would save us mental clock cycles -- understanding
one implementation gives you most of the other for free.
msg58260 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-12-07 01:31
P.S. There is a way to factor-out some common code but it would entail
introducing a bunch of macros that hide the differences between the two.
 I don't think it would be worth it.
msg58311 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-12-09 01:05
I've created a patch that adds optimization for PyUnicode while keeping
the existing optimization for PyString. The patch moves the optimization
trick for PyObject_Hash() into a macro and adds an optimized
_PyUnicode_Eq() to unicodeobject.c
msg58318 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-12-09 15:25
I fixed a bug in the last patch. It now works mixed sets with str and
bytes but it doesn't optimize bytes in set_lookup() any more.
msg58320 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-12-09 17:15
The patch doesn't parallel what was done for dicts.  The code in 
dictobject.c does not use a macro.  It special cases for PyUnicode but 
not PyString.  Please submit a patch that mirrors what was done for 
dicts.
msg58328 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-12-09 20:51
Updates:

* Moved dictobject.c:unicode_eq() to unicodeobject.c:_PyUnicode_Eq()
* Added another optimization step to _PyUnicode_Eq(). The hash is
required later anyway and comparing two hashes is much faster than
memcmp-ing the unicode objects.
    if (unicode_hash(a) != unicode_hash(b))
        return 0;
* Factored out the ((PyUnicodeObject *) v)->hash optimization into a
function object.c:_PyObject_HashFast() which does the trick for
PyUnicode and PyString. The trick was used a couple of times in
dictobject.c and setobject.c. We may even think about moving the trick
to PyObject_Hash() directly.
msg58331 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-12-09 22:39
Which is the common case in Py3k, to have strings or unicode?  By 
trying to catch both, you slow down the optimization.  Also, the 
new "hash_fast" introduces function call overhead in previously in-
lined code.

My preference is to knock-out the optimization or leave in the way it 
is.  Remember, strings already cache their hash codes and that 
optimization is being short-circuited here.

For sets, I prefer the code to be left as it is.  For dicts, whatever 
you do, don't slow-down the common case of regular attribute lookup.  
This is some of the most time critical code in the language.  Trying to 
generalize the optimization, make actually make it slower.

If unicode strings are the norm and an not PyString_Objects, then 
switch to that one, but I don't think you should try to do both.
msg58333 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-12-10 00:48
Raymond Hettinger wrote:
> Which is the common case in Py3k, to have strings or unicode?  By 
> trying to catch both, you slow down the optimization.  Also, the 
> new "hash_fast" introduces function call overhead in previously in-
> lined code.

PyUnicode is the common case for dict keys in py3k. Attributes etc. are
all PyUnicode instances.

> For sets, I prefer the code to be left as it is.  For dicts, whatever 
> you do, don't slow-down the common case of regular attribute lookup.  
> This is some of the most time critical code in the language.  Trying to 
> generalize the optimization, make actually make it slower.

I wonder how I should store unicode_eq. It's required for dict and set
optimization and I like to keep it static inline. But I can't access
unicode_eq in setobject.c when it is declared as static in dictobject.c.

I could either declare unicode_eq as extern or I could put the method
into a header file which is included by both c files.

> If unicode strings are the norm and an not PyString_Objects, then 
> switch to that one, but I don't think you should try to do both.

Understood!

Christian
msg58335 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-12-10 01:42
Okay, then simply go back to the unaltered existing code and replace 
references to PyString with PyUnicode. Skip all the macros, new 
functions, factorings, etc.  Just change strings to unicode and be done 
with it.  IIRC, that was what was done for dicts.  Just do the same for 
sets and not try to get tricky with a new, misnamed C API function like 
PyObject_Hash().
msg58337 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-12-10 02:05
The latest patch does *NOT* add new macros, functions or other stuff. I
simply replaced PyString_* with PyUnicode_* in setobject.c where
appropriate.

The only function I had to factor out is unicode_eq(). It's now in a new
file stringlib/eq.h which is included by setobject.c and dictobject.c.
It's the only way to share the function while keeping it static and
maybe inline.
msg58346 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-12-10 15:11
I had looked at the version 2 instead of version 3.

Version 3 is much closer.  A couple of comments.  Don't change the 
brace opening/closing convention in the file -- stick with the K&R 
style -- mixing two different styles makes the file harder to read and 
maintain.

Also, do not introduce string optimization in new places.  If it wasn't 
there before, it was likely left out for a reason (lines 1231 and 1352 
for example).
msg58351 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-12-10 15:52
I've done as you said and committed the changes in r59449.

Next time I won't try to add optimizations without consulting you in the
first place. :] Thanks for your advice.
History
Date User Action Args
2008-01-06 22:29:44adminsetkeywords: - py3k
versions: Python 3.0
2007-12-10 15:52:35christian.heimessetstatus: open -> closed
resolution: fixed
messages: + msg58351
2007-12-10 15:11:03rhettingersetmessages: + msg58346
2007-12-10 02:05:14christian.heimessetmessages: + msg58337
2007-12-10 01:42:32rhettingersetmessages: + msg58335
2007-12-10 01:05:50christian.heimessetfiles: + py3k_optimize_set_unicode3.patch
2007-12-10 00:48:22christian.heimessetmessages: + msg58333
2007-12-09 22:39:11rhettingersetmessages: + msg58331
2007-12-09 20:51:48christian.heimessetfiles: - py3k_optimize_set_unicode.patch
2007-12-09 20:51:42christian.heimessetfiles: + py3k_optimize_set_unicode2.patch
messages: + msg58328
2007-12-09 17:15:33rhettingersetmessages: + msg58320
2007-12-09 15:25:56christian.heimessetfiles: + py3k_optimize_set_unicode.patch
messages: + msg58318
2007-12-09 15:23:53christian.heimessetfiles: - py3k_optimize_set_unicode.patch
2007-12-09 01:05:57christian.heimessetpriority: normal
assignee: christian.heimes
messages: + msg58311
files: + py3k_optimize_set_unicode.patch
nosy: + christian.heimes
2007-12-07 01:31:35rhettingersetmessages: + msg58260
2007-12-07 01:30:01rhettingersetmessages: + msg58259
2007-12-06 22:10:44gvanrossumsetmessages: + msg58257
2007-12-06 21:34:07rhettingersetnosy: + rhettinger
messages: + msg58256
2007-12-06 21:19:11gvanrossumcreate