Message 404715 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	Stefan Pochmann, rhettinger, tim.peters
Date	2021-10-22.01:24:10
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1634865851.58.0.742587001536.issue45530@roundup.psfhosted.org>
In-reply-to

Content
> I see you mentioned that PyObject_RichCompareBool(..., Py_EQ) might be > faster for example because it checks identity. For example, in tupsort.py replace xs = [random() for _ in range(length)] with xs = ['z' * 100 for _ in range(length)] Then sorting that directly is _slower_ than wrapping each string in a 1-tuple first. That's so today, and also so in the latest version of the PR (but not in the original PR code). > Why not do an identity check before the ms->tuple_elem_compare calls? Too > expensive and rarely successful? It's a cheap test, but, ya, "rarely successful" outside of special cases. PyObject_RichCompareBool() already does that special-casing now, and the PR now adjusts to use PyObject_RichCompareBool(Py_EQ) instead when the pair of ms->tuple_elem_compare calls isn't paying off. It's enough that one part of the chain rarely pays off, but gets nearly all the potential benefit when it does pay off. Duplicating the special-casing would double the overhead with scant additional payoff when it pays. > I sorted a million tuples where both first and second element > are randomly chosen from a population of 10,000. So you expect about 100 repetitions of each value in each tuple position. > So their amounts of duplication were the same. But these are the > statistics from sorting: > - First position: 18,603,981 equality comparisons, 29.87% equal > - Second position: 5,556,203 equality comparisons, 0.09% equal Well, for any given fixed value in the first tuple position, you expect about 100 tuples total to have that value in the first position, and the second position in those contains a random sample (with repetition) of 100 elements out of 10,000. It's not surprising the 2nd positions are rarely equal _given_ that the universe has been reduced to the 100 tuples with the same first element. In any case, I don't see the point to this exercise ;-) BTW, don't overlook that tuple_elem_compare can _only_ be used on the very first elements of the tuples. The pre-scan developed no idea whatsoever of the types of tuple elements other than the first. > One more idea: What do you think of sorting lists of tuples > (or tuple keys) in two stages? Primarily that I'm not looking for a term project here - I'm looking to get an easy win from simple changes to one function. What's there in the PR now is designed to play well with how sorting works. When there are many duplicates, a merge will find about 7 comparison attempts in a row where the first elements are equal, and the code adjusts to use PyObject_RichCompareBool(Py_EQ) for at least all but the first compare. Galloping mode will continue with that for a brief time at the start, but probably soon hit a chain of cases all of which can be resolved solely with tuple_elem_compare calls, and the code adjusts to that too, never using PyObject_RichCompareBool(Py_EQ) again after the first time it sees that return "no, not equal". OTOH, when there are no duplicates, PyObject_RichCompareBool(Py_EQ) isn't helpful, and in fact will never be called. > 1) Sort the list only by first position (comparing with only one > tuple_elem_compare). Not sure what that means, exactly. (the "with only one tuple_elem_compare" part; I know what it means to sort by the first position) > 2) Identify equal-first-position-runs (with tuple_elem_compare) and > sort each run independently (comparing only positions 1+). What are you aiming at? There was no motivation given here. > Some experiments I did with this looked good, not sure if too > off-topic to post here... Will, this issue report is about unsafe_tuple_compare(), so, ya, if the changes you envision aren't limited to that, I'd say you want to open a different issue report and a different PR :-)

> I see you mentioned that PyObject_RichCompareBool(..., Py_EQ) might be
> faster for example because it checks identity.

For example, in tupsort.py replace

    xs = [random() for _ in range(length)]

with

    xs = ['z' * 100 for _ in range(length)]

Then sorting that directly is _slower_ than wrapping each string in a 1-tuple first. That's so today, and also so in the latest version of the PR (but not in the original PR code).

> Why not do an identity check before the ms->tuple_elem_compare calls? Too
> expensive and rarely successful?

It's a cheap test, but, ya, "rarely successful" outside of special cases. PyObject_RichCompareBool() already does that special-casing now, and the PR now adjusts to use PyObject_RichCompareBool(Py_EQ) instead when the pair of ms->tuple_elem_compare calls isn't paying off. It's enough that one part of the chain rarely pays off, but gets nearly all the potential benefit when it does pay off. Duplicating the special-casing would double the overhead with scant additional payoff when it pays.

> I sorted a million tuples where both first and second element
> are randomly chosen from a population of 10,000.

So you expect about 100 repetitions of each value in each tuple position.

> So their amounts of duplication were the same. But these are the
> statistics from sorting:
> - First position:   18,603,981 equality comparisons, 29.87% equal
> - Second position:   5,556,203 equality comparisons,  0.09% equal

Well, for any given fixed value in the first tuple position, you expect about 100 tuples total to have that value in the first position, and the second position in those contains a random sample (with repetition) of 100 elements out of 10,000. It's not surprising the 2nd positions are rarely equal _given_ that the universe has been reduced to the 100 tuples with the same first element.

In any case, I don't see the point to this exercise ;-)

BTW, don't overlook that tuple_elem_compare can _only_ be used on the very first elements of the tuples. The pre-scan developed no idea whatsoever of the types of tuple elements other than the first.


> One more idea: What do you think of sorting lists of tuples
> (or tuple keys) in two stages?

Primarily that I'm not looking for a term project here - I'm looking to get an easy win from simple changes to one function.

What's there in the PR now is designed to play well with how sorting works. When there are many duplicates, a merge will find about 7 comparison attempts in a row where the first elements are equal, and the code adjusts to use PyObject_RichCompareBool(Py_EQ) for at least all but the first compare. Galloping mode will continue with that for a brief time at the start, but probably soon hit a chain of cases all of which can be resolved solely with tuple_elem_compare calls, and the code adjusts to that too, never using PyObject_RichCompareBool(Py_EQ) again after the first time it sees that return "no, not equal".

OTOH, when there are no duplicates, PyObject_RichCompareBool(Py_EQ) isn't helpful, and in fact will never be called.

> 1) Sort the list only by first position (comparing with only one
>    tuple_elem_compare).

Not sure what that means, exactly. (the "with only one tuple_elem_compare" part; I know what it means to sort by the first position)

> 2) Identify equal-first-position-runs (with tuple_elem_compare) and
> sort each run independently (comparing only positions 1+).

What are you aiming at? There was no motivation given here.

> Some experiments I did with this looked good, not sure if too
> off-topic to post here...

Will, this issue report is about unsafe_tuple_compare(), so, ya, if the changes you envision aren't limited to that, I'd say you want to open a different issue report and a different PR :-)

History
Date	User	Action	Args
2021-10-22 01:24:11	tim.peters	set	recipients: + tim.peters, rhettinger, Stefan Pochmann
2021-10-22 01:24:11	tim.peters	set	messageid: <1634865851.58.0.742587001536.issue45530@roundup.psfhosted.org>
2021-10-22 01:24:11	tim.peters	link	issue45530 messages
2021-10-22 01:24:10	tim.peters	create