Message 404821 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Stefan Pochmann
Recipients	Stefan Pochmann, rhettinger, tim.peters
Date	2021-10-22.20:27:34
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1634934455.04.0.877462139535.issue45530@roundup.psfhosted.org>
In-reply-to

Content
> It's not surprising the 2nd positions are rarely equal _given_ that the universe has been reduced to the 100 tuples with the same first element. Yes, exactly. > In any case, I don't see the point to this exercise ;-) Just to illustrate why I disagree with what I responded to there. I'm not convinced that many matches at the first tuple position are really an indication that there'll be many matches at the second position as well, so that one should "admit defeat" and that it's "arrogance" to hope for only few matches at the second position. I'd wager it's rather typical that the matching rate at the second position is significantly lower. If you sort people by last and then first name, I think you'd get the same effect as with my above random example, for the same reason. If you sort a set of 2D coordinates, you'll even never get matches at the second position. Even with that SO question's data and its massive duplication (especially at the second position) and its correlation between first and second position (since both "sorted(x)" and "x[::2]" are derived from the same x) and its 43% matching rate at the first position, there's still only 27% matching rate at the second position (lower than the 1/3 break-even point). > BTW, don't overlook that tuple_elem_compare can _only_ be used on the very first elements of the tuples. Yes, that's why I only talked about PyObject_RichCompareBool there. > > 1) Sort the list only by first position (comparing with only > > one tuple_elem_compare). > Not sure what that means, exactly. (the "with only one tuple_elem_compare" part; Just wanted to make clear it wouldn't used two tuple_elem_compare or a PyObject_RichCompareBool. > > 2) Identify equal-first-position-runs (with tuple_elem_compare) > > and sort each run independently (comparing only positions 1+). > What are you aiming at? There was no motivation given here. Much fewer comparisons, especially PyObject_RichCompareBool comparisons. For example for sorting a million pairs with first/second element chosen from a population of 100,000, I get these numbers of comparisons (per element): current groupsort ------------------------------ == first 18.60 none (PyObject_RichCompareBool) < first 16.19 19.60 (tuple_elem_compare) == second 2.41 2.36 (PyObject_RichCompareBool) < second 2.41 2.36 (PyObject_RichCompareBool) ------------------------------ total slow 23.42 4.72 (PyObject_RichCompareBool) total fast 16.19 19.60 (tuple_elem_compare) ------------------------------ total 39.61 24.32 (With "current" I mean before your optimizations, which I can't test yet.) > Will, this issue report is about unsafe_tuple_compare(), so, ya, if the changes you envision aren't limited to that, I'd say you want to open a different issue report and a different PR :-) Ok, I might try that.

> It's not surprising the 2nd positions are rarely equal _given_ that the universe has been reduced to the 100 tuples with the same first element.

Yes, exactly.

> In any case, I don't see the point to this exercise ;-)

Just to illustrate why I disagree with what I responded to there. I'm not convinced that many matches at the first tuple position are really an indication that there'll be many matches at the second position as well, so that one should "admit defeat" and that it's "arrogance" to hope for only few matches at the second position. I'd wager it's rather typical that the matching rate at the second position is significantly lower. If you sort people by last and then first name, I think you'd get the same effect as with my above random example, for the same reason. If you sort a set of 2D coordinates, you'll even *never* get matches at the second position. Even with that SO question's data and its massive duplication (especially at the second position) *and* its correlation between first and second position (since both "sorted(x)" and "x[::2]" are derived from the same x) and its 43% matching rate at the first position, there's still only 27% matching rate at the second position (lower than the 1/3 break-even point).

> BTW, don't overlook that tuple_elem_compare can _only_ be used on the very first elements of the tuples.

Yes, that's why I only talked about PyObject_RichCompareBool there.

> > 1) Sort the list only by first position (comparing with only
> >    one tuple_elem_compare).

> Not sure what that means, exactly. (the "with only one tuple_elem_compare" part;

Just wanted to make clear it wouldn't used two tuple_elem_compare or a PyObject_RichCompareBool.

> > 2) Identify equal-first-position-runs (with tuple_elem_compare)
> > and sort each run independently (comparing only positions 1+).

> What are you aiming at? There was no motivation given here.

Much fewer comparisons, especially PyObject_RichCompareBool comparisons. For example for sorting a million pairs with first/second element chosen from a population of 100,000, I get these numbers of comparisons (per element):

            current  groupsort
------------------------------
== first     18.60      none        (PyObject_RichCompareBool)
 < first     16.19     19.60        (tuple_elem_compare)
== second     2.41      2.36        (PyObject_RichCompareBool)
 < second     2.41      2.36        (PyObject_RichCompareBool)
------------------------------
total slow   23.42      4.72        (PyObject_RichCompareBool)
total fast   16.19     19.60        (tuple_elem_compare)
------------------------------
total        39.61     24.32

(With "current" I mean before your optimizations, which I can't test yet.)

> Will, this issue report is about unsafe_tuple_compare(), so, ya, if the changes you envision aren't limited to that, I'd say you want to open a different issue report and a different PR :-)

Ok, I might try that.

History
Date	User	Action	Args
2021-10-22 20:27:35	Stefan Pochmann	set	recipients: + Stefan Pochmann, tim.peters, rhettinger
2021-10-22 20:27:35	Stefan Pochmann	set	messageid: <1634934455.04.0.877462139535.issue45530@roundup.psfhosted.org>
2021-10-22 20:27:35	Stefan Pochmann	link	issue45530 messages
2021-10-22 20:27:34	Stefan Pochmann	create