
Author tim.peters
Recipients Stefan Pochmann, rhettinger, tim.peters
Date 2021-10-25.03:21:51
Message-id <1635132112.06.0.896533266164.issue45530@roundup.psfhosted.org>
In-reply-to
Content
To be concrete, here's an implementation of a full-blown, stable lexicographic sort based on "bucket refinement". `xs` is a list of sequences to be sorted in lexicographic order. The types of the sequences don't matter (lists, tuples, strings, ...). Indeed, since list elements are never compared against each other directly, they don't even have to be the same sequence type.

This is already faster in pure Python than list.sort() for cases like:

    xs = [random.choices(range(100000), k=random.randrange(5, 30))
          for i in range(1000000)]

However, for cases like strings of the form

    'x' * 10_000 + str(i)

it's very much slower than list.sort(), even though it "should be" very much faster. That appears to be mostly due to these two lines:

            t.sort(key=lambda x: x[k])
            xs[lo : hi] = t

list.sort() doesn't accept `lo` and `hi` arguments, so sorting _just_ a slice requires copying that slice into a temp list, sorting the temp, then copying the sorted temp back in. Since every string here shares the same 10,000-character prefix, dealing with a single `x` position effectively requires physically copying the entire list twice over - mad overhead to copy the list 20 thousand times.
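The copy-out/copy-in dance has to be spelled out by hand today; a minimal helper showing it (the name `sort_slice` is mine, not anything in the stdlib):

```python
def sort_slice(xs, lo, hi, key=None):
    """Stable-sort xs[lo:hi] in place.

    Because list.sort() takes no bounds, the slice must be copied out,
    sorted, and copied back - two physical copies of hi - lo elements
    per call, which is exactly the overhead complained about above.
    """
    t = xs[lo:hi]       # copy out
    t.sort(key=key)     # stable sort of the temp list
    xs[lo:hi] = t       # copy back

xs = [5, 3, 9, 1, 7]
sort_slice(xs, 1, 4)    # sorts only xs[1:4]
# xs is now [5, 1, 3, 9, 7]
```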

While the need doesn't come up all that often, I'd support adding optional `lo` and `hi` arguments to `list.sort()`. This isn't the first time the lack has turned a slick approach into a timing disaster for me ;-)

About list.sort()'s merge strategy, I'm glad to be rid of the original. It's not just correctness, it's the effort of _reasoning_ about its consequences. About a full decade passed before the first proof was published that it really is a worst-case O(N log N) sort. listsort.txt didn't contain a proof because the only proof attempts I had were so complicated not even _I_ found them compelling ;-)

Vincent Jugé in particular started at the other end, looking for a merge strategy that made proof straightforward instead of Byzantine. It's straightforward under the "powersort" strategy too, although it relies on "well known" results about approximations to optimal binary search trees.

    def lexisort(xs):
        # Each bucket (lo, hi, k) means: xs[lo:hi] still needs sorting,
        # and all its elements already agree on positions 0 through k-1.
        buckets = [(0, len(xs), 0)]
        while buckets:
            lo, hi, k = buckets.pop()
            # Move sequences too short to have a position k to the front;
            # a sequence sorts before everything that extends it.
            t = []
            for i in range(lo, hi):
                x = xs[i]
                if k >= len(x):
                    xs[lo] = x
                    lo += 1
                else:
                    t.append(x)
            # Stable-sort the rest on position k alone.
            t.sort(key=lambda x: x[k])
            xs[lo : hi] = t
            # Each maximal run of equal xs[i][k] values with more than one
            # element becomes a new bucket, to be refined on position k+1.
            while lo < hi:
                pivotk = xs[lo][k]
                i = lo + 1
                while i < hi and xs[i][k] == pivotk:
                    i += 1
                if i - lo > 1:
                    buckets.append((lo, i, k + 1))
                lo = i
            assert lo == hi
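As a sanity check: although lexisort never compares two sequences against each other directly, it must agree with sorted(), which compares whole sequences lexicographically. A quick self-contained check (repeating the function verbatim so the snippet runs on its own):

```python
import random

def lexisort(xs):
    # Bucket refinement, as above: (lo, hi, k) = slice still to be
    # sorted, already equal on positions 0 through k-1.
    buckets = [(0, len(xs), 0)]
    while buckets:
        lo, hi, k = buckets.pop()
        t = []
        for i in range(lo, hi):
            x = xs[i]
            if k >= len(x):
                xs[lo] = x
                lo += 1
            else:
                t.append(x)
        t.sort(key=lambda x: x[k])
        xs[lo : hi] = t
        while lo < hi:
            pivotk = xs[lo][k]
            i = lo + 1
            while i < hi and xs[i][k] == pivotk:
                i += 1
            if i - lo > 1:
                buckets.append((lo, i, k + 1))
            lo = i

# Random lists of mixed lengths, including empty ones.
random.seed(45530)
xs = [random.choices(range(10), k=random.randrange(0, 6))
      for _ in range(500)]
expect = sorted(xs)   # ordinary lexicographic comparison of the lists
lexisort(xs)
assert xs == expect
```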