This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: File-level, optionally external sorting
Type: enhancement
Stage: resolved
Components:
Versions: Python 3.10
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: Dennis Sweeney, pablogsal, platon.work, rhettinger
Priority: normal
Keywords:

Created on 2020-08-31 16:01 by platon.work, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name: disksort.py
Uploaded: Dennis Sweeney, 2020-09-01 06:31
Description: External sort for iterables proof of concept
Messages (9)
msg376157 - Author: Platon workaccount (platon.work) Date: 2020-08-31 16:01
Feature request: a convenient way to sort whole files, with the option of using the disk when the data does not fit in memory. Here is the syntax I have in mind:

shutil.sort(src, dst, headers=0, key=None, reverse=False, allow_disk_use=False)
msg376168 - Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-08-31 21:18
What are you referring to when you say "sorting a file"?

What does "key" act upon? Strings representing the lines in the file?

For allow_disk_use=False, what's the difference between opening the file, reading the lines, using sort() and writing the contents?
msg376174 - Author: Platon workaccount (platon.work) Date: 2020-08-31 23:55
I mean Python's analog of sort [-k x.y] table.txt from GNU Coreutils.

>> What are you referring to when you say "sorting a file"?

Sorting a file with multi-line plain text. Optionally, text consisting of
several columns separated by a specific character.

>> What does "key" act upon? Strings representing the lines in the file?

This is a sort-rule argument, similar to the key argument of the existing
in-memory list.sort() method and sorted() function.
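
To illustrate what such a key could look like for column-separated text, here is a minimal sketch. The helper name column_key and the tab separator are assumptions made for this example, not part of the proposal. Note that the index here is 0-based, whereas the fields of sort -k are 1-based.

```python
def column_key(index, sep="\t"):
    """Return a key function that extracts one column from a text line.

    Hypothetical helper for illustration; not part of the standard library.
    """
    def key(line):
        # Drop the trailing newline so the last column compares cleanly.
        return line.rstrip("\n").split(sep)[index]
    return key


lines = ["b\t2\n", "a\t3\n", "c\t1\n"]
# Sort by the second column, compared as text (not as numbers).
by_second_column = sorted(lines, key=column_key(1))
```

The same key function could be passed unchanged to any sorter that accepts a key argument.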

>> For allow_disk_use=False, what's the difference between opening the
file, reading the lines, using sort() and writing the contents?

If False, there is no difference.
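
For reference, the in-memory case could be sketched roughly as follows. The function name sort_file_in_memory and the UTF-8 encoding are assumptions for this example; the parameters mirror the proposed signature, minus allow_disk_use.

```python
def sort_file_in_memory(src, dst, headers=0, key=None, reverse=False):
    """Sort the lines of src into dst, leaving the first `headers` lines in place.

    Hypothetical sketch: the whole file is read into memory.
    """
    with open(src, encoding="utf-8") as infile:
        lines = infile.readlines()
    head, body = lines[:headers], lines[headers:]
    # Make sure the last line ends with a newline so that lines
    # do not fuse together once they are reordered.
    if body and not body[-1].endswith("\n"):
        body[-1] += "\n"
    body.sort(key=key, reverse=reverse)
    with open(dst, "w", encoding="utf-8") as outfile:
        outfile.writelines(head)
        outfile.writelines(body)
```

This only works while the whole file fits in memory, which is the case the disk-backed option is meant to go beyond.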

msg376178 - Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2020-09-01 03:22
If we were to do this, I think a better API might be to accept an arbitrary iterable, then produce a sorted iterable:

def sorted_on_disk(iterable, key=None, reverse=False) -> Iterable:
    ...

It would sort chunks of the input and store them in files as sequences of pickles, merging them as they got bigger, and then return an iterator over the resulting single sorted file.

This would be more composable with other standard Python functions and would be a good way of separating concerns. sorted(...) and heapq.merge(...) already have the correct APIs to do it this way.

Potential implementation detail: For some small fixed n, always doing a 2^n-way heapq.merge instead of a bunch of 2-way merges would do fewer passes over the data and would allow the keys to be computed 1/n as many times, assuming we wouldn't decorate-sort-undecorate.
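
A rough sketch of the approach described above, for illustration only. This is not the attached disksort.py; the helper names, the chunk_size and fan_in parameters, and their defaults are all assumptions. It sorts fixed-size chunks with sorted(), stores each as a sequence of pickles in a temporary file, and merges runs fan_in at a time with heapq.merge. Keys are recomputed during each merge rather than stored (no decorate-sort-undecorate).

```python
import heapq
import pickle
import tempfile
from itertools import islice


def _dump_run(items):
    """Write already-sorted items to a temporary file as a sequence of pickles."""
    run = tempfile.TemporaryFile()
    for item in items:
        pickle.dump(item, run)
    run.seek(0)
    return run


def _load_run(run):
    """Yield the pickled items back from a run file, then close it."""
    with run:
        while True:
            try:
                yield pickle.load(run)
            except EOFError:
                return


def sorted_on_disk(iterable, key=None, reverse=False, chunk_size=100_000, fan_in=8):
    """Yield the items of `iterable` in sorted order, spilling sorted runs to disk.

    Hypothetical sketch; chunk_size and fan_in are illustrative parameters.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    iterator = iter(iterable)
    runs = []
    while True:
        # Sort one chunk in memory and spill it to its own run file.
        chunk = sorted(islice(iterator, chunk_size), key=key, reverse=reverse)
        if not chunk:
            break
        runs.append(_dump_run(chunk))
    # Each pass merges groups of `fan_in` adjacent runs into longer runs.
    while len(runs) > fan_in:
        runs = [
            _dump_run(heapq.merge(*map(_load_run, runs[i:i + fan_in]),
                                  key=key, reverse=reverse))
            for i in range(0, len(runs), fan_in)
        ]
    # The final merge is lazy: items are read from disk as they are consumed.
    yield from heapq.merge(*map(_load_run, runs), key=key, reverse=reverse)
```

Usage would look like `for item in sorted_on_disk(records, key=len): ...`. One caveat of this sketch: every run keeps an open temporary file until it is merged, so a very large input with a small chunk_size could run into the operating system's limit on open files.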
msg376179 - Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2020-09-01 06:31
Attached is a proof of concept.
msg376332 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-03 23:41
This doesn't seem like something that should be in the standard library.  It is more of an application than a building block for writing code.  It is a good candidate for an external package on PyPI rather than the standard library.
msg376333 - Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-09-03 23:44
I am of the same opinion as Raymond.
msg376334 - Author: Platon workaccount (platon.work) Date: 2020-09-04 00:18
Why is shutil.make_archive suitable for the standard library, but a file sorter is not?
msg376391 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-04 17:45
Thanks for the suggestion, but Pablo and I agree that this isn't within scope for the standard library.  Marking as closed.

If you want to discuss further, please post to the Python ideas list.
History
Date User Action Args
2022-04-11 14:59:35 admin set github: 85844
2020-09-04 17:45:27 rhettinger set status: open -> closed; resolution: rejected; messages: + msg376391; stage: resolved
2020-09-04 00:18:25 platon.work set messages: + msg376334
2020-09-03 23:44:50 pablogsal set messages: + msg376333
2020-09-03 23:41:55 rhettinger set nosy: + rhettinger; messages: + msg376332
2020-09-01 06:31:49 Dennis Sweeney set files: + disksort.py; messages: + msg376179
2020-09-01 03:22:16 Dennis Sweeney set nosy: + Dennis Sweeney; messages: + msg376178
2020-08-31 23:55:19 platon.work set messages: + msg376174
2020-08-31 21:18:15 pablogsal set nosy: + pablogsal; messages: + msg376168
2020-08-31 16:01:22 platon.work create