Author kousu
Recipients kousu, rhettinger, tim.peters
Date 2020-04-01.22:34:51
Message-id <1585780492.17.0.42194278579.issue40110@roundup.psfhosted.org>
Content
Thank you for taking the time to consider my points! Yes, I think you understood exactly what I was getting at.

I slept on it, and the day after posting I came around to most of the points you raise on my own, especially that a serialized next() would mean serialized processing. So the naive approach is out.

I am motivated by trying to introduce backpressure to my pipelines. The example you gave has potentially unbounded memory usage; if I simply slow the consumer down with sleep(), I get what amounts to a memory leak, and the main Python process pins a CPU even though it "isn't" doing anything:

    import itertools, multiprocessing, sys, time

    def worker(n):
        return n ** 3

    with multiprocessing.Pool(4) as pool:
        for i, v in enumerate(pool.imap(worker, itertools.count(1)), 1):
            time.sleep(7)  # slow consumer; imap() keeps submitting tasks anyway
            print(f"At {i}->{v}, memory usage is {sys.getallocatedblocks()}")

At 1->1, memory usage is 230617
At 2->8, memory usage is 411053
At 3->27, memory usage is 581439
At 4->64, memory usage is 748584
At 5->125, memory usage is 909694
At 6->216, memory usage is 1074304
At 7->343, memory usage is 1238106
At 8->512, memory usage is 1389162
At 9->729, memory usage is 1537830
At 10->1000, memory usage is 1648502
At 11->1331, memory usage is 1759864
At 12->1728, memory usage is 1909807
At 13->2197, memory usage is 2005700
At 14->2744, memory usage is 2067407
At 15->3375, memory usage is 2156479
At 16->4096, memory usage is 2240936
At 17->4913, memory usage is 2328123
At 18->5832, memory usage is 2456865
At 19->6859, memory usage is 2614602
At 20->8000, memory usage is 2803736
At 21->9261, memory usage is 2999129

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
11314 kousu     20   0  303308  40996   6340 S  91.4  2.1   0:21.68 python3.8   
11317 kousu     20   0   54208  10264   4352 S  16.2  0.5   0:03.76 python3.8   
11315 kousu     20   0   54208  10260   4352 S  15.8  0.5   0:03.74 python3.8   
11316 kousu     20   0   54208  10260   4352 S  15.8  0.5   0:03.74 python3.8   
11318 kousu     20   0   54208  10264   4352 S  15.5  0.5   0:03.72 python3.8 

It seems to me that any use of Pool.imap() either has this same issue lurking, or is run on a small finite data set where you are better off using Pool.map() anyway.
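One stopgap that keeps memory bounded under the current API is to batch the stream into finite chunks and use map() per chunk. A rough sketch (the chunks() helper and the batch size 64 are arbitrary names/choices of mine, not anything in multiprocessing):

    import itertools, multiprocessing

    def worker(n):
        return n ** 3

    def chunks(iterable, size):
        # split a (possibly infinite) iterable into finite lists
        it = iter(iterable)
        while batch := list(itertools.islice(it, size)):
            yield batch

    with multiprocessing.Pool(4) as pool:
        for batch in chunks(itertools.count(1), 64):
            for v in pool.map(worker, batch):  # at most 64 results held at once
                print(v)

But that gives up pipelining across batch boundaries, which is half the point of imap().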

I like generators because they keep the work constant-memory (and each step constant-time), which seems like a natural fit for stream-processing situations. Unix pipelines get backpressure for free, because write() blocks when the pipe buffer is full.
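In the meantime, the closest I can get to backpressure today is to throttle the iterable feeding imap() with a semaphore that the consumer releases. As far as I can tell this works because imap() consumes its input from a feeder thread inside the parent process, so a threading.Semaphore is enough to block it. Sketch (throttled() and the in-flight limit of 8 are my own arbitrary choices):

    import itertools, multiprocessing, threading

    def worker(n):
        return n ** 3

    def throttled(iterable, sem):
        # the Pool's task-feeder thread blocks here until a slot frees up
        for item in iterable:
            sem.acquire()
            yield item

    sem = threading.Semaphore(8)  # allow at most ~8 tasks in flight
    with multiprocessing.Pool(4) as pool:
        for v in pool.imap(worker, throttled(itertools.count(1), sem)):
            sem.release()  # one result consumed; admit one more task
            print(v)

Memory then stays bounded however slow the consumer is, but it's a workaround bolted on from outside rather than backpressure built into the pool.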

I did come up with one possibility after sleeping on it again: run the final iteration in parallel, perhaps via a special plist() method that makes as many parallel next() calls as possible. There are definitely details to work out, but I plan to prototype it when I have spare time in the next couple of weeks.

You're entirely right that it's a risky change to suggest, so maybe it would be best expressed as a package if I get it working. Can I keep this issue open to see if it draws insights from anyone else in the meantime?

Thanks again for taking a look! That's really cool of you!