Author kousu
Recipients kousu, rhettinger, tim.peters
Date 2020-04-01.22:34:51
Message-id <1585780492.17.0.42194278579.issue40110@roundup.psfhosted.org>
Content
Thank you for taking the time to consider my points! Yes, I think you understood exactly what I was getting at.

I slept on it, and the day after posting I came around to most of the points you raise on my own, especially that a serialized next() would mean serialized processing. So the naive approach is out.

I am motivated by trying to introduce backpressure to my pipelines. The example you gave has potentially unbounded memory usage; if I simply slow the consumer down with sleep(), I get what amounts to a memory leak, and the main Python process pins a CPU even though it "isn't" doing anything:

    import itertools, multiprocessing, sys, time

    def worker(n):
        return n ** 3

    with multiprocessing.Pool(4) as pool:
        for i, v in enumerate(pool.imap(worker, itertools.count(1)), 1):
            time.sleep(7)  # slow consumer; imap() keeps submitting tasks anyway
            print(f"At {i}->{v}, memory usage is {sys.getallocatedblocks()}")

At 1->1, memory usage is 230617
At 2->8, memory usage is 411053
At 3->27, memory usage is 581439
At 4->64, memory usage is 748584
At 5->125, memory usage is 909694
At 6->216, memory usage is 1074304
At 7->343, memory usage is 1238106
At 8->512, memory usage is 1389162
At 9->729, memory usage is 1537830
At 10->1000, memory usage is 1648502
At 11->1331, memory usage is 1759864
At 12->1728, memory usage is 1909807
At 13->2197, memory usage is 2005700
At 14->2744, memory usage is 2067407
At 15->3375, memory usage is 2156479
At 16->4096, memory usage is 2240936
At 17->4913, memory usage is 2328123
At 18->5832, memory usage is 2456865
At 19->6859, memory usage is 2614602
At 20->8000, memory usage is 2803736
At 21->9261, memory usage is 2999129

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
11314 kousu     20   0  303308  40996   6340 S  91.4  2.1   0:21.68 python3.8   
11317 kousu     20   0   54208  10264   4352 S  16.2  0.5   0:03.76 python3.8   
11315 kousu     20   0   54208  10260   4352 S  15.8  0.5   0:03.74 python3.8   
11316 kousu     20   0   54208  10260   4352 S  15.8  0.5   0:03.74 python3.8   
11318 kousu     20   0   54208  10264   4352 S  15.5  0.5   0:03.72 python3.8 

It seems to me that any use of Pool.imap() either has this same issue lurking, or is run on a small finite data set where you are better off using Pool.map() anyway.
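One stopgap that keeps memory bounded under the current API is to batch the stream into finite chunks and use map() per chunk. A rough sketch (the chunks() helper and the batch size 64 are arbitrary names/choices of mine, not anything in multiprocessing):

    import itertools, multiprocessing

    def worker(n):
        return n ** 3

    def chunks(iterable, size):
        # split a (possibly infinite) iterable into finite lists
        it = iter(iterable)
        while batch := list(itertools.islice(it, size)):
            yield batch

    with multiprocessing.Pool(4) as pool:
        for batch in chunks(itertools.count(1), 64):
            for v in pool.map(worker, batch):  # at most 64 results held at once
                print(v)

But that gives up pipelining across batch boundaries, which is half the point of imap().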

I like generators because they keep the work constant-memory (and each step constant-time), which seems like a natural fit for stream-processing situations. Unix pipelines get backpressure for free, because write() blocks when the pipe buffer is full.
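In the meantime, the closest I can get to backpressure today is to throttle the iterable feeding imap() with a semaphore that the consumer releases. As far as I can tell this works because imap() consumes its input from a feeder thread inside the parent process, so a threading.Semaphore is enough to block it. Sketch (throttled() and the in-flight limit of 8 are my own arbitrary choices):

    import itertools, multiprocessing, threading

    def worker(n):
        return n ** 3

    def throttled(iterable, sem):
        # the Pool's task-feeder thread blocks here until a slot frees up
        for item in iterable:
            sem.acquire()
            yield item

    sem = threading.Semaphore(8)  # allow at most ~8 tasks in flight
    with multiprocessing.Pool(4) as pool:
        for v in pool.imap(worker, throttled(itertools.count(1), sem)):
            sem.release()  # one result consumed; admit one more task
            print(v)

Memory then stays bounded however slow the consumer is, but it's a workaround bolted on from outside rather than backpressure built into the pool.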

I did come up with one possibility after sleeping on it again: run the final iteration in parallel, perhaps via a special plist() method that makes as many parallel next() calls as possible. There are definitely details to work out, but I plan to prototype it when I have spare time in the next couple of weeks.

You're entirely right that it's a risky change to suggest, so maybe it would be best expressed as a package if I get it working. Can I keep this issue open to see if it draws insights from anyone else in the meantime?

Thanks again for taking a look! That's really cool of you!