This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Multi-process 2to3
Type: performance Stage:
Components: 2to3 (2.x to 3.x conversion tool) Versions:
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: collinwinter Nosy List: amaury.forgeotdarc, benjamin.peterson, collinwinter, nedds, pitrou
Priority: normal Keywords: patch

Created on 2008-07-25 21:06 by nedds, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
multiProcess.diff nedds, 2008-07-25 21:06 Diff for refactor.py changes
multiProcess2.diff nedds, 2008-07-25 21:20 Diff for refactor with situational processing import
Messages (17)
msg70277 - (view) Author: Nick Edds (nedds) Date: 2008-07-25 21:06
Here is a working, multiprocess version of 2to3 with a few caveats.
First, you need to already have the processing module installed for this
to work. If we don't want to include processing in some way, I think I
can modify this to only import processing and use the Process method if
the user wants to run it with more than one process. Also, I know that
logger is supposed to be thread safe, so I am correct in assuming that
this means it is also multi-process safe? This fix delegates the fixing
of files to a designated number of processes, so on a multi-core
machine, it is significantly faster. It may be appropriate to add a test
suite for it, but I have not yet done so. I've tested it on my dual-core
laptop running Ubuntu and a quad-core mac and it appears to be working
fine. I don't know if the use of tempfile was the right choice, but it
seemed reasonable. Another possibility is to instead using a processing
Queue and pass it the output, filename, and input rather than calling
write_file. Also, the join timeout I used is completely arbitrary
because I don't really have a sense of what a reasonable value for it
should be.
msg70280 - (view) Author: Nick Edds (nedds) Date: 2008-07-25 21:20
Here is a version that only imports processing if the multi-process
option is specified. I don't know if this is the most efficient way it
can be done, and I think there's a better way to do it, but this works.
msg74108 - (view) Author: Nick Edds (nedds) Date: 2008-10-01 02:10
Is there still any interest in this Collin? Is there anything else you
need me to do for it?
msg74109 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-10-01 03:08
Nick, is there a way you could isolate the process functionality in a
RefactoringTool subclass? It's an interesting idea, but I don't it needs
to infect the main library.
msg74121 - (view) Author: Collin Winter (collinwinter) * (Python committer) Date: 2008-10-01 09:05
I believe the only issue I recall was that the patch didn't work
out-of-the-box for Python 2.6 (changed imports, PEP 8 compliance changes
in the multiprocess module). Has that been fixed?

I disagree with Benjamin: this is an import speed increase, and I don't
see the point in adding the needless complexity of a subclass. The user
won't notice any difference, and anyone wanting to use this as a library
will want the faster version.
msg74144 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-10-01 20:35
I don't think we should force people using it as a library to go
multiprocess. Also, it's trivial to just change the name of the class
used if that is wanted.
msg74145 - (view) Author: Nick Edds (nedds) Date: 2008-10-01 20:43
The currently attached patch works in Python2.5 not Python2.6, so I will
update it for 2.6 when I get the chance. But as it is currently written,
the default behavior is not multiprocess. Instead, if you want
multiprocess, you specify how many processes you want when you run 2to3
from the command line. And as long as you've got multiple files to run
2to3 on and multiple cores, doing multiprocess 2to3 is significantly faster.
msg74155 - (view) Author: Collin Winter (collinwinter) * (Python committer) Date: 2008-10-02 08:40
"I don't think we should force people using it as a library to go
multiprocess."

I don't understand this. What downsides do you perceive in
multiprocessing support? Multiprocessing is a significant speed-up for
2to3 on multicore systems.
msg74200 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-10-02 22:07
I'm just saying that a client of lib2to3 shouldn't have to use multiple
processes. Just as long as the multiple processes are optional, I'm
happy. :)
msg74226 - (view) Author: Collin Winter (collinwinter) * (Python committer) Date: 2008-10-03 06:52
You have yet to articulate a reason for that preference.
msg74272 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-10-03 21:01
Because I would like to use lib2to3 without the complexity of multiple
processes. That it is a good performance boost is excellent, but I don't
think it should be a required part of using the library.
msg74273 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-10-03 21:04
I think it's reasonable to only enable multiprocessing if the adequate
command-line option has been set. It's how `make` already works (next
time you compile Python, try `make -jN` where N is your number of CPU
cores).
msg74429 - (view) Author: Collin Winter (collinwinter) * (Python committer) Date: 2008-10-07 09:09
Benjamin, what "complexity" did you encounter when trying to use lib2to3
in your own work? Unless there's a concrete use-case where the mere
existence of multiprocessing support (as opposed to actually enabling
that support) made a tangible project harder, I say we leave it in. I
can't imagine someone making the conscious choice to slow down their own
lib2to3-based refactoring tool by 2x; if those people appear later,
we'll deal with their concerns at that time.

Nick: if you don't have time to make this work in 2.6, I can do so. The
patch will also need doc fixes and tests, though the core functionality
is fine.
msg74437 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-07 11:38
I suggest that when using lib2to3 as a library, multiprocessing is not
enabled by default; there may be uses of the library that are
incompatible with multiprocessing.

It may be enabled by default when using it from the command line (or the
lib2to3.main module). But which default number of processes would this use?

Concerning the patch itself: 
- the line "from processing import Process" seems suspect.
- Did you consider using something as simple as:
    pool = multiprocessing.Pool(self.options.num_processes)
    pool.map(self.refactor_file, fullnames)
It should do all the job: start processes, queue tasks, wait for results.
msg74486 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-10-07 21:07
I'm not opposed to having the support available. I just don't what it
enabled by default.
msg74615 - (view) Author: Nick Edds (nedds) Date: 2008-10-10 02:57
I had very little experience with the processing module prior to the
creation of this patch, and because pool objects are listed last in the
documentation, I did not read about them because I saw a way to achieve
what I wanted using Process. But having looked at the documentation for
Pool, I think you are correct that it would be a much cleaner solution
to the problem, as I was essentially implementing a Pool myself. I will
post a new patch to reflect this change tomorrow. I will also submit the
patch to take advantage of the fact that the multiprocessing module is
included in Python2.6, as opposed to my prior patch which was designed
for Python2.5 which required the user to get the processing module. And
in the patch, as before, multiprocess will not be default, but it will
be something the user can enable via the command line.
msg85103 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-04-01 22:41
I added some support for concurrent running in r70999.
History
Date User Action Args
2022-04-11 14:56:36adminsetgithub: 47698
2009-04-01 22:41:23benjamin.petersonsetstatus: open -> closed
resolution: accepted
messages: + msg85103
2008-10-10 02:57:38neddssetmessages: + msg74615
2008-10-07 21:07:38benjamin.petersonsetmessages: + msg74486
2008-10-07 11:38:43amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg74437
2008-10-07 09:09:12collinwintersetmessages: + msg74429
2008-10-03 21:04:42pitrousetnosy: + pitrou
messages: + msg74273
2008-10-03 21:01:27benjamin.petersonsetmessages: + msg74272
2008-10-03 06:52:23collinwintersetmessages: + msg74226
2008-10-02 22:07:50benjamin.petersonsetmessages: + msg74200
2008-10-02 08:40:19collinwintersetmessages: + msg74155
2008-10-01 20:43:58neddssetmessages: + msg74145
2008-10-01 20:35:21benjamin.petersonsetmessages: + msg74144
2008-10-01 09:05:08collinwintersetmessages: + msg74121
2008-10-01 03:08:36benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg74109
2008-10-01 02:10:39neddssetmessages: + msg74108
2008-07-25 21:20:31neddssetfiles: + multiProcess2.diff
messages: + msg70280
2008-07-25 21:06:38neddscreate