Essential Tips for your next implementation
Starting with the dual-core processors of the mid-2000s, personal computers have grown more muscular with every upgrade. Today they ship with many cores, which lets them run several tasks at the same time. As data scientists, we come across many situations, such as running big loops or filtering and aggregating data, where iterative operations offer a chance to use the multitasking power of the CPU. Such an operation can be parallelized so that the results arrive sooner. I often give the analogy of a supermarket queue: if 100 people are queued to pay their bills, the more counters are open, the earlier everyone checks out.
Python provides a great multiprocessing library that is easy to use. In recent research work on a few algorithms, I found that loops with many iterations were the main time consumers in several places. I re-engineered them with multiprocessing to reduce the processing time, and hit a few hiccups along the way. In this blog, I am sharing some essential no-jargon tips to keep in mind when using the multiprocessing package in a program. The reader is expected to have some basic knowledge of multiprocessing. Before jumping in, here is a refresher on the basics:
- Parent and Child Process: Multiprocessing usually forms one part of a program. The main program runs in the parent process, which starts (spawns) child processes when the parallel instructions are reached. Each child process works on its share of the instructions. Finally, the parent process collects the results and carries on with the remaining instructions. The parent process is like a leader that guides the child processes.
- Pool: A Pool is like a task handler in multiprocessing: it distributes the tasks to child processes and then collects the results in the main program. It can limit the number of child processes and offers different ways to parallelize.
- Are multiprocessing and multithreading the same? Conceptually both are ways to run instructions concurrently, but they are NOT the same. Child processes in multiprocessing run independently of one another and do not share memory, whereas threads are sub-tasks within the same process and therefore share memory. Which of the two is better depends on the objective of the parallelization. In general, multiprocessing is safer than multithreading because separate memory avoids many shared-state bugs. In standard CPython, the global interpreter lock (GIL) also usually keeps threads from running Python code on several cores at once, so CPU-heavy work tends to benefit more from processes.
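To make the parent/child relationship concrete, here is a minimal sketch. The function name worker_pid and the pool size of 2 are my own choices for illustration. Each task simply reports the ID of the process that ran it, so the parent can see that the work happened elsewhere.

```python
import os
from multiprocessing import Pool


def worker_pid(_):
    """Return the ID of the process that executed this call."""
    return os.getpid()


if __name__ == "__main__":
    # The parent creates a pool of 2 child processes and hands out 8 tasks.
    with Pool(processes=2) as pool:
        child_pids = set(pool.map(worker_pid, range(8)))

    print("parent process:", os.getpid())
    print("child processes that did the work:", child_pids)
```

The printed IDs differ from run to run; the point is only that the child IDs are not the parent's ID.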
With those basics in place, here are a few things that I feel one should keep in mind. Please note that these are empirical and qualitative in nature:
- Pool provides several ways to parallelize an instruction. The main ones are the blocking pool.map/pool.apply and the non-blocking pool.map_async/pool.apply_async. Beginners often wonder which one to pick.
To simplify, consider the following requirements:
- The order of the results should match the order of the inputs.
- The main program should wait until all the processes are done computing.
If either of the above holds, pool.map is the simplest choice: it blocks until every task is done and returns the results in input order. Otherwise pool.apply_async can be used: it returns immediately with a handle whose result is fetched later, and it can also be used to run different functions at the same time.
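The difference can be sketched as follows. The functions square and cube are toy stand-ins I made up for real work, and the pool size is arbitrary.

```python
from multiprocessing import Pool


def square(x):
    return x * x


def cube(x):
    return x ** 3


if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # pool.map blocks until all inputs are processed and keeps input order.
        print(pool.map(square, range(5)))  # [0, 1, 4, 9, 16]

        # pool.apply_async returns at once with a handle (AsyncResult);
        # here two different functions are submitted side by side.
        pending = [pool.apply_async(square, (3,)), pool.apply_async(cube, (3,))]

        # .get() waits for each individual result.
        print([handle.get() for handle in pending])  # [9, 27]
```

Note that the second list is ordered by the handles collected in the parent, not by which task finished first.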
- Multiprocessing has an overhead. It does not always improve speed, because distributing the tasks and collecting the results costs time. Under the hood, Python serializes (pickles) the function arguments, sends them to the child processes where they are rebuilt, runs the function, and then pickles the results and sends them back so that the main program can consume them. Based on that, here are two recommendations:
- If the number of iterations is really small, it might be best not to invest time in multiprocessing.
- Serialization and transfer between processes are expensive, so the smaller the function arguments and results, the faster the distribution and collection. Try to minimize the size (in memory) of the objects. Here, knowledge of Python data structures truly helps.
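As a rough way to see how much data a task would ship between processes, one can measure the pickled size of its arguments. The sketch below uses made-up records (the field names id, value and note are assumptions) and compares sending whole rows with sending only the one field a worker would actually need.

```python
import pickle


def pickled_size(obj):
    """Number of bytes needed to serialize obj with pickle."""
    return len(pickle.dumps(obj))


# Hypothetical records: every row carries fields the worker never reads.
rows = [
    {"id": i, "value": float(i), "note": "unused text " * 10}
    for i in range(1000)
]

# Only the field the worker needs.
values_only = [row["value"] for row in rows]

print(pickled_size(rows), "bytes for the full rows")
print(pickled_size(values_only), "bytes for the values alone")
```

The exact byte counts depend on the Python version and pickle protocol, but the trimmed list is clearly the cheaper one to send.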
- Nested pools are not possible. Countless algorithms run on nested iterations, and someone working on one of them will surely ask, “What if I could start a pool inside a pool and achieve more parallelism?” The answer is NO. A Pool’s worker processes are daemon processes, and Python does not allow a daemon process to create child processes of its own (read about daemon processes).
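The sketch below checks the daemon flag inside pool workers and then shows one possible way around the restriction, which is my own suggestion rather than the only option: flatten the nested loops into a single stream of index pairs and give that to one pool. The names is_daemon and cell are hypothetical.

```python
from itertools import product
from multiprocessing import Pool, current_process


def is_daemon(_):
    """Report whether the process running this call is a daemon."""
    return current_process().daemon


def cell(i, j):
    """Hypothetical body of the inner loop for one (i, j) pair."""
    return i * 10 + j


if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # Pool workers are daemonic, so they may not start children.
        print(pool.map(is_daemon, range(2)))  # [True, True]

        # Flatten the outer and inner loop into one list of (i, j) pairs.
        pairs = product(range(2), range(3))
        print(pool.starmap(cell, pairs))  # [0, 1, 2, 10, 11, 12]
```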
- Always use if __name__ == "__main__": in the parent/main program to protect the entry point of the program. A simple explanation: when child processes are started with the spawn method (the default on Windows and macOS), each child imports the main module again, so all module-level code runs in the child as well. If the entry point isn’t protected, each child would try to create child processes of its own, leading to runaway process creation (Python 3 typically stops this with a RuntimeError). The conditional statement makes sure the spawning happens only once.
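A minimal sketch of a protected entry point, assuming a toy function double and a wrapper named main (both names are illustrative):

```python
from multiprocessing import Pool


def double(x):
    return 2 * x


def main():
    # All process creation lives inside a function, not at module level.
    with Pool(processes=2) as pool:
        return pool.map(double, [1, 2, 3])


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when a child
    # process imports this module again.
    print(main())  # [2, 4, 6]
```

Keeping everything except imports and definitions behind the guard means a re-import in a child does nothing beyond defining double and main.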
- It’s always good to design the algorithm so that the mapping functions are modular enough to be used both with and without a pool. An exception raised in a child process surfaces in the parent only when the results are collected, and the traceback is harder to follow than in serial code, so finding the point of failure becomes difficult. Having a workflow without a pool helps with debugging these incidents, and it can also be used for small workloads, as suggested above in the tip on overhead.
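One possible shape for such a design is sketched below. The names score and run are hypothetical; the idea is simply that the same function runs through the built-in map when no pool is passed and through pool.map otherwise.

```python
from multiprocessing import Pool


def score(x):
    """Toy mapping function that fails on bad input."""
    if x < 0:
        raise ValueError(f"negative input: {x}")
    return x ** 0.5


def run(items, pool=None):
    """Apply score to every item, serially or through the given pool."""
    mapper = pool.map if pool is not None else map
    return list(mapper(score, items))


if __name__ == "__main__":
    data = [0, 1, 4, 9]

    # Serial path: plain traceback, easy to step through in a debugger.
    print(run(data))  # [0.0, 1.0, 2.0, 3.0]

    # Parallel path: same function, same call, just with a pool.
    with Pool(processes=2) as pool:
        print(run(data, pool))  # [0.0, 1.0, 2.0, 3.0]
```

When a run fails inside the pool, calling run on the same input without a pool reproduces the error in the parent process with an ordinary traceback.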
I would have saved hours of testing, debugging and rewriting code if someone had told me about these nuances of multiprocessing before I got started. Some lessons are learnt the hard way.
In summary, this article provides basic information on multiprocessing along with 5 recommendations for a data scientist or developer starting to parallelize a program.
Comment and share with me your experience in multiprocessing!