Parallel or distributed computing is a nontrivial thing in itself. The development environment must support it, the DS specialist must have the skills to carry it out, and the task must be reducible to a form that can be split into parts, if such a form exists at all. But with a competent approach you can greatly speed up the solution of a problem even with single-threaded R, provided you have at least a multi-core processor (and almost everyone does these days), subject to the theoretical speedup limit set by Amdahl's law. In some cases, however, even that limit can be circumvented.
This is a continuation of previous publications.
Typical approach
As a rule, when an analyst (DS specialist, developer, or whatever name suits you best) tries to speed up a task on a single machine and starts moving from single-threaded to multi-threaded execution, they do it in a boilerplate way: parApply, foreach with %dopar%, and so on. A compact and intelligible overview can be found, for example, here: “Parallelism in R”. Three steps (a minimal sketch follows the list):
- create a cluster of (cores - 1) workers;
- run the tasks using foreach;
- collect the answers and assemble the result.
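A minimal sketch of this boilerplate, assuming the doParallel backend (the toy computation inside the loop is just a placeholder):

```r
library(parallel)
library(foreach)
library(doParallel)

# Step 1: create a cluster of (cores - 1) workers
n_workers <- max(1, detectCores() - 1)
cl <- makeCluster(n_workers)
registerDoParallel(cl)

# Step 2: run the tasks with foreach / %dopar%
res <- foreach(i = 1:100, .combine = c) %dopar% {
  sqrt(i)   # placeholder for the real computation
}

# Step 3: the combined answers are already in `res`; release the workers
stopCluster(cl)
```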
For typical computational tasks that load the CPU at 100% and do not require transferring large amounts of input data, this is the right approach. The main point that needs attention is to provide logging inside the workers so that the process can be monitored. Without logging, you are flying blind, with no instruments at all.
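One simple way to get logging out of the workers, sketched with base R only (the file names and messages are illustrative, not a prescription):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

res <- foreach(i = 1:8, .combine = c) %dopar% {
  # each worker appends timestamped lines to its own log file
  log_file <- sprintf("worker_%s.log", Sys.getpid())
  cat(format(Sys.time()), "task", i, "started\n", file = log_file, append = TRUE)
  out <- i^2   # placeholder for the real computation
  cat(format(Sys.time()), "task", i, "done\n", file = log_file, append = TRUE)
  out
}

stopCluster(cl)
```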
In the case of "enterprise" tasks, parallelization runs into many additional methodological difficulties that significantly reduce the effect of the straightforward approach above:
- the load across workers can be badly unbalanced;
- CPU demand within a single task can be ragged, with only a few sharp bursts;
- each individual calculation may require a significant amount of memory for its inputs and may produce outputs of a considerable size;
- a single task may mix computation, disk I/O, and queries to external systems.
This is a completely typical scenario: as part of the workflow you receive a voluminous job as input, read data from disk, pull a large chunk from a database, query external systems and wait for their answers (the classic case is a REST API request), and then return N megabytes of results to the parent process.
Map-reduce by users, locations, documents, IP addresses, dates, ... (extend the list yourself). In the saddest cases, parallel execution can turn out to be slower than the single-threaded run, and out-of-memory problems can appear as well. Is everything lost? Not at all.
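To illustrate that kind of split, here is a sketch of a naive map-reduce keyed by user id (the events table, its columns, and the aggregation are all hypothetical):

```r
library(foreach)
library(doParallel)

# hypothetical event log keyed by user
events <- data.frame(user_id = sample(1:100, 1e4, replace = TRUE),
                     value   = runif(1e4))

# "map": one chunk per user; chunk sizes can be wildly uneven,
# which is exactly where the load imbalance comes from
chunks <- split(events, events$user_id)

cl <- makeCluster(4)
registerDoParallel(cl)

# "reduce": aggregate each chunk in a worker
res <- foreach(chunk = chunks, .combine = rbind) %dopar% {
  data.frame(user_id = chunk$user_id[1],
             total   = sum(chunk$value))
}

stopCluster(cl)
```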
Alternative way
Let us outline, point by point, a way to radically improve the situation. At the same time, we must not forget that we live in a real zoo of environments: the production circuit runs on *nix, DS laptops run Windows / *nix / macOS, and everything has to work uniformly everywhere.
- A microtask: receive the user's input, query a database, query two external information systems via REST, load and parse a reference file from disk, perform a calculation, dump the result to disk or a database. There are, say, 10^6 users.
- We switch to the future package and the universal doFuture adapter.
- If the individual tasks need only a small amount of processor time (most of it is spent waiting for answers from third-party systems), then doFuture lets you switch, in a single line, from splitting the work across threads to splitting it across separate processes (on *nix you can watch them start in htop); a minimal sketch follows this list.
- You can create many more such processes than there are cores. No contention occurs, since the individual processes spend most of their time waiting. You will, however, need to pick the optimal number of processes experimentally, based on the cyclogram (timing profile) of a typical processing run.
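As a rough sketch of that switch (the process_user function, the worker count, and the simulated waits are hypothetical placeholders, not the author's code):

```r
library(doFuture)   # also attaches foreach and future
library(foreach)

registerDoFuture()                # let %dopar% dispatch through the future framework
plan(multisession, workers = 32)  # the "one line": separate R processes, possibly more than cores

# hypothetical microtask: almost all of its time is spent waiting on external systems
process_user <- function(user_id) {
  Sys.sleep(0.5)                  # stands in for a DB query / REST call / disk read
  data.frame(user_id = user_id, status = "ok")
}

res <- foreach(u = 1:1000, .combine = rbind) %dopar% {
  process_user(u)
}
```

Because the workers mostly sleep while waiting, the wall-clock speedup is governed by the number of processes rather than by the number of cores, which is how the Amdahl's-law ceiling gets sidestepped for I/O-bound work.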
The result: the original task runs many times faster. The speedup can even exceed the number of available cores.
Detailed production code is deliberately omitted here, since the main task of this publication is to share the approach and the excellent future family of packages; the snippets above are only illustrative sketches.
PS
There are a few small nuances that also need to be kept in mind:
- Each process consumes memory, including the data it receives and returns. Increasing the number of processes multiplies the requirements for available RAM.
- doFuture uses "magic" to automatically determine which variables and packages need to be shipped to a process, but you should not let everything run on autopilot; it is better to check.
- Inside the processes, explicit control of gc and explicit removal of variables with rm will not hurt. This is not a panacea and may not work, but explicitly naming the objects to be dropped does help.
- After the calculation is complete, call plan(sequential). This will close all worker processes and free the memory they occupy.
- If a large amount of data needs to be passed to a process, consider using external storage (a disk, a database). Do not forget that descriptors cannot be passed across processes; the source must be opened inside the process itself (see the sketch after this list).
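A minimal sketch pulling these hygiene points together (the file names and the calculation are hypothetical):

```r
library(doFuture)
library(foreach)

registerDoFuture()
plan(multisession, workers = 16)

res <- foreach(i = 1:100, .combine = c) %dopar% {
  # open the source inside the process itself: descriptors cannot be shipped across
  big_input <- readRDS(sprintf("input_%03d.rds", i))
  out <- sum(big_input$value)   # placeholder for the real calculation
  rm(big_input)                 # explicitly drop large objects ...
  gc()                          # ... and nudge the garbage collector
  out
}

plan(sequential)                # close all worker processes and free their memory
```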
Previous publication: “Business processes in enterprise companies: speculation and reality. Shedding light with R”.