I am a newbie using Java to do some data processing on csv files. For that I use the multithreading capabilities of Java (pools of threads) to batch-import the csv files into Ja
For many use cases, multithreading has less overhead than multiprocessing when comparing spawning a thread vs spawning a process as well as comparing communication between threads vs inter-process communication.
However, there are scenarios where multithreading can degrade performance to the point where a single thread outperforms multiple threads, such as cases severely affected by false sharing. With multiprocessing, since each process has its own memory space there is no chance for false sharing to occur and the multiprocessing solution can outperform the multithreading solution.
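To make the false-sharing scenario concrete, here is a minimal sketch rather than a rigorous benchmark. It assumes 64-byte cache lines, and the class name `FalseSharingDemo` and the iteration count are made up for illustration. Two threads each increment their own counter, first in adjacent array slots (which very likely share a cache line), then in slots 128 bytes apart.

```java
import java.util.concurrent.atomic.AtomicLongArray;

class FalseSharingDemo {
    // Hypothetical workload size, chosen only so the run takes a measurable time.
    static final int ITERATIONS = 5_000_000;

    /** Runs two threads, each incrementing its own slot, and returns elapsed nanoseconds. */
    static long run(AtomicLongArray counters, int slotA, int slotB) {
        Thread a = new Thread(() -> {
            for (int i = 0; i < ITERATIONS; i++) counters.incrementAndGet(slotA);
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < ITERATIONS; i++) counters.incrementAndGet(slotB);
        });
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Slots 0 and 1 are 8 bytes apart, so they very likely sit on the same cache line.
        long adjacent = run(new AtomicLongArray(32), 0, 1);
        // Slots 0 and 16 are 128 bytes apart, so (assuming 64-byte lines) they should not.
        long spaced = run(new AtomicLongArray(32), 0, 16);
        System.out.printf("adjacent: %d ms, spaced: %d ms%n",
                adjacent / 1_000_000, spaced / 1_000_000);
    }
}
```

The timings depend heavily on the CPU and on JIT warm-up, so treat the numbers as a hint only; a harness such as JMH is the better tool for a real measurement.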
Overall, some analysis should be conducted when choosing a concurrent programming solution, since the best-performing solution varies from case to case. Multithreading cannot be assumed to outperform multiprocessing, since there are counterintuitive situations where multithreading performs worse than a single thread. When performance is a major consideration, run benchmarks comparing single-threaded, multithreaded, and multiprocess solutions to make sure you are actually getting the performance benefits you expect.
On a quick note, there are other considerations besides performance when choosing a solution.
Every developer should have some understanding of Amdahl's law in order to estimate how much multiprocessing can speed things up under the given conditions.
Amdahl's law is a model for the relationship between the expected speedup of parallelized implementations of an algorithm relative to the serial algorithm, under the assumption that the problem size remains the same when parallelized.
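As a small worked example, the usual form of the law is speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized and n is the number of workers. The sketch below assumes a hypothetical workload with p = 0.9; the class and method names are made up for illustration.

```java
class AmdahlDemo {
    /**
     * Amdahl's law: speedup = 1 / ((1 - p) + p / n),
     * where p is the parallelizable fraction and n the number of workers.
     */
    static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and workers >= 1");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // Hypothetical workload: 90% of the run time can be parallelized.
        for (int n : new int[] {1, 2, 4, 8, 1_000_000}) {
            System.out.printf("workers=%d speedup=%.2f%n", n, speedup(0.9, n));
        }
    }
}
```

With p = 0.9 the speedup can never exceed 1 / (1 - p) = 10, no matter how many threads or processes are added, because the serial 10% always remains.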
This is a good read: Amdahl's law
The gain is determined by how long it takes to map/reduce the data.
If, for example, the files are loaded on multiple machines to begin with (think of it like sharding the file system), there's no lag getting the data. If the data is coming from a single location, you're limited by that mechanism.
Then the data has to be combined/aggregated; without knowing more, it's impossible to guess what that costs. If all processing depends on having all the data, it's a bigger hit than if the final results can be calculated independently.
You have a very small number of very small files: unless what you're doing is computationally expensive, I doubt it'd be worth the effort, but it's difficult to say. Assuming no network/disk bottlenecks you'll get a (very) roughly linear speedup with a delta for aggregating results. The true speedup/delta depends on a bunch of factors we don't know much about at this point.
OTOH, you could set up a small Hadoop setup and just try it and see what happens.
Check the docs on your JVM to see if it supports multithreading. I'm pretty sure the Sun ones do. Java Concurrency in Practice is the place to start for multithreading.
The first part of your question is: is multiprocessing superior to multithreading, from a performance perspective? In a system with robust multithreading support, threads should always be superior to processes, from a performance perspective. There is more isolation between processes (no shared memory, unless explicitly set up via an IPC mechanism), so you might want to go the multiprocess route to keep dangerous threads from stepping on each other.
For data processing, threads should be the best way to go. If threads on your local machine aren't enough, I would skip past a multiprocess solution and go straight to a map-reduce system like Hadoop.
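For the thread-based route, a minimal sketch might look like the following. It assumes each CSV file can be processed independently. `countRows` is a hypothetical stand-in for the real per-file work, and the class name and temp files are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CsvPoolDemo {
    /** Counts the lines of one file; a placeholder for the real per-file processing. */
    static long countRows(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Processes every file on a fixed-size pool and sums the per-file results. */
    static long countAllRows(List<Path> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit all files first so they run concurrently, then collect the results.
            List<CompletableFuture<Long>> futures = files.stream()
                    .map(f -> CompletableFuture.supplyAsync(() -> countRows(f), pool))
                    .collect(Collectors.toList());
            return futures.stream().mapToLong(CompletableFuture::join).sum();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("csv-demo");
        Path a = Files.write(dir.resolve("a.csv"), List.of("id,name", "1,x", "2,y"));
        Path b = Files.write(dir.resolve("b.csv"), List.of("id,name", "3,z"));
        System.out.println(countAllRows(List.of(a, b), 4)); // prints 5
    }
}
```

The aggregation step here is a plain sum; if your results have to be merged in a more involved way, that merge is the serial part that limits the overall speedup.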
As to why multiprocess apps are mentioned, I think the author wants to be complete. Although a tutorial is not provided, a link to additional documentation is. The big disadvantage of using multiprocessing is that you have to deal with inter-process communication. Unlike threads, you can't just share some memory, throw some mutexes around it, and call it a day.
From the comments, it appears that there is some confusion about what "multiprocessing" actually is. Threads are constructs that must be created by your code. There are APIs for thread creation and management. Processes, though, can be created by hand on the command line. On a Unix box, do the following to run four instances (processes) of foo. Note that the final & is required.
$ ./foo & ./foo & ./foo & ./foo &
Now if you have an input file, bar, that foo needs to process, use something like split to break it up into four equal segments, and run foo on each one:
$ ./foo bar.0 > bar.0.out & ./foo bar.1 > bar.1.out & ./foo bar.2 > bar.2.out & ./foo bar.3 > bar.3.out &
Finally, you will need to combine the bar.?.out files. Running a test like this should give you some feel for whether using heavy-weight processes is a good idea for your application. If you have already built a multi-threaded application, that will probably be just fine. But feel free to run some experiments to see if processes work better. Once you are sure that processes are the way to go, reorganize your code to use ProcessBuilder to spin up the processes yourself.
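The shell experiment above could be mirrored from Java roughly as follows. This is a sketch under assumptions: `WorkerLauncher` is a made-up name, and `java -version` stands in for the real `./foo bar.N` command so that the example runs anywhere a JVM is installed.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class WorkerLauncher {
    /** Starts one child process per command and returns their exit codes in order. */
    static List<Integer> runAll(List<List<String>> commands)
            throws IOException, InterruptedException {
        List<Process> running = new ArrayList<>();
        for (int i = 0; i < commands.size(); i++) {
            ProcessBuilder pb = new ProcessBuilder(commands.get(i));
            pb.redirectErrorStream(true);
            // Each worker writes to its own file, like "> bar.N.out" in the shell.
            pb.redirectOutput(File.createTempFile("worker-" + i + "-", ".out"));
            // start() returns immediately, like a trailing "&" in the shell.
            running.add(pb.start());
        }
        List<Integer> exitCodes = new ArrayList<>();
        for (Process p : running) {
            exitCodes.add(p.waitFor()); // block until this worker has finished
        }
        return exitCodes;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder command: replace with the real worker and its input segment.
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        List<List<String>> commands = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            commands.add(List.of(javaBin, "-version"));
        }
        System.out.println(runAll(commands));
    }
}
```

Combining the per-worker output files is left out here, just as it is a separate step in the shell version.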
There are several ways to start a new process in Java:
ProcessBuilder
Runtime.exec()
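With ProcessBuilder:
import java.io.File;
import java.lang.ProcessBuilder.Redirect;
import java.util.Map;

// The command and its arguments are passed as separate strings.
ProcessBuilder pb =
    new ProcessBuilder("myCommand", "myArg1", "myArg2");
// The child's environment starts as a copy of the parent's and can be edited.
Map<String, String> env = pb.environment();
env.put("VAR1", "myValue");
env.remove("OTHERVAR");
env.put("VAR2", env.get("VAR1") + "suffix");
// Working directory of the child process.
pb.directory(new File("myDir"));
File log = new File("log");
// Merge stderr into stdout and append both to the log file.
pb.redirectErrorStream(true);
pb.redirectOutput(Redirect.appendTo(log));
Process p = pb.start();
// stdin is still a pipe from the parent; stdout goes to the log, so the parent reads EOF.
assert pb.redirectInput() == Redirect.PIPE;
assert pb.redirectOutput().file() == log;
assert p.getInputStream().read() == -1;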
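With Runtime:
import java.util.concurrent.TimeUnit;

Runtime r = Runtime.getRuntime();
Process p = r.exec("firefox");
// Wait up to 10 seconds for the process to exit, then kill it (a no-op if it already exited).
p.waitFor(10, TimeUnit.SECONDS);
p.destroy();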
With Apache Commons Exec:
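import org.apache.commons.exec.CommandLine;
import org.apache.commons.exec.DefaultExecutor;

// "file" is a java.io.File defined elsewhere.
String line = "AcroRd32.exe /p /h " + file.getAbsolutePath();
CommandLine cmdLine = CommandLine.parse(line);
DefaultExecutor executor = new DefaultExecutor();
int exitValue = executor.execute(cmdLine);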
I am curious to know how/whether multiprocessing would speed up the operations even more?
No, in fact it would likely make it worse. If you were to switch from multithreading to multiprocessing, you would effectively launch the JVM multiple times, and starting up a JVM is not cheap. In fact, the way the JVM on your desktop machine starts is different from the way an enterprise company starts its JVM, just to reduce the wait for applets to launch for the typical end user.
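If you want a rough number for that startup cost on your own machine, a sketch like the one below can help. It assumes the `java` launcher lives under the running JVM's `java.home`, and the class name `JvmStartupTimer` is made up. Since `-version` does no real work, the result is only a lower bound for what each extra worker process would cost.

```java
import java.io.IOException;
import java.nio.file.Paths;

class JvmStartupTimer {
    /** Launches a child JVM that only prints its version; returns elapsed wall-clock ms. */
    static long timeChildJvmMillis() throws IOException, InterruptedException {
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        ProcessBuilder pb = new ProcessBuilder(javaBin, "-version");
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD); // requires Java 9 or later
        long start = System.nanoTime();
        int exit = pb.start().waitFor();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        if (exit != 0) {
            throw new IllegalStateException("child JVM exited with " + exit);
        }
        return elapsedMillis;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("child JVM startup: " + timeChildJvmMillis() + " ms");
    }
}
```

Compare that figure with how long one of your CSV files takes to process in a thread; if the files are small, the startup cost alone may dominate.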