> I am a newbie using Java to do some data processing on CSV files. For that I use the multithreading capabilities of Java (thread pools) to batch-import the CSV files into Ja
The gain is determined by how long it takes to map/reduce the data.
If, for example, the files already live on multiple machines to begin with (think of it like sharding the file system), there's no lag getting the data. If the data is coming from a single location, you're limited by the throughput of that single source.
Then the data has to be combined/aggregated; without knowing more, it's impossible to guess that cost. If all processing depends on having all of the data in one place, you take a bigger hit than if the final results can be calculated independently per chunk.
You have a very small number of very small files: unless what you're doing is computationally expensive, I doubt it'd be worth the effort, but it's difficult to say. Assuming no network/disk bottlenecks, you'll get a (very) roughly linear speedup, minus a delta for aggregating the results. The actual speedup and delta depend on a bunch of factors we don't know much about at this point.
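To make the "linear speedup plus aggregation delta" point concrete, here is a minimal sketch of the thread-pool approach in plain Java. Everything in it is illustrative: the file list, the class name, and the per-file work (a simple row count) are placeholders for whatever your real processing is. The shape that matters is that each file is reduced to a partial result independently, and the only serial part is the cheap merge at the end.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: each CSV is reduced to a per-file partial result
// (here, just a row count) independently, then the partials are merged.
public class ParallelCsvImport {

    public static void main(String[] args) throws Exception {
        // Placeholder inputs; substitute your real file list.
        List<Path> files = List.of(Path.of("a.csv"), Path.of("b.csv"));

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            // "Map" phase: one task per file, all independent.
            List<Future<Long>> partials = new ArrayList<>();
            for (Path file : files) {
                partials.add(pool.submit(() -> countRows(file)));
            }

            // "Reduce" phase: the only serial step is summing the partials.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            System.out.println("total rows: " + total);
        } finally {
            pool.shutdown();
        }
    }

    // Per-file work: independent of every other file, so it parallelizes cleanly.
    private static long countRows(Path file) throws IOException {
        try (var lines = Files.lines(file)) {
            return lines.count();
        }
    }
}
```

If your real aggregation can't be expressed as a merge of independent partials, that final loop is where the extra cost shows up.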
OTOH, you could stand up a small Hadoop cluster and just try it to see what happens.
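If you go that route, a whole MapReduce job over the CSVs isn't much code either. The sketch below uses the stock Hadoop MapReduce API; the class names are made up, and it just counts rows per value of the first CSV column, but it shows the same map-then-aggregate split discussed above, with Hadoop handling the sharding and the shuffle for you.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical minimal job: counts rows per value of the first CSV column.
public class CsvCount {

    // Map phase: emit (firstColumn, 1) for every input line.
    public static class CsvMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            outKey.set(line.toString().split(",", 2)[0]);
            ctx.write(outKey, ONE);
        }
    }

    // Reduce phase: sum the counts for each distinct key.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "csv-count");
        job.setJarByClass(CsvCount.class);
        job.setMapperClass(CsvMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```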