Improving the performance of preprocessing a large set of documents

Posted by 喜夏-厌秋 on 2019-12-02 22:33:20

Question


I am working on a plagiarism detection framework in Java. My document set contains about 100 documents, and I have to preprocess them and store them in a suitable data structure. My main question is how to process a large set of documents efficiently while avoiding bottlenecks. The focus of my question is how to improve preprocessing performance.

Thanks

Regards Nuwan


Answer 1:


You're a bit lacking on specifics there. Appropriate optimizations are going to depend upon things like the document format, the average document size, how you are processing them, and what sort of information you are storing in your data structure. Without knowing any of these, some general optimizations are:

  1. Assuming that the pre-processing of a given document is independent of the pre-processing of any other document, and assuming you are running on a multi-core CPU, your workload is a good candidate for multi-threading. Allocate one thread per CPU core, and farm out jobs to your threads. Then you can process multiple documents in parallel.

  2. More generally, do as much in memory as you can. Try to avoid reading from/writing to disk as much as possible. If you must write to disk, try to wait until you have all the data you want to write, and then write it all in a single batch.
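
To illustrate the second point, here is a minimal sketch that keeps all intermediate results in memory and touches the disk only once at the end. The class name `BatchedOutput`, the `preprocess` step (lower-casing and whitespace normalization), and the one-line-per-document output format are assumptions made for illustration; the question does not say what the real preprocessing or storage looks like.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class BatchedOutput {
    // Hypothetical preprocessing step: lower-case the text and collapse whitespace runs.
    static String preprocess(String text) {
        return text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
    }

    // Preprocess every document in memory; nothing is written to disk here.
    static List<String> preprocessAll(List<String> documents) {
        List<String> results = new ArrayList<>(documents.size());
        for (String doc : documents) {
            results.add(preprocess(doc));
        }
        return results;
    }

    // Write all results with a single call instead of one write per document.
    static void writeBatch(Path target, List<String> results) {
        try {
            Files.write(target, results, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> docs = Arrays.asList("The  Quick brown FOX", "  jumps over\tthe lazy dog ");
        Path out = Files.createTempFile("preprocessed", ".txt");
        writeBatch(out, preprocessAll(docs));
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    }
}
```

With only about 100 documents, holding every result in a list is unlikely to strain memory, but for much larger or much bigger documents you may need to flush in chunks instead.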




Answer 2:


You give very little information on which to make any good suggestions.

My default would be to process them using an executor backed by a thread pool with as many threads as there are cores in your machine, with each thread processing one document at a time.
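
A rough sketch of that approach with `java.util.concurrent` might look like the following. The class name `ParallelPreprocessor` and the `preprocess` method (which just counts whitespace-separated tokens) are hypothetical stand-ins for the real per-document work, and the sketch assumes that documents can be processed independently of each other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelPreprocessor {
    // Hypothetical per-document work: count whitespace-separated tokens.
    static int preprocess(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    static List<Integer> preprocessAll(List<String> documents) {
        // One worker thread per available core.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per document; the pool hands tasks to idle threads.
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String doc : documents) {
                tasks.add(() -> preprocess(doc));
            }
            // invokeAll blocks until all tasks finish and returns futures in task order.
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while preprocessing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A preprocessing task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> counts = preprocessAll(Arrays.asList("one two three", "", "four  five"));
        System.out.println(counts); // prints [3, 0, 2]
    }
}
```

Sizing the pool to the core count suits CPU-bound preprocessing; if the work is dominated by reading files from disk, a different pool size may perform better, so it is worth measuring.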



Source: https://stackoverflow.com/questions/5691437/improving-performance-of-preprocessing-large-set-of-documents
