How to remove stop words from a large collection of files more efficiently?


Question


I have 200,000 files to process, and I have to extract tokens from each file. The total size of all the files is 1.5 GB. When I wrote code just to extract the tokens from each file, it worked well; overall execution time was about 10 minutes.

After that, I tried to remove stop words, and performance went down badly: it now takes 25 to 30 minutes.

I'm using the stop-word list from the website linked here; there are around 571 stop words. The general procedure is to read each stop word from a text file, one at a time, and compare it with each token in the file.

This is a stub of the code:

StringBuilder sb = new StringBuilder();
boolean flag = false;
for (String s : tokens) {
    // The stop-word file is re-read from disk for every single token.
    Scanner sc = new Scanner(new File("stopwords.txt"));
    while (sc.hasNext()) {
        if (sc.next().equals(s)) {
            flag = true;
            break;
        }
    }
    sc.close();
    // Keep the token only if it is not a stop word.
    if (!flag)
        sb.append(s + "\n");
    flag = false;
}
String str = sb.toString();

(Ignore minor errors in this stub.)

The code above is at least 10 times slower than the code below; it takes 50 to 60 minutes to execute.

StringBuilder sb = new StringBuilder();
String s = tokens.toString();
String str = s.replaceAll("StopWord1|Stopword2|Stopword3|........|LastStopWord"," ");

Its performance is far better; this takes 20 to 25 minutes.

Is there any better procedure?


Answer 1:


Of course this is slow. You are doing O(n²) comparisons: every token is compared against every stop word, and the stop-word file is re-read from disk for each token. You need to rethink your algorithm.

Read all the stop words into a HashSet<String> and then just check set.contains(word). This will improve your performance dramatically.
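
A minimal sketch of that approach is below, assuming stopwords.txt (the file from the question) has one word per line; the sample token list is just a placeholder for the tokens you extract from each file:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordFilter {
    public static void main(String[] args) throws IOException {
        // Read stopwords.txt once, up front; a HashSet gives O(1) lookups.
        Set<String> stopWords = new HashSet<>(
                Files.readAllLines(Paths.get("stopwords.txt"), StandardCharsets.UTF_8));

        // Placeholder token list; in the real program these come from each file.
        List<String> tokens = Arrays.asList("this", "is", "a", "sample", "sentence");

        StringBuilder sb = new StringBuilder();
        for (String s : tokens) {
            // Keep only tokens that are not stop words: one hash lookup per token.
            if (!stopWords.contains(s)) {
                sb.append(s).append('\n');
            }
        }
        System.out.print(sb);
    }
}

With roughly 571 stop words, this replaces up to 571 string comparisons plus a file read per token with a single hash lookup, so the stop-word check should become negligible compared with tokenization itself.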




Answer 2:


You should consider using the Apache Lucene API.

It provides functionality for indexing files, removing stop words, stemming tokens, searching, and computing document similarity based on LSA.
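
As a rough sketch of what stop-word removal with Lucene can look like (constructor signatures differ between Lucene versions, and older releases also require a Version argument, so treat the exact calls as assumptions rather than a drop-in snippet):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LuceneTokenizeDemo {
    // Tokenizes the text and drops English stop words using Lucene's analyzers.
    public static List<String> tokens(String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.getDefaultStopSet());
             TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }
}

The field name "body" is arbitrary here, and note that Lucene's analyzers also lower-case tokens, which may or may not be what you want for your 200,000 files.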



Source: https://stackoverflow.com/questions/22257598/how-to-remove-stop-words-from-a-large-collection-files-with-more-efficient-way
