How to remove stop words from a large collection of files more efficiently?


Question


I have 200,000 files to process, and I have to extract tokens from each file. The total size of all the files is 1.5 GB. When I wrote code just to extract the tokens from each file, it worked well; overall execution time was about 10 minutes.

After that, I tried to remove stop words, and performance went down badly: it now takes 25 to 30 minutes.

I'm using the stop-word list from the website linked here; there are around 571 stop words. The general procedure is to read each stop word from a text file, one at a time, and compare it with each token in the file.

This is a stub of the code:

StringBuilder sb = new StringBuilder();
boolean flag = false;
for (String s : tokens) {
    // The stop-word file is re-read from disk for every single token.
    Scanner sc = new Scanner(new File("stopwords.txt"));
    while (sc.hasNext()) {
        if (sc.next().equals(s)) {
            flag = true;
            break;
        }
    }
    sc.close();
    // Keep the token only if it is not a stop word.
    if (!flag)
        sb.append(s + "\n");
    flag = false;
}
String str = sb.toString();

(Ignore minor errors in this stub.)

The code above is at least 10 times slower than the code below; it takes 50 to 60 minutes to execute.

StringBuilder sb = new StringBuilder();
String s = tokens.toString();
String str = s.replaceAll("StopWord1|Stopword2|Stopword3|........|LastStopWord"," ");

Its performance is far better; this takes 20 to 25 minutes.

Is there any better procedure?


Answer 1:


Of course this is slow. You are doing O(n²) comparisons: every token is compared against every stop word, and the stop-word file is re-read from disk for each token. You need to rethink your algorithm.

Read all the stop words into a HashSet<String> and then just check set.contains(word). This will improve your performance dramatically.
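
A minimal sketch of that approach is below, assuming stopwords.txt (the file from the question) has one word per line; the sample token list is just a placeholder for the tokens you extract from each file:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordFilter {
    public static void main(String[] args) throws IOException {
        // Read stopwords.txt once, up front; a HashSet gives O(1) lookups.
        Set<String> stopWords = new HashSet<>(
                Files.readAllLines(Paths.get("stopwords.txt"), StandardCharsets.UTF_8));

        // Placeholder token list; in the real program these come from each file.
        List<String> tokens = Arrays.asList("this", "is", "a", "sample", "sentence");

        StringBuilder sb = new StringBuilder();
        for (String s : tokens) {
            // Keep only tokens that are not stop words: one hash lookup per token.
            if (!stopWords.contains(s)) {
                sb.append(s).append('\n');
            }
        }
        System.out.print(sb);
    }
}

With roughly 571 stop words, this replaces up to 571 string comparisons plus a file read per token with a single hash lookup, so the stop-word check should become negligible compared with tokenization itself.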




Answer 2:


You should consider using the Apache Lucene API.

It provides functionality for indexing files, removing stop words, stemming tokens, searching, and computing document similarity based on LSA.
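
As a rough sketch of what stop-word removal with Lucene can look like (constructor signatures differ between Lucene versions, and older releases also require a Version argument, so treat the exact calls as assumptions rather than a drop-in snippet):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LuceneTokenizeDemo {
    // Tokenizes the text and drops English stop words using Lucene's analyzers.
    public static List<String> tokens(String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.getDefaultStopSet());
             TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }
}

The field name "body" is arbitrary here, and note that Lucene's analyzers also lower-case tokens, which may or may not be what you want for your 200,000 files.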



Source: https://stackoverflow.com/questions/22257598/how-to-remove-stop-words-from-a-large-collection-files-with-more-efficient-way
