Question
I have a file (size = ~1.9 GB) which contains ~220,000,000 (~220 million) words / strings. There are duplicates: roughly one duplicate word for every 100 words.
In my second program, I want to read the file. I can read it line by line using a BufferedReader.
Now, to remove duplicates, we can use a Set (and its implementations), but a Set has problems, as described in the following 3 scenarios:
- With the default JVM heap size, the Set can hold up to 0.7-0.8 million words before an OutOfMemoryError.
- With a 512 MB JVM heap, the Set can hold up to 5-6 million words before the OOM error.
- With a 1024 MB JVM heap, the Set can hold up to 12-13 million words before the OOM error. Also, after 10 million records have been added to the Set, operations become extremely slow; for example, adding the next ~4,000 records took 60 seconds.
I am restricted from increasing the JVM heap size further, and I want to remove the duplicate words from the file.
Please let me know if you have any ideas about other ways/approaches to remove duplicate words from such a gigantic file using Java. Many thanks :)
Additional info: the words are basically alphanumeric IDs that are unique in our system, so they are not plain English words.
Answer 1:
Use merge sort and remove the duplicates in a second pass. You could even remove the duplicates while merging (just keep the latest word added to the output in RAM and compare the candidates against it as well).
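A minimal sketch of the merge step with deduplication folded in, assuming the file has already been split into sorted chunks written one word per line (the two-way merge and the use of File arguments are illustrative; a real external sort would merge more chunks at once):

import java.io.*;

// Sketch of one merge step: merges two already-sorted word files and drops
// duplicates by remembering the last word written to the output.
public class DedupMerge {
    public static void merge(File a, File b, File out) throws IOException {
        try (BufferedReader ra = new BufferedReader(new FileReader(a));
             BufferedReader rb = new BufferedReader(new FileReader(b));
             BufferedWriter w = new BufferedWriter(new FileWriter(out))) {
            String x = ra.readLine(), y = rb.readLine(), last = null;
            while (x != null || y != null) {
                String next;
                if (y == null || (x != null && x.compareTo(y) <= 0)) {
                    next = x; x = ra.readLine();
                } else {
                    next = y; y = rb.readLine();
                }
                if (!next.equals(last)) {   // skip duplicates while merging
                    w.write(next);
                    w.newLine();
                    last = next;
                }
            }
        }
    }
}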
Answer 2:
Divide the huge file into 26 smaller files based on the first letter of each word. If any of those letter files is still too large, split it again by the second letter.
Process each of the letter files separately using a Set to remove duplicates.
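A rough sketch of that idea, assuming one word per line; since the asker's words are alphanumeric IDs, this partitions on the first character rather than on letters only, and the file names (words.txt, part_*.txt, unique.txt) are placeholders:

import java.io.*;
import java.util.*;

// Sketch: split the big file into partitions keyed by the first character,
// then deduplicate each (now much smaller) partition with an in-memory Set.
public class PartitionDedup {
    public static void main(String[] args) throws IOException {
        Map<Character, BufferedWriter> parts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("words.txt"))) {
            String word;
            while ((word = in.readLine()) != null) {
                if (word.isEmpty()) continue;
                char c = Character.toLowerCase(word.charAt(0));
                BufferedWriter w = parts.get(c);
                if (w == null) {
                    w = new BufferedWriter(new FileWriter("part_" + c + ".txt"));
                    parts.put(c, w);
                }
                w.write(word);
                w.newLine();
            }
        }
        for (BufferedWriter w : parts.values()) w.close();

        // Second pass: each partition should now fit in a Set on its own.
        try (BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"))) {
            for (char c : parts.keySet()) {
                Set<String> seen = new HashSet<>();
                try (BufferedReader in = new BufferedReader(new FileReader("part_" + c + ".txt"))) {
                    String word;
                    while ((word = in.readLine()) != null) {
                        if (seen.add(word)) {
                            out.write(word);
                            out.newLine();
                        }
                    }
                }
            }
        }
    }
}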
Answer 3:
You might be able to use a trie data structure to do the job in one pass. It has advantages that recommend it for this type of problem: lookup and insert are quick, and its representation is relatively space efficient. You might be able to represent all of your words in RAM.
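A minimal trie sketch under the assumption that the IDs use only digits and ASCII letters (the flat child array keeps it simple, at the cost of some space a real implementation would compress away). add() returns true only the first time a word is inserted, so writing a word to the output only when add() returns true removes duplicates in a single pass:

// Minimal trie over an assumed alphanumeric alphabet.  add() doubles as the
// duplicate check: it returns true only on the first insertion of a word.
class Trie {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    private static class Node {
        Node[] children = new Node[ALPHABET.length()];
        boolean isWord;
    }

    private final Node root = new Node();

    boolean add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int idx = ALPHABET.indexOf(word.charAt(i));
            if (idx < 0) throw new IllegalArgumentException("unexpected character: " + word.charAt(i));
            if (node.children[idx] == null) node.children[idx] = new Node();
            node = node.children[idx];
        }
        if (node.isWord) return false;  // already present -> duplicate
        node.isWord = true;
        return true;                    // first occurrence
    }
}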
Answer 4:
If you sort the items, duplicates will be easy to detect and remove, as the duplicates will bunch together.
There is code here you could use to mergesort the large file: http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
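Building on that: once the file is sorted (for example with the external mergesort linked above), a single streaming pass that remembers only the previous line is enough to drop the duplicates. A small sketch, assuming one word per line and placeholder file names:

import java.io.*;

// Sketch: in a sorted file, equal words are adjacent, so comparing each line
// with the previous one removes all duplicates in one streaming pass.
public class DedupSorted {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("sorted.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"))) {
            String prev = null, line;
            while ((line = in.readLine()) != null) {
                if (!line.equals(prev)) {
                    out.write(line);
                    out.newLine();
                    prev = line;
                }
            }
        }
    }
}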
Answer 5:
For large files I try not to read the data into memory, but instead operate on a memory-mapped file and let the OS page memory in and out as needed. If your Set structures contain offsets into this memory-mapped file instead of the actual strings, they will consume significantly less memory.
Check out this article:
http://javarevisited.blogspot.com/2012/01/memorymapped-file-and-io-in-java.html
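One way to realize the "offsets instead of strings" idea is a small key class whose hashCode/equals read the bytes lazily from the mapped buffer, so the Set only holds fixed-size key objects. This is a sketch of that key, not a complete program:

import java.nio.MappedByteBuffer;

// Sketch: a Set key that stores only (offset, length) into the memory-mapped
// file and reads the bytes on demand for hashing and equality checks.
class MappedKey {
    private final MappedByteBuffer buf;
    private final int offset;
    private final int length;

    MappedKey(MappedByteBuffer buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    @Override public int hashCode() {
        int h = 1;
        for (int i = 0; i < length; i++) h = 31 * h + buf.get(offset + i);
        return h;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof MappedKey)) return false;
        MappedKey other = (MappedKey) o;
        if (other.length != length) return false;
        for (int i = 0; i < length; i++) {
            if (buf.get(offset + i) != other.buf.get(other.offset + i)) return false;
        }
        return true;
    }
}

The mapping itself would come from FileChannel.map(FileChannel.MapMode.READ_ONLY, 0, size); note that a single mapping is limited to about 2 GB, which a 1.9 GB file just fits, and each key object plus its HashSet entry still costs a few dozen bytes of heap, so memory use shrinks but does not disappear.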
Answer 6:
Question: Are these really WORDS, or are they something else -- phrases, part numbers, etc?
For WORDS in a common spoken language, one would expect that after the first couple of thousand you'd have found most of the unique words, so all you really need to do is read a word in, check it against a dictionary, skip it if found, and if not found add it to the dictionary and write it out.
In this case your dictionary is only a few thousand words large. And you don't need to retain the source file since you write out the unique words as soon as you find them (or you can simply dump the dictionary when you're done).
Answer 7:
If you have the possibility to insert the words into a temporary table of a database (using batch inserts), then it would be a SELECT DISTINCT against that table.
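A hedged JDBC sketch of that approach; the H2 JDBC URL, table name, column length, and file names are placeholders, and the H2 driver (or whichever database you pick) would need to be on the classpath:

import java.io.*;
import java.sql.*;

// Sketch: batch-insert every word into a temporary table, then let the
// database deduplicate with SELECT DISTINCT.
public class DbDedup {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:./dedup");
             BufferedReader in = new BufferedReader(new FileReader("words.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"))) {

            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE words(word VARCHAR(64))");  // 64 = assumed max ID length
            }

            try (PreparedStatement ps = con.prepareStatement("INSERT INTO words VALUES (?)")) {
                String word;
                int n = 0;
                while ((word = in.readLine()) != null) {
                    ps.setString(1, word);
                    ps.addBatch();
                    if (++n % 10_000 == 0) ps.executeBatch();  // flush in batches
                }
                ps.executeBatch();
            }

            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT DISTINCT word FROM words")) {
                while (rs.next()) {
                    out.write(rs.getString(1));
                    out.newLine();
                }
            }
        }
    }
}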
Answer 8:
One classic way to solve this kind of problem is a Bloom filter. Basically you hash your word a number of times and for each hash result you set some bits in a bit vector. If you're checking a word and all the bits from its hashes are set in the vector, you've probably seen it before and it's a duplicate (you can make this probability arbitrarily low by increasing the number of hashes/bits in the vector).
This is how early spell checkers worked. They knew if a word was in the dictionary, but they couldn't tell you what the correct spelling was, because the filter only tells you whether the current word has been seen.
There are a number of open source implementations out there, including java-bloomfilter.
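A hand-rolled sketch of the idea, with the k bit positions derived from two base hashes (the sizing and the hash derivation here are illustrative, not tuned for 220 million words). Because of false positives, a few genuinely unique words could be wrongly dropped; if that matters, the "probably seen" words can be collected and re-checked exactly in a second pass:

import java.util.BitSet;

// Sketch of a simple Bloom filter: k bit positions per word are derived from
// two base hashes (double hashing) and set in one shared BitSet.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    SimpleBloomFilter(int sizeInBits, int hashes) {
        this.bits = new BitSet(sizeInBits);
        this.size = sizeInBits;
        this.hashes = hashes;
    }

    /** Marks the word and returns true if it was probably seen before (false positives possible). */
    boolean probablySeenAndMark(String word) {
        int h1 = word.hashCode();
        int h2 = Integer.reverse(h1) ^ 0x9E3779B9;  // second, roughly independent hash
        boolean seen = true;
        for (int i = 0; i < hashes; i++) {
            int idx = Math.abs((h1 + i * h2) % size);
            if (!bits.get(idx)) seen = false;
            bits.set(idx);
        }
        return seen;
    }
}

Used in a loop over the file, a word would be written to the output only when probablySeenAndMark(word) returns false.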
Answer 9:
I'd tackle this in Java the same way as in every other language: Write a deduplication filter and pipe it as often as necessary.
This is what I mean (in pseudo code):
- Input parameters: Offset, Size
- Allocate a searchable structure of size Size (= a Set, but it need not be one)
- Read Offset elements from stdin (or until EOF is encountered) and just copy them to stdout
- Read Size elements from stdin (or until EOF), store them in the Set; if a duplicate, drop it, else write it to stdout
- Read elements from stdin until EOF; if they are in the Set then drop them, else write them to stdout
Now pipe as many instances as you need (if storage is no problem, maybe only as many as you have cores) with increasing Offsets and a sane Size. This lets you use more cores, as I suspect the process is CPU bound. You can even use netcat and spread processing over more machines, if you are in a hurry.
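A sketch of one such filter stage in Java (the class name, window sizes, and the reliance on one word per line are assumptions):

import java.io.*;
import java.util.*;

// Sketch of one pipeline stage: pass the first OFFSET words through untouched,
// remember the next SIZE words and drop their later duplicates, then stream
// the rest, dropping anything already in this stage's set.
public class DedupStage {
    public static void main(String[] args) throws IOException {
        long offset = Long.parseLong(args[0]);
        int size = Integer.parseInt(args[1]);
        Set<String> seen = new HashSet<>(size * 2);

        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(System.out))) {
            String word;
            long passed = 0, inWindow = 0;
            while ((word = in.readLine()) != null) {
                if (passed < offset) {                 // phase 1: copy through
                    passed++;
                    out.write(word); out.newLine();
                } else if (inWindow < size) {          // phase 2: remember this stage's window
                    inWindow++;
                    if (seen.add(word)) { out.write(word); out.newLine(); }
                } else if (!seen.contains(word)) {     // phase 3: filter against the window
                    out.write(word); out.newLine();
                }
            }
        }
    }
}

Chained, this would look something like java DedupStage 0 5000000 < words.txt | java DedupStage 5000000 5000000 > result.txt (the numbers are placeholders).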
Answer 10:
Even in English, which has a huge number of words for a natural language, the upper estimates are only about 80,000 words. Based on that, you could just use a HashSet and add all your words to it (probably in all lower case to avoid case issues):
Set<String> words = new HashSet<String>();
String word;
while ((word = reader.readLine()) != null) {  // reader: the BufferedReader from the question
    words.add(word.toLowerCase());
}
If they are real words, this isn't going to cause memory problems, and it will be pretty fast too!
Answer 11:
To avoid having to worry too much about implementation, you should use a database system, either plain old relational SQL or a No-SQL solution. I'm pretty sure you could use e.g. Berkeley DB Java Edition and then do (pseudo code):
for (word : stream) {
    if (!DB.exists(word)) {
        DB.put(word)
        outstream.add(word)
    }
}
The problem is in essence easy: you need to store things on disk because there is not enough memory, then use either sorting, O(N log N) (unnecessary), or hashing, O(N), to find the unique words.
If you want a solution that will very likely work but is not guaranteed to, use an LRU-type hash table. According to the empirical Zipf's law, you should be OK.
A follow-up question to some smart guy out there: what if I have a 64-bit machine and set the heap size to, say, 12 GB, shouldn't virtual memory take care of the problem (although not in an optimal way), or is Java not designed this way?
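A concrete version of the pseudocode above, using class names from the Berkeley DB Java Edition API (com.sleepycat.je); treat it as an untested sketch, with paths and file names as placeholders:

import java.io.*;
import java.nio.charset.StandardCharsets;
import com.sleepycat.je.*;

// Sketch: putNoOverwrite succeeds only the first time a key is stored, so it
// doubles as the exists-check from the pseudocode above.
public class BdbDedup {
    public static void main(String[] args) throws Exception {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("dedup-env"), envConfig);  // directory must exist
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "seenWords", dbConfig);

        try (BufferedReader in = new BufferedReader(new FileReader("words.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"))) {
            String word;
            DatabaseEntry data = new DatabaseEntry(new byte[0]);
            while ((word = in.readLine()) != null) {
                DatabaseEntry key = new DatabaseEntry(word.getBytes(StandardCharsets.UTF_8));
                if (db.putNoOverwrite(null, key, data) == OperationStatus.SUCCESS) {
                    out.write(word);  // first time this word is seen
                    out.newLine();
                }
            }
        }
        db.close();
        env.close();
    }
}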
Answer 12:
Quicksort would be a good option over Mergesort in this case because it needs less memory. This thread has a good explanation as to why.
Answer 13:
The most performant solutions arise from omitting unnecessary stuff. You are only looking for duplicates, so don't store the words themselves, store their hashes. But wait, you are not interested in the hashes either, only in whether they were already seen - so do not store them. Treat the hash as a really large number, and use a bitset to check whether you have already seen that number.
So your problem boils down to a really big, sparsely populated bitmap, with a size depending on the hash width. If your hash is up to 32 bits, you can use a riak bitmap.
... gone thinking about really big bitmap for 128+ bit hashes %) (I'll be back )
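A sketch of the hash-plus-bitset idea; note that java.util.BitSet can address at most about 2^31 bits, so the 32-bit hash is masked down to 30 bits here (a 128 MB bitmap, which still has to fit in the heap), and hash collisions mean some distinct words can be wrongly treated as duplicates, much like a one-hash Bloom filter. File names are placeholders:

import java.io.*;
import java.util.BitSet;

// Sketch: keep one bit per (masked) hash value instead of the words themselves.
// Collisions can wrongly flag a new word as a duplicate, so this is approximate.
public class HashBitmapDedup {
    private static final int BITS = 1 << 30;  // 2^30 bits = 128 MB

    public static void main(String[] args) throws IOException {
        BitSet seen = new BitSet(BITS);
        try (BufferedReader in = new BufferedReader(new FileReader("words.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"))) {
            String word;
            while ((word = in.readLine()) != null) {
                int idx = word.hashCode() & (BITS - 1);  // mask the 32-bit hash to 30 bits
                if (!seen.get(idx)) {
                    seen.set(idx);
                    out.write(word);
                    out.newLine();
                }
            }
        }
    }
}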
Source: https://stackoverflow.com/questions/12501112/how-to-remove-duplicate-words-using-java-when-words-are-more-than-200-million