finding a duplicate in a hdf5 pytable with 500e6 rows

Problem

I have a large (> 500e6 rows) dataset that I've put into a pytables database.

Let's say the first column is an ID and the second column is a counter for each ID. Each ID-counter combination has to be unique. There is one non-unique row among the 500e6 rows that I'm trying to find.
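For reference, the table description looks roughly like this (the class, file, and table names here are placeholders; only the column names id and counts, and the fact that the values fit in 16 bits, are fixed):

import tables as tb

class Record(tb.IsDescription):
    id = tb.UInt16Col()     # values are < 2**16
    counts = tb.UInt16Col()
    hash = tb.Int64Col()    # used later by the hashing solution below

h5file = tb.open_file('data.h5', mode='a')
th = h5file.create_table('/', 'mytable', Record)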

As a starter I've done something like this:

# Index both columns, then run one query per row.
th.cols.id.create_index()
th.cols.counts.create_index()

for row in th:
    query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
    result = th.read_where(query)
    if len(result) > 1:
        print(row['id'], row['counts'])

It's a brute-force method, I'll admit. Any suggestions for improvements?

Update

The current brute-force runtime is 8421 minutes.

Solution

Thanks for the input, everyone. I managed to get the runtime down to 2364.7 seconds using the following method:

import numpy as np
import tables as tb

# Pack (id, counts) into one integer: hash = id * 2**16 + counts.
# This is a perfect hash because both columns are < 2**16.
ex = tb.Expr('(x * 65536) + y', uservars={"x": th.cols.id, "y": th.cols.counts})
ex.set_output(th.cols.hash)
ex.eval()

# Build a completely sorted index (CSI) on the hash column.
# `filters` is the tables.Filters instance already used for this file.
indexrows = th.cols.hash.create_csindex(filters=filters)

# Walk the table in hash order and compare each row with its predecessor.
ref = None
dups = []
for row in th.itersorted(sortby=th.cols.hash):
    if row['hash'] == ref:
        dups.append(row['hash'])
    ref = row['hash']

# Unpack the duplicate hashes back into their (id, counts) parts.
print("ids: ", np.right_shift(np.array(dups, dtype=np.int64), 16))
print("counts: ", np.array(dups, dtype=np.int64) & (65536 - 1))

I can generate a perfect hash because my maximum values are less than 2^16: I am effectively bit-packing the two columns into a single 32-bit int.
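A quick sanity check of the packing arithmetic, with made-up values (not part of the actual run):

import numpy as np

id_, count = 1234, 42                  # hypothetical values, both < 2**16

packed = id_ * 65536 + count           # same as (id_ << 16) | count
assert packed >> 16 == id_             # recover the ID
assert packed & (65536 - 1) == count   # recover the counter

# The same unpacking works on whole arrays of packed hashes.
hashes = np.array([packed], dtype=np.int64)
print(np.right_shift(hashes, 16), hashes & (65536 - 1))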

Once the csindex is generated, it is fairly trivial to iterate over the sorted values and do a neighbor test for duplicates.

This method can probably be tweaked a bit, but I'm testing a few alternatives that may provide a more natural solution.


Answer 1:


Two obvious techniques come to mind: hashing and sorting.

A) Define a hash function that combines ID and Counter into a single, compact value.

B) Count how often each hash code occurs.

C) Select from your data all rows whose hash collides with another (this should be a much smaller data set).

D) Sort this data set to find the duplicates.

The hash function in A) needs to be chosen so that the hash structure fits into main memory while still providing enough selectivity. Maybe use two bitsets of size 2^30 or so for this. You can afford 5-10% collisions; that should still reduce the data set enough to allow fast in-memory sorting afterwards.

This is essentially a Bloom filter.
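A rough sketch of that two-pass idea with NumPy and PyTables (the file name, node path, block size, and bucket count are illustrative; the packed id * 65536 + counts key is borrowed from the question, and plain boolean arrays stand in for real bitsets, which would use 8x less memory):

import numpy as np
import tables as tb

SIZE = 2**30                               # number of hash buckets
seen = np.zeros(SIZE, dtype=np.bool_)      # bucket observed at least once
collided = np.zeros(SIZE, dtype=np.bool_)  # bucket observed more than once

with tb.open_file('data.h5', 'r') as h5file:
    table = h5file.get_node('/mytable')
    step = 1_000_000

    # Pass 1: mark every bucket; remember buckets that are hit twice.
    for start in range(0, table.nrows, step):
        block = table.read(start, start + step)
        h = (block['id'].astype(np.int64) * 65536 + block['counts']) % SIZE
        collided[h[seen[h]]] = True                    # repeats across blocks
        uniq, counts = np.unique(h, return_counts=True)
        collided[uniq[counts > 1]] = True              # repeats within a block
        seen[h] = True

    # Pass 2: keep only rows whose bucket collided (a much smaller set),
    # sort their exact packed keys, and compare neighbours.
    suspects = []
    for start in range(0, table.nrows, step):
        block = table.read(start, start + step)
        key = block['id'].astype(np.int64) * 65536 + block['counts']
        suspects.append(key[collided[key % SIZE]])

    keys = np.sort(np.concatenate(suspects))
    dups = keys[1:][keys[1:] == keys[:-1]]
    print("ids:", dups >> 16, "counts:", dups & (65536 - 1))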




Answer 2:


The brute-force approach that you've taken appears to require you to execute 500e6 queries, one for each row of the table. Although I think that the hashing and sorting approaches suggested in another answer are essentially correct, it's worth noting that pytables is supposedly already built for speed, and should already be expected to have these kinds of techniques effectively included "under the hood", so to speak.

I contend that the simple code you have written most likely does not yet take best advantage of the capabilities that pytables already makes available to you.

The documentation for create_index() says that the default settings are optlevel=6 and kind='medium'. It mentions that you can increase the speed of each of your 500e6 queries by decreasing the entropy of the index, and that you can decrease the entropy of your index to its minimum possible value (zero) either by choosing the non-default values optlevel=9 and kind='full', or, equivalently, by generating the index with a call to create_csindex() instead. According to the documentation, you pay a little more upfront by spending longer creating a better-optimized index, but it pays you back later by saving time across the series of queries that you have to repeat 500e6 times.
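As a minimal sketch of that change (using the same table handle and column names as in the question), the only difference from the original loop is how the indexes are built:

# Completely sorted indexes; equivalent to create_index(optlevel=9, kind='full').
th.cols.id.create_csindex()
th.cols.counts.create_csindex()

for row in th:
    query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
    if len(th.read_where(query)) > 1:
        print(row['id'], row['counts'])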

If optimizing your pytables column indices fails to speed up your code sufficiently, and you want to simply perform a massive sort on all of the rows and then search for duplicates by looking for matches in adjacent sorted rows, it's possible to perform a merge sort in O(N log(N)) time using relatively modest amounts of memory by sorting the data in chunks and then saving the chunks in temporary files on disk. Examples here and here demonstrate in principle how to do it in Python specifically. But you should really try optimizing your pytables index first, as that's likely to provide a much simpler and more natural solution in your particular case.
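For completeness, a minimal sketch of that chunked external sort (the table handle th and the chunk size are assumptions, and the packed id * 65536 + counts key is borrowed from the question; the temporary files are left for the OS to clean up):

import heapq
import tempfile

import numpy as np

CHUNK = 10_000_000  # rows sorted in memory at a time; tune to available RAM

def sorted_chunks(table):
    """Sort the table chunk by chunk, spilling each sorted run to disk."""
    for start in range(0, table.nrows, CHUNK):
        block = table.read(start, start + CHUNK)
        keys = np.sort(block['id'].astype(np.int64) * 65536 + block['counts'])
        tmp = tempfile.NamedTemporaryFile(delete=False)
        np.save(tmp, keys)
        tmp.close()
        yield np.load(tmp.name, mmap_mode='r')  # memory-mapped sorted run

# Lazily merge the sorted runs and compare neighbouring keys.
prev = -1  # packed keys are non-negative, so -1 never matches
for key in heapq.merge(*sorted_chunks(th)):
    if key == prev:
        print("duplicate id:", key >> 16, "counts:", key & (65536 - 1))
    prev = key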



Source: https://stackoverflow.com/questions/20743135/finding-a-duplicate-in-a-hdf5-pytable-with-500e6-rows
