Finding a duplicate in an HDF5 PyTables table with 500e6 rows
Problem

I have a large (> 500e6 rows) dataset that I've put into a PyTables database. Let's say the first column is an ID and the second column is a counter for each ID. Each ID-counter combination has to be unique. There is one non-unique row among the 500e6 rows, and I'm trying to find it.

As a starter I've done something like this:

    index1 = db.cols.id.create_index()
    index2 = db.cols.counts.create_index()
    for row in db:
        # look up every row that shares this row's (id, counts) pair
        query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
        result = db.read_where(query)
        if len(result) > 1:
            print(row)

It's a brute-force method, I'll admit. Any suggestions on improvements?

Update
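As a point of comparison with the per-row query above, the uniqueness check itself can be sketched as a single sort followed by a comparison of neighbouring rows. This is only a minimal NumPy sketch and not something from the original post: the function name `find_duplicate_pairs` and the toy arrays are hypothetical, and it assumes both columns can be held in memory at once.

```python
import numpy as np

def find_duplicate_pairs(ids, counts):
    """Return the (id, counts) pairs that occur more than once."""
    ids = np.asarray(ids)
    counts = np.asarray(counts)
    # lexsort uses the LAST key as the primary key: order by id, then counts.
    order = np.lexsort((counts, ids))
    s_ids = ids[order]
    s_counts = counts[order]
    # After sorting, equal pairs are adjacent, so compare each row
    # with the row just before it.
    same = (s_ids[1:] == s_ids[:-1]) & (s_counts[1:] == s_counts[:-1])
    dup_ids = s_ids[1:][same]
    dup_counts = s_counts[1:][same]
    # A pair repeated three or more times is reported only once.
    return sorted(set(zip(dup_ids.tolist(), dup_counts.tolist())))

# Toy data standing in for the two table columns (hypothetical values).
ids = np.array([1, 1, 2, 2, 3, 2])
counts = np.array([0, 1, 0, 1, 0, 1])
duplicates = find_duplicate_pairs(ids, counts)
print(duplicates)  # [(2, 1)]
```

With PyTables, the two arrays could presumably be read with `db.cols.id[:]` and `db.cols.counts[:]`. For 500e6 rows of 64-bit integers that is several gigabytes per column plus the sort index, so whether this fits depends on the machine. The sort costs O(n log n) in total, whereas the loop in the question issues one indexed query per row.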