Question
I have some duplicated entities in my App Engine datastore (not the whole row, but most of its fields).
What's the best way to find them?
Both integer and string fields are duplicated, in case comparing one type is faster than the other.
Thanks!
Answer 1:
A crude but quick approach would be to take the fields you care about, concatenate them into one long string, and use that string as the key of a DB_Unique entity that references the original entity. Each time you call DB_Unique.get_or_insert(), verify that the reference points to the entity you are processing; if it does not, you have a duplicate. This should probably be done in a map reduce.
Something like:
from google.appengine.ext import db

class DB_Unique(db.Model):
    # Points back at the first entity seen with this field combination.
    r = db.ReferenceProperty()

class DB_Obj(db.Model):
    a = db.IntegerProperty()
    b = db.StringProperty()
    c = db.StringProperty()

# Executed for each DB_Obj...
def mapreduce(entity):
    key = '%s_%s_%s' % (entity.a, entity.b, entity.c)
    res = DB_Unique.get_or_insert(key, r=entity)
    # Compare stored keys directly so the referenced entity is not fetched.
    if DB_Unique.r.get_value_for_datastore(res) != entity.key():
        # We have a possible collision: verify and delete?
        # The two entities are res.r (seen first) and entity.
        pass
A couple of edge cases might crop up. For example, two entities whose (b, c) values are ('a_b', '') and ('a', 'b_') respectively both produce the concatenation 'a_b_'. To avoid this, use a separator character you know never appears in your strings instead of '_', or make DB_Unique.r a list of references and compare all of them.
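To make the separator edge case concrete, here is a minimal sketch of one alternative way to build the key. It is not part of the original answer: the helper names naive_key and make_unique_key are hypothetical, and JSON encoding plus SHA-256 hashing are assumptions chosen for illustration. Hashing keeps the key name short, but two different field tuples could in principle share a digest, so the verification step described above is still worth keeping.

```python
import hashlib
import json

def naive_key(a, b, c):
    # Same scheme as the snippet above: fields joined with '_'.
    return '%s_%s_%s' % (a, b, c)

def make_unique_key(a, b, c):
    # Hypothetical helper: JSON quotes and escapes each string field, so
    # field boundaries stay unambiguous whatever characters b and c contain.
    encoded = json.dumps([a, b, c], ensure_ascii=True)
    # Hash the encoding to get a fixed-length key name.
    return hashlib.sha256(encoded.encode('utf-8')).hexdigest()

# The two problem entities from the example above.
first = (1, 'a_b', '')
second = (1, 'a', 'b_')

print(naive_key(*first) == naive_key(*second))              # naive keys collide
print(make_unique_key(*first) == make_unique_key(*second))  # encoded keys differ
```

The result of make_unique_key could be passed to DB_Unique.get_or_insert() in place of the '_'-joined string.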
Answer 2:
If this is a one-time or rare task, you might want to dump the whole datastore to your local machine (see "uploading and downloading data" in the App Engine docs), load the data into a sqlite3 database, and find the duplicate keys there.
Doing this programmatically on the GAE side could turn out to be quite tedious. It is entirely doable with tasks, but not easy.
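As a rough sketch of the local SQLite route, assuming the entities have already been downloaded and reduced to (key, a, b, c) tuples: the table name db_obj, the column names, the find_duplicates helper, and the sample rows below are all made up for illustration.

```python
import sqlite3

# Made-up sample rows standing in for downloaded entities:
# (entity key, a, b, c)
rows = [
    ("key1", 1, "foo", "bar"),
    ("key2", 1, "foo", "bar"),
    ("key3", 2, "foo", "baz"),
]

def find_duplicates(rows):
    """Return (a, b, c, count, keys) for each field combination seen more than once."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE db_obj ("
        "entity_key TEXT PRIMARY KEY, a INTEGER, b TEXT, c TEXT)"
    )
    conn.executemany("INSERT INTO db_obj VALUES (?, ?, ?, ?)", rows)
    # Group on the fields that define a duplicate and keep groups of 2 or more.
    cur = conn.execute(
        "SELECT a, b, c, COUNT(*), GROUP_CONCAT(entity_key) "
        "FROM db_obj GROUP BY a, b, c HAVING COUNT(*) > 1 "
        "ORDER BY a, b, c"
    )
    result = cur.fetchall()
    conn.close()
    return result

for a, b, c, count, keys in find_duplicates(rows):
    print(a, b, c, count, sorted(keys.split(",")))
```

Grouping on the tuple of columns avoids the separator problem entirely, since SQLite compares each field on its own rather than a concatenated string.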
Source: https://stackoverflow.com/questions/4798858/find-duplicates-in-app-engine-datastore