How can I use the index-structures in ELKI?

Question

These are quotes from http://elki.dbs.ifi.lmu.de/ :

"Essentially, we bind the abstract distance query to a database, and then get a nearest neighbor search for this distance. At this point, ELKI will automatically choose the most appropriate kNN query class. If there exist an appropriate index for our distance function (not every index can accelerate every distance!), it will automatically be used here."

"The getKNNForDBID method may boil down to a slow linear scan, but when the database has an appropriate index, the index query will be used. Then the algorithm can run in O(n k log n) or even O(n k) time."

The question is: on what basis does ELKI choose whether to run the index query or not?

What is meant by "when the database has an appropriate index", and how can I guarantee that?

Another, unrelated question about the signature of the "run" method: why are there 3 signatures instead of just 1? What are the differences between them, and what are the criteria for deciding which signature to use?

Answer 1:

There is a howto page on this in the ELKI wiki: http://elki.dbs.ifi.lmu.de/wiki/HowTo/Index

Essentially, you have to add an index using -db.index. It will then be used automatically if the index supports the distance metric. The R*-Tree seems to be the most powerful. There also is a tutorial on adding R-tree indexing support for custom distance functions: http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SpatialDistanceFunctions

As for the second question: there is a run(Database) method in AbstractAlgorithm that uses introspection to check for alternative method signatures. It's a mess, but it is actually convenient to be able to choose one of the signatures. Just make sure your getInputTypeRestriction() matches. It makes sense when you work with multiple relations. As long as you live in the "everything is a (single) vector" way of thinking, it seems superfluous; but even then it's convenient to have a run(Database database, Relation<O> relation) signature that already has the data relation to process.
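To make the introspection idea concrete, here is a minimal sketch of how a generic run(Database) entry point could look up a more specific overload via reflection. ToyDatabase, ToyRelation, ToyAlgorithm and ToyOutlier are hypothetical stand-ins, not ELKI classes, and the sketch simply hands over the first relations in the database; ELKI itself selects relations according to getInputTypeRestriction().

```java
import java.lang.reflect.Method;

// Hypothetical stand-ins for ELKI's Relation and Database types.
class ToyRelation {
  final String name;

  ToyRelation(String name) {
    this.name = name;
  }
}

class ToyDatabase {
  final ToyRelation[] relations;

  ToyDatabase(ToyRelation... relations) {
    this.relations = relations;
  }
}

abstract class ToyAlgorithm {
  // Generic entry point: look for a run(ToyDatabase, ToyRelation...) overload
  // declared by the subclass and call it with the database's relations.
  public Object run(ToyDatabase db) {
    for (Method m : getClass().getMethods()) {
      Class<?>[] p = m.getParameterTypes();
      if (!m.getName().equals("run") || p.length < 2 || p[0] != ToyDatabase.class) {
        continue; // not one of the extended signatures
      }
      boolean relationsOnly = true;
      for (int i = 1; i < p.length; i++) {
        relationsOnly &= p[i] == ToyRelation.class;
      }
      if (!relationsOnly || p.length - 1 > db.relations.length) {
        continue; // wrong parameter types, or too few relations available
      }
      Object[] args = new Object[p.length];
      args[0] = db;
      System.arraycopy(db.relations, 0, args, 1, p.length - 1);
      try {
        return m.invoke(this, args);
      } catch (ReflectiveOperationException e) {
        throw new RuntimeException(e);
      }
    }
    throw new UnsupportedOperationException("no usable run signature found");
  }
}

// A toy algorithm that only implements the two-argument signature.
class ToyOutlier extends ToyAlgorithm {
  public String run(ToyDatabase db, ToyRelation relation) {
    return "processed " + relation.name;
  }
}

class RunDispatchDemo {
  public static void main(String[] args) {
    ToyDatabase db = new ToyDatabase(new ToyRelation("vectors"));
    System.out.println(new ToyOutlier().run(db));
  }
}
```

The subclass only writes the signature it needs, while callers that have nothing but a database can still use the generic one.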

Answer 2:

This is mostly a follow-up to @Anony-Mousse's post, which is quite on-target.

Indexes need to be added to the database by the user. There currently is no automatic indexing (as any index will require extra memory and construction time). -db.index is the parameter for this. Support for automatic indexing is on the wish list, but it requires carefully tuned cost models. On small data sets or high-dimensional data, or when the user doesn't need this type of query at all, adding an index will come at a cost.

The database will forward the query request to each index in order. The first index to offer acceleration wins. If no index returns an accelerated query, the database will fall back to a linear scan, unless the hint DatabaseQuery.HINT_OPTIMIZED_ONLY was given. In this case, null will be returned. A linear scan can be forced via QueryUtil, which is mostly useful for unit testing indexes.
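The selection logic described above can be sketched roughly as follows. This is only an illustration of the control flow: KnnQuery, Index, SketchDatabase and the boolean optimizedOnly flag are hypothetical names made up for this example (the flag stands in for the DatabaseQuery.HINT_OPTIMIZED_ONLY hint), and distance functions are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are NOT ELKI's real interfaces.
interface KnnQuery {
  String describe();
}

interface Index {
  // Returns an accelerated query, or null if this index cannot help
  // with the requested distance function.
  KnnQuery tryAccelerate(String distanceName);
}

class SketchDatabase {
  private final List<Index> indexes = new ArrayList<>();

  void addIndex(Index index) {
    indexes.add(index);
  }

  // optimizedOnly plays the role of the HINT_OPTIMIZED_ONLY hint.
  KnnQuery getKnnQuery(String distanceName, boolean optimizedOnly) {
    for (Index index : indexes) { // ask each index, in insertion order
      KnnQuery query = index.tryAccelerate(distanceName);
      if (query != null) {
        return query; // the first index offering acceleration wins
      }
    }
    if (optimizedOnly) {
      return null; // the caller only wanted accelerated queries
    }
    return () -> "linear scan"; // fallback when no index can help
  }
}

class IndexDispatchDemo {
  public static void main(String[] args) {
    SketchDatabase db = new SketchDatabase();
    // A toy index that only supports a distance called "euclidean".
    db.addIndex(name -> {
      if (name.equals("euclidean")) {
        KnnQuery accelerated = () -> "index query";
        return accelerated;
      }
      return null;
    });
    System.out.println(db.getKnnQuery("euclidean", false).describe());
    System.out.println(db.getKnnQuery("cosine", false).describe());
    System.out.println(db.getKnnQuery("cosine", true));
  }
}
```

An algorithm that can only run efficiently with an index would pass the optimized-only hint and handle the null result itself.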

M-Trees can work with any numeric distance, but if the distance is not metric the results may be incorrect. An error should be reported if a distance function does not report isMetric() as true.

R-Trees can work with any distance function that implements SpatialPrimitiveDistanceFunction, which essentially means implementing a lower bound point-to-rectangle distance. A lower bound can be found for many distance functions, but effectiveness can vary. For example, angular distances will benefit much less from the rectangular pages the R-tree uses.
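As an illustration of what a lower-bound point-to-rectangle distance means, here is a minimal sketch for the Euclidean case. The class and method names are made up for this example, and rectangles are passed as plain min/max corner arrays; it is not the actual SpatialPrimitiveDistanceFunction interface.

```java
// Illustrative only: the smallest possible Euclidean distance between a
// point and any point inside an axis-aligned rectangle (an R-tree page).
class MinDistSketch {
  static double minDist(double[] point, double[] min, double[] max) {
    double sum = 0.0;
    for (int d = 0; d < point.length; d++) {
      double delta = 0.0; // stays 0 if the point lies within [min, max] in this dimension
      if (point[d] < min[d]) {
        delta = min[d] - point[d];
      } else if (point[d] > max[d]) {
        delta = point[d] - max[d];
      }
      sum += delta * delta;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    double[] min = {0, 0};
    double[] max = {1, 1};
    System.out.println(minDist(new double[] {0.5, 0.5}, min, max)); // point inside the rectangle
    System.out.println(minDist(new double[] {4, 5}, min, max)); // point outside the rectangle
  }
}
```

Because no object stored in the page can be closer than this value, a kNN search can skip every page whose lower bound already exceeds the current k-th nearest distance.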

As for the run method: the preferred signature for usual vector-space methods is

 YourResultType run(Database database, Relation<V> relation)

As of now, the database can actually be obtained via relation.getDatabase(), but this may change in the future. There are a number of situations where this is problematic, and some situations where it currently can't be easily removed, unfortunately. Anyway, this is the explicit form, which is convenient for running the algorithms from Java code, i.e. it allows me to specify which relation to use, instead of having to use a database where this is the only appropriate relation (so it gets chosen automatically).

I do have plans to make this even more explicit in the long run, adding explicit support for choosing a data subset to process, and maybe also the queries. The abstract parent run method would then take care of this. An automatic optimizer would rely on this: it would first query all algorithms to be run for their requirements, including query requirements. Based on the queries, data set, memory available etc., the optimizer could then choose appropriate indexes and pass the algorithm the appropriate query methods. To keep the run signature simple, it will likely be handled via some Instance classes and more use of the factory pattern instead. But don't worry about it now.

If you want to understand why we need this, have a look at e.g. geospatial outlier detection algorithms. The signature used by SLOM for example is:

 OutlierResult run(Database database, Relation<N> spatial, Relation<O> relation)

i.e. SLOM uses two relations. The first relation is the spatial relationship of the instances, e.g. geographic positions. The second relation is the actual data, e.g. measurements. The geographic positions are used to determine which instances are expected to be similar (but these could also be e.g. Polygons!), while the second relation specifies the data that is then actually compared for similarity.
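To show why two separate relations are useful, here is a deliberately simplified sketch. It is NOT the SLOM algorithm: it only picks each object's k nearest neighbors by position (the "spatial" relation) and then compares the object's measurement (the "data" relation) with the mean measurement of those neighbors. The class name, method name and the plain-array representation of the relations are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.Comparator;

// Simplified two-relation outlier score; not SLOM itself.
class TwoRelationSketch {
  static double[] scores(double[][] positions, double[] values, int k) {
    int n = values.length;
    double[] out = new double[n];
    for (int i = 0; i < n; i++) {
      final int self = i;
      Integer[] order = new Integer[n];
      for (int j = 0; j < n; j++) {
        order[j] = j;
      }
      // Neighborhood is decided by the spatial relation only.
      Arrays.sort(order, Comparator.comparingDouble(j -> sqDist(positions[self], positions[j])));
      double sum = 0.0;
      int used = 0;
      for (int idx = 0; idx < n && used < k; idx++) {
        if (order[idx] == self) {
          continue; // do not count the object as its own neighbor
        }
        sum += values[order[idx]];
        used++;
      }
      // The comparison itself uses the data relation only.
      out[i] = used == 0 ? 0.0 : Math.abs(values[i] - sum / used);
    }
    return out;
  }

  static double sqDist(double[] a, double[] b) {
    double sum = 0.0;
    for (int d = 0; d < a.length; d++) {
      sum += (a[d] - b[d]) * (a[d] - b[d]);
    }
    return sum;
  }

  public static void main(String[] args) {
    double[][] positions = {{0, 0}, {0, 1}, {1, 0}, {10, 10}};
    double[] values = {1, 1, 1, 9};
    System.out.println(Arrays.toString(scores(positions, values, 2)));
  }
}
```

With a single relation, positions and measurements would have to be mixed into one vector, and the neighborhood could no longer be defined independently of the values being compared.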

Source: https://stackoverflow.com/questions/19338627/how-can-i-use-the-index-structures-in-elki

Tags

database

cluster-analysis

outliers

r-tree

elki