Running clustering algorithms in ELKI

Submitted by 无人久伴 on 2019-12-03 21:02:40

We do appreciate documentation contributions! (Update: I have turned this post into a new ELKI tutorial entry for now.)

ELKI advocates against embedding it in other Java applications, for a number of reasons. This is why we recommend using the MiniGUI (or the command line it constructs). Custom code is best added as a custom ResultHandler, for example, or by using the ResultWriter and parsing the resulting text files.
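
For the text-file route, a reader can be quite small. The sketch below is plain Java without any ELKI dependency. It assumes that the result files are plain text, that lines starting with "#" are comments, and that a ".gz" suffix marks files written with -out.gzip (as in the shell loop later in this post). The class and method names are made up for illustration, and the file layout should be checked against your own output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

public class ResultFileReader {
    /**
     * Returns the non-empty, non-comment lines of a result text file.
     * Assumption: comment lines start with '#'; ".gz" files are gzip-compressed.
     */
    static List<String> readDataLines(Path file) throws IOException {
        boolean gz = file.getFileName().toString().endsWith(".gz");
        List<String> data = new ArrayList<>();
        try (InputStream raw = Files.newInputStream(file);
             InputStream in = gz ? new GZIPInputStream(raw) : raw;
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                    continue; // skip blank lines and (assumed) comment lines
                }
                data.add(trimmed);
            }
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ResultFileReader <result file> [<result file> ...]
        for (String name : args) {
            List<String> lines = readDataLines(Path.of(name));
            System.out.println(name + ": " + lines.size() + " data lines");
        }
    }
}
```

Since this only touches files on disk, the analysis code does not need ELKI on the classpath and is not affected by API changes.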

If you really want to embed it in your code (there are a number of situations where it is useful, in particular when you need multiple relations, and want to evaluate different index structures against each other), here is the basic setup for getting a Database and Relation:

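// Imports (package names as in the ELKI releases this snippet was written for;
// they may differ in other versions):
import de.lmu.ifi.dbs.elki.data.DoubleVector;
import de.lmu.ifi.dbs.elki.data.LabelList;
import de.lmu.ifi.dbs.elki.data.type.TypeUtil;
import de.lmu.ifi.dbs.elki.database.Database;
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase;
import de.lmu.ifi.dbs.elki.database.relation.Relation;
import de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection;
import de.lmu.ifi.dbs.elki.utilities.ClassGenericsUtil;
import de.lmu.ifi.dbs.elki.utilities.optionhandling.parameterization.ListParameterization;

// Set up parameters (only the non-default ones need to be listed):
ListParameterization params = new ListParameterization();
params.addParameter(FileBasedDatabaseConnection.INPUT_ID, filename);
// Add other parameters for the database here!

// Instantiate the database:
Database db = ClassGenericsUtil.parameterizeOrAbort(
    StaticArrayDatabase.class,
    params);
// Don't forget this, it will load the actual data...
db.initialize();

// Fetch the vector and label relations by their type:
Relation<DoubleVector> vectors = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
Relation<LabelList> labels = db.getRelation(TypeUtil.LABELLIST);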

If you want to write more general code, use NumberVector<?>.

Why we (currently) do not recommend using ELKI as a "library":

  1. The API is still changing a lot. We keep on adding options, and we cannot (yet) provide a stable API. The command line / MiniGUI / Parameterization is much more stable because of the handling of default values: the parameterization only lists the non-default parameters, so you will only notice a change if one of these is affected.

    In the code example above, note that I also used this pattern. A change to the parsers, database etc. will likely not affect this program!

  2. Memory usage: data mining is quite memory intensive. If you use the MiniGUI or command line, you get a good cleanup when the task is finished. If you invoke it from Java, chances are really high that you keep some reference somewhere and end up leaking lots of memory. So do not use the above pattern without ensuring that the objects are properly cleaned up when you are done!

    By running ELKI from the command line, you get two things for free:

    1. no memory leaks. When the task is finished, the process quits and frees all memory.

    2. no need to rerun the algorithm for the same data. Subsequent analysis can work on the stored results instead.

  3. ELKI is not designed as an embeddable library, for good reasons. ELKI has tons of options and functionality, and this comes at a price in runtime (although it can easily outperform R and Weka, for example!), in memory usage, and in particular in code complexity. ELKI was designed for research in data mining algorithms, not for making them easy to include in arbitrary applications. Instead, if you have a particular problem, you should use ELKI to find out which approach works well, then reimplement that approach in an optimized manner for your problem.
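
If your application is written in Java anyway, you can still get the clean-process behaviour by starting ELKI as a separate JVM. The following is a minimal sketch using only the standard library. The parameters mirror the shell loop further down in this post, while the class name, the method names and the jar, input and output paths are placeholders of mine.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ElkiLauncher {
    /** Builds the command line for one run, using the parameters from this post. */
    static List<String> buildCommand(String elkiJar, String input, int k, String outDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(elkiJar);
        cmd.add("KDDCLIApplication");
        cmd.add("-dbc.in");
        cmd.add(input);
        cmd.add("-algorithm");
        cmd.add("clustering.kmeans.KMedoidsEM");
        cmd.add("-kmeans.k");
        cmd.add(Integer.toString(k));
        cmd.add("-resulthandler");
        cmd.add("ResultWriter");
        cmd.add("-out.gzip");
        cmd.add("-out");
        cmd.add(outDir);
        return cmd;
    }

    /** Runs the command in a child process and returns its exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        // inheritIO() forwards the child's stdout/stderr to this process.
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        // All memory used by the child is released when it exits.
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ElkiLauncher <elki.jar> <input file>");
            return;
        }
        for (int k = 3; k <= 39; k++) {
            int exit = run(buildCommand(args[0], args[1], k, "output/k-" + k));
            if (exit != 0) {
                System.err.println("run for k=" + k + " exited with code " + exit);
            }
        }
    }
}
```

The calling application only ever holds a list of strings and an exit code, so there is no ELKI object it could keep a stray reference to.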

Best ways of using ELKI

Here are some tips and tricks:

  1. Use the MiniGUI to build a command line. Note that the logging window of the "GUI" shows the corresponding command line parameters. Running ELKI from the command line is easy to script, and can easily be distributed to multiple computers, e.g. via Grid Engine.

    #!/bin/bash
    for k in $( seq 3 39 ); do
        java -jar elki.jar KDDCLIApplication \
            -dbc.in whatever \
            -algorithm clustering.kmeans.KMedoidsEM \
            -kmeans.k $k \
            -resulthandler ResultWriter -out.gzip \
            -out output/k-$k 
    done
    
  2. Use indexes. For many algorithms, index structures can make a huge difference! (But you need to research which indexes can be used with which algorithms!)

  3. Consider using the extension points such as ResultWriter. It may be easiest for you to hook into this API, then use ResultUtil to select the results that you want to analyze or output in your own preferred format:

    List<Clustering<? extends Model>> clusterresults =
        ResultUtil.getClusteringResults(result);
    
  4. To identify objects, use labels and a LabelList relation. The default parser will do this when it sees text alongside the numerical attributes, i.e. a file such as

    1.0 2.0 3.0 ObjectLabel1
    

    will make it easy to identify the object by its label!
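
To make the label convention from tip 4 concrete, here is a small standalone sketch of splitting such a line into numeric attributes and labels. This is not ELKI's parser, just an illustration of the idea under the assumption that columns are whitespace-separated and that any token which does not parse as a number is a label. The class name is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LabeledLine {
    final double[] values;
    final List<String> labels;

    LabeledLine(double[] values, List<String> labels) {
        this.values = values;
        this.labels = labels;
    }

    /** Splits a whitespace-separated line into numeric values and text labels. */
    static LabeledLine parse(String line) {
        List<Double> numbers = new ArrayList<>();
        List<String> labels = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for an empty input line
            }
            try {
                numbers.add(Double.parseDouble(token));
            } catch (NumberFormatException e) {
                labels.add(token); // not a number, so treat it as a label
            }
        }
        double[] values = new double[numbers.size()];
        for (int i = 0; i < values.length; i++) {
            values[i] = numbers.get(i);
        }
        return new LabeledLine(values, labels);
    }

    public static void main(String[] args) {
        LabeledLine parsed = parse("1.0 2.0 3.0 ObjectLabel1");
        // prints: [1.0, 2.0, 3.0] [ObjectLabel1]
        System.out.println(Arrays.toString(parsed.values) + " " + parsed.labels);
    }
}
```

Note that Double.parseDouble also accepts tokens such as "NaN" or "1d", so labels should not look like numbers if you rely on this kind of detection.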

UPDATE: See ELKI tutorial created out of this post for updates.

ELKI's documentation is pretty sparse (I don't know why they don't include a simple "hello world" program in the examples).

You could try Java-ML. Its documentation is a bit more user friendly, and it does have K-medoid.

Clustering example with Java-ML | http://java-ml.sourceforge.net/content/clustering-basics

K-medoid | http://java-ml.sourceforge.net/api/0.1.7/
