How to merge two sets of weka Instances together

Question

Currently, I'm copying one instance at a time from one dataset to the other. Is there a way to do this so that string mappings remain intact? mergeInstances works horizontally (it joins attributes); is there an equivalent vertical merge that appends rows?

This is one step of a loop I use to read datasets of the same structure from multiple arff files into one large dataset. There has got to be a simpler way.

Instances iNew = new ConverterUtils.DataSource(name).getDataSet();
for (int i = 0; i < iNew.numInstances(); i++) {
    Instance nInst = iNew.instance(i);
    inst.add(nInst);
}

Answer 1:

Why not make a new ARFF file which has the data from both of the originals? A simple

cat 1.arff > tmp.arff
tail -n+20 2.arff >> tmp.arff

where 20 is replaced by the line number of the first data row in 2.arff, that is, the length of its header plus one (tail -n+K starts printing at line K). This produces a new arff file with all of the desired instances, and you can read this new file with your existing code:

Instances iNew = new ConverterUtils.DataSource(name).getDataSet();
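The cat/tail approach above depends on counting header lines by hand. As a minimal sketch of the same idea in plain Java (no Weka calls), the code below looks for the @data marker instead of a fixed line number. The class and method names (ArffConcat, firstDataLine, concat, concatFiles) are made up for illustration, and the sketch assumes that @data sits on a line of its own and that both files really do share the same header; like cat/tail, it does not verify that.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka.
final class ArffConcat {

    /** Index of the first line after the "@data" marker, or -1 if there is none. */
    static int firstDataLine(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            // Compare case-insensitively so "@DATA" is found as well.
            if (lines.get(i).trim().equalsIgnoreCase("@data")) {
                return i + 1;
            }
        }
        return -1;
    }

    /** Keeps all of 'first' and appends only the data rows of 'second'. */
    static List<String> concat(List<String> first, List<String> second) {
        int start = firstDataLine(second);
        if (start < 0) {
            throw new IllegalArgumentException("second file has no @data line");
        }
        List<String> out = new ArrayList<>(first);
        out.addAll(second.subList(start, second.size()));
        return out;
    }

    /** File-based wrapper around concat(). */
    static void concatFiles(Path a, Path b, Path target) throws IOException {
        List<String> merged = concat(
                Files.readAllLines(a, StandardCharsets.UTF_8),
                Files.readAllLines(b, StandardCharsets.UTF_8));
        Files.write(target, merged, StandardCharsets.UTF_8);
    }
}
```

Calling ArffConcat.concatFiles with the paths of 1.arff, 2.arff and tmp.arff would then play the role of the two shell commands. Because the header of the second file is simply discarded, nominal values that are declared only in 2.arff would be lost, so this is only safe when the headers are identical.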

You could also invoke Weka from the command line, as described here: http://old.nabble.com/how-to-merge-two-data-file-a.arff-and-b.arff-into-one-data-list--td22890856.html

java weka.core.Instances append filename1 filename2 > output-file

However, the documentation at http://weka.sourceforge.net/doc.dev/weka/core/Instances.html#main%28java.lang.String lists no function that appends multiple arff files natively from within your Java code. As of Weka 3.7.6, the code that appends two arff files is this:

// read two files, append them and print result to stdout
else if ((args.length == 3) && (args[0].toLowerCase().equals("append"))) {
    DataSource source1 = new DataSource(args[1]);
    DataSource source2 = new DataSource(args[2]);
    String msg = source1.getStructure().equalHeadersMsg(source2.getStructure());
    if (msg != null)
        throw new Exception("The two datasets have different headers:\n" + msg);
    Instances structure = source1.getStructure();
    System.out.println(source1.getStructure());
    while (source1.hasMoreElements(structure))
        System.out.println(source1.nextElement(structure));
    structure = source2.getStructure();
    while (source2.hasMoreElements(structure))
        System.out.println(source2.nextElement(structure));
}

Thus it looks like Weka itself simply iterates through all of the instances in a data set and prints them, the same process your code uses.

Answer 2:

If you want a fully automated method that also copies string and nominal attributes properly, you can use the following function:

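import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public static Instances merge(Instances data1, Instances data2)
    throws Exception
{
    // Record which attributes are string or nominal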
    int asize = data1.numAttributes();
    boolean strings_pos[] = new boolean[asize];
    for(int i=0; i<asize; i++)
    {
        Attribute att = data1.attribute(i);
        strings_pos[i] = ((att.type() == Attribute.STRING) ||
                          (att.type() == Attribute.NOMINAL));
    }

    // Create a new dataset
    Instances dest = new Instances(data1);
    dest.setRelationName(data1.relationName() + "+" + data2.relationName());

    DataSource source = new DataSource(data2);
    Instances instances = source.getStructure();
    Instance instance = null;
    while (source.hasMoreElements(instances)) {
        instance = source.nextElement(instances);
        dest.add(instance);

        // Copy string attributes
        for(int i=0; i<asize; i++) {
            if(strings_pos[i]) {
                dest.instance(dest.numInstances()-1)
                    .setValue(i,instance.stringValue(i));
            }
        }
    }

    return dest;
}

Please note that the following conditions must hold (they are not checked in the function):

The datasets must have the same attribute structure (number of attributes, type of attributes)
The class index has to be the same
The nominal values have to correspond exactly

To rename the nominal values of data2 on the fly so that they match those of data1, you can use:

data2.renameAttributeValue(
    data2.attribute("att_name_in_data2"),
    "att_value_in_data2",
    "att_value_in_data1");

Answer 3:

Another possible solution is to use addAll from java.util.AbstractCollection, since Instances inherits it.

instances1.addAll(instances2);
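It may help to see what the inherited addAll actually does. In the JDK, AbstractCollection.addAll walks the argument and calls add once per element, and AbstractList.add(E) forwards to add(int, E). The sketch below shows this with a small stand-in list (CountingList is a hypothetical class, not Weka's Instances); it assumes that Instances inherits addAll unchanged, which is not confirmed here.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

// Stand-in for a list type that inherits addAll from AbstractCollection.
final class CountingList extends AbstractList<String> {
    private final List<String> items = new ArrayList<>();
    int addCalls = 0; // how many times add(int, String) was invoked

    @Override
    public String get(int index) {
        return items.get(index);
    }

    @Override
    public int size() {
        return items.size();
    }

    // AbstractList.add(E) forwards here, so every appended element passes through.
    @Override
    public void add(int index, String element) {
        addCalls++;
        items.add(index, element);
    }
}
```

After calling addAll with a three-element collection on a new CountingList, addCalls is 3. Under the stated assumption, instances1.addAll(instances2) is therefore a shorter spelling of the loop in the question rather than a different mechanism, so whatever add does with string mappings applies to addAll as well.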

Answer 4:

I've just shared an extended weka.core.Instances class with methods like innerJoin, leftJoin, fullJoin, update and union.

table1.makeIndex(table1.attribute("Continent_ID"));
table2.makeIndex(table2.attribute("Continent_ID"));
Instances result = table1.leftJoin(table2);

The Instances can have different numbers of attributes; levels of NOMINAL and STRING variables are merged together if necessary.

Sources and some examples are here on GitHub: weka.join.

Source: https://stackoverflow.com/questions/10771558/how-to-merge-two-sets-of-weka-instances-together

Tags: java, merge, set, weka