How to merge two sets of weka Instances together

Submitted by 安稳与你 on 2019-12-06 05:44:12

Why not make a new ARFF file which has the data from both of the originals? A simple

cat 1.arff > tmp.arff
tail -n+20 2.arff >> tmp.arff

where 20 is replaced by the line number of the first data row in 2.arff, i.e. one more than the number of lines in its ARFF header (tail -n +N starts output at line N). This produces a new ARFF file with all of the desired instances, which you can read with your existing code:

Instances iNew = new ConverterUtils.DataSource(name).getDataSet();
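If counting header lines by hand is error-prone, the same concatenation can be sketched in plain Java by looking for the `@data` marker instead. This is only a sketch under assumptions: the class and method names (`ArffAppend`, `dataLineIndex`, `append`) are made up for illustration, the marker is assumed to sit alone on its own line, and, like the `cat`/`tail` approach, nothing checks that the two headers actually match.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka: plain-text concatenation of two ARFF files.
class ArffAppend {

    // Index of the line holding the "@data" marker (case-insensitive), or -1 if absent.
    static int dataLineIndex(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).trim().equalsIgnoreCase("@data")) {
                return i;
            }
        }
        return -1;
    }

    // All lines of the first file, followed by the non-blank lines
    // that come after "@data" in the second file.
    static List<String> append(List<String> first, List<String> second) {
        int marker = dataLineIndex(second);
        if (marker < 0) {
            throw new IllegalArgumentException("second file has no @data line");
        }
        List<String> merged = new ArrayList<>(first);
        for (String row : second.subList(marker + 1, second.size())) {
            if (!row.trim().isEmpty()) {
                merged.add(row);
            }
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: ArffAppend 1.arff 2.arff tmp.arff");
            return;
        }
        List<String> merged = append(
                Files.readAllLines(Paths.get(args[0])),
                Files.readAllLines(Paths.get(args[1])));
        Files.write(Paths.get(args[2]), merged);
    }
}
```

The resulting file can then be loaded with the DataSource line shown above.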

You could also invoke weka on the command line using this documentation: http://old.nabble.com/how-to-merge-two-data-file-a.arff-and-b.arff-into-one-data-list--td22890856.html

java weka.core.Instances append filename1 filename2 > output-file 

However, there is no function in the documentation http://weka.sourceforge.net/doc.dev/weka/core/Instances.html#main%28java.lang.String that lets you append multiple ARFF files natively from within your Java code. As of Weka 3.7.6, the code in Instances.main that handles the append command is this:

// read two files, append them and print result to stdout
else if ((args.length == 3) && (args[0].toLowerCase().equals("append"))) {
  DataSource source1 = new DataSource(args[1]);
  DataSource source2 = new DataSource(args[2]);
  String msg = source1.getStructure().equalHeadersMsg(source2.getStructure());
  if (msg != null)
    throw new Exception("The two datasets have different headers:\n" + msg);
  Instances structure = source1.getStructure();
  System.out.println(source1.getStructure());
  while (source1.hasMoreElements(structure))
    System.out.println(source1.nextElement(structure));
  structure = source2.getStructure();
  while (source2.hasMoreElements(structure))
    System.out.println(source2.nextElement(structure));
}

Thus it looks like Weka itself simply iterates through all of the instances in a data set and prints them, the same process your code uses.

mountrix

If you want a fully automated method that also copies string and nominal attributes properly, you can use the following function:

// Imports needed by the method below
import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public static Instances merge(Instances data1, Instances data2)
    throws Exception
{
    // Record which attributes are string or nominal: their values are
    // stored as indices into the header, so they are copied by label below
    int asize = data1.numAttributes();
    boolean[] strings_pos = new boolean[asize];
    for(int i=0; i<asize; i++)
    {
        Attribute att = data1.attribute(i);
        strings_pos[i] = ((att.type() == Attribute.STRING) ||
                          (att.type() == Attribute.NOMINAL));
    }

    // Create the new dataset as a copy of data1
    Instances dest = new Instances(data1);
    dest.setRelationName(data1.relationName() + "+" + data2.relationName());

    // Append the instances of data2 one by one
    DataSource source = new DataSource(data2);
    Instances instances = source.getStructure();
    Instance instance = null;
    while (source.hasMoreElements(instances)) {
        instance = source.nextElement(instances);
        dest.add(instance);

        // Re-set string and nominal values by label on the copy just added
        for(int i=0; i<asize; i++) {
            if(strings_pos[i]) {
                dest.instance(dest.numInstances()-1)
                    .setValue(i,instance.stringValue(i));
            }
        }
    }

    return dest;
}

Please note that the following conditions should hold (they are not checked in the function):

  • Datasets must have the same attribute structure (number of attributes, type of attributes)
  • Class index has to be the same
  • Nominal values have to exactly correspond
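
When the data still lives in ARFF files, the first and third conditions can be roughly pre-checked by comparing the attribute declarations as text. The following is a plain-Java sketch with hypothetical names (`ArffHeaderCheck`, `attributeLines`, `mismatch`), not a Weka API. It assumes one `@attribute` declaration per line and compares the trimmed lines literally, so harmless spacing or keyword-case differences are reported as mismatches. It cannot see the class index, which is not stored in the ARFF header. For Instances already in memory, the equalHeadersMsg call in the Weka excerpt above is the more direct check.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka: textual comparison of ARFF headers.
class ArffHeaderCheck {

    // Trimmed "@attribute" declarations that appear before the "@data" marker.
    static List<String> attributeLines(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String t = line.trim();
            if (t.equalsIgnoreCase("@data")) {
                break;
            }
            if (t.toLowerCase().startsWith("@attribute")) {
                out.add(t);
            }
        }
        return out;
    }

    // Returns null when the declarations match line by line,
    // otherwise a message describing the first difference found.
    static String mismatch(List<String> file1, List<String> file2) {
        List<String> a = attributeLines(file1);
        List<String> b = attributeLines(file2);
        if (a.size() != b.size()) {
            return "different number of attributes: " + a.size() + " vs " + b.size();
        }
        for (int i = 0; i < a.size(); i++) {
            if (!a.get(i).equals(b.get(i))) {
                return "attribute " + (i + 1) + " differs: "
                        + a.get(i) + " vs " + b.get(i);
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ArffHeaderCheck 1.arff 2.arff");
            return;
        }
        String msg = mismatch(
                Files.readAllLines(Paths.get(args[0])),
                Files.readAllLines(Paths.get(args[1])));
        System.out.println(msg == null ? "headers match" : msg);
    }
}
```

Returning null on success mirrors the convention of equalHeadersMsg in the excerpt above, so the result can be tested the same way.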

To change the nominal values of data2 on the fly so that they match those of data1, you can use:

data2.renameAttributeValue(
    data2.attribute("att_name_in_data2"),
    "att_value_in_data2",
    "att_value_in_data1");

Another possible solution is to use addAll from java.util.AbstractCollection, since Instances inherits it.

instances1.addAll(instances2);

I've just shared an extended weka.core.Instances class with methods like innerJoin, leftJoin, fullJoin, update and union.

table1.makeIndex(table1.attribute("Continent_ID"));
table2.makeIndex(table2.attribute("Continent_ID"));
Instances result = table1.leftJoin(table2);

The Instances can have different numbers of attributes; levels of NOMINAL and STRING variables are merged together if necessary.

Sources and some examples are here on GitHub: weka.join.
