Question
I am working with a dataset that has more than 10,000 dimensions. To use Weka I need to convert a text file into ARFF format, but because there are so many attributes, the file is too large even in the sparse ARFF format. Is there a method, similar to the sparse format for data, that avoids writing out so many attribute declarations in the header of the ARFF file?
For example:
@attribute A1 NUMERIC
@attribute A2 NUMERIC
...
...
@attribute A10000 NUMERIC
Answer 1:
I wrote an AWK script that converts lines like the following (from a TXT file) to ARFF format.
Source file, example.txt:
Att_0 | Att_1 | Att_2 | ... | Att_n
1 | 2 | 3 | ... | 999
My script (to_arff) is below; change the FS value to match the separator used in your TXT file:
#!/usr/bin/awk -f
# Usage: ./<script>.awk data.txt > data.arff
BEGIN {
    # Input separator: a "|" plus any spaces around it, so the fields
    # do not keep the padding that surrounds the delimiter
    FS = " *[|] *";
    # WEKA separator
    separator = ",";
}
# The first line holds the attribute names
NR == 1 {
    # WEKA headers
    split(FILENAME, relation, ".");
    # the relation's name is the source file's name (without extension)
    print "@RELATION " relation[1] "\n";
    # attributes are "numeric" by default
    # types available: numeric, <nominal> {n1, n2, ..., nN}, string and date [<date-format>]
    for (i = 1; i <= NF; i++) {
        print "@ATTRIBUTE " $i " NUMERIC";
    }
    print "\n@DATA";
}
# Every other line is a data row
NR > 1 {
    s = "";
    for (i = 1; i <= NF; i++) {
        if (i > 1)
            s = s separator;
        s = s $i;
    }
    print s;
}
Output:
@RELATION example

@ATTRIBUTE Att_0 NUMERIC
@ATTRIBUTE Att_1 NUMERIC
@ATTRIBUTE Att_2 NUMERIC
...
@ATTRIBUTE Att_n NUMERIC

@DATA
1,2,3,...,999
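For data as wide as in the question, the same idea can be adapted to generate the attribute names and to write the rows in sparse ARFF form, where each row is a list of "index value" pairs with 0-based indices and zero values left out. As far as I know ARFF has no shorthand for declaring a range of attributes, so the header still needs one @ATTRIBUTE line per column, but it does not have to be typed by hand. The following is only a sketch under assumptions: the input is headerless, comma-separated, purely numeric, and has no spaces around the values. The function name to_sparse_arff, its two arguments, and the generated names A1..An are made up for illustration.

```shell
#!/bin/sh
# to_sparse_arff RELATION FILE
# Hypothetical helper: converts headerless, comma-separated numeric rows
# into sparse ARFF, generating attribute names A1..An from the column count.
to_sparse_arff() {
    awk -F',' -v rel="$1" '
    NR == 1 {
        # Header: one @ATTRIBUTE line per column of the first row.
        print "@RELATION " rel
        print ""
        for (i = 1; i <= NF; i++)
            print "@ATTRIBUTE A" i " NUMERIC"
        print ""
        print "@DATA"
    }
    {
        # Sparse row: {index value, ...} with 0-based indices; zeros omitted.
        row = ""
        for (i = 1; i <= NF; i++) {
            if ($i + 0 != 0)
                row = row (row == "" ? "" : ", ") (i - 1) " " $i
        }
        print "{" row "}"
    }' "$2"
}
```

It could be called as `to_sparse_arff mydata data.csv > data.arff`. With this sketch an input row `0,5,0` becomes `{1 5}` and `1,0,2` becomes `{0 1, 2 2}`, so mostly-zero rows shrink a lot while the header size stays the same.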
Source: https://stackoverflow.com/questions/9234232/too-many-attributes-for-arff-format-in-weka