weka | 易学教程

How to represent text for classification in weka?

Question: Can you please tell me how to represent the attributes and the class for text classification in Weka? Which attribute should I use for classification: word frequency or just the word? What would a possible ARFF structure look like? Can you give me several lines of example of that structure? Thank you very much in advance.

Answer 1: One of the easiest alternatives is to start with an ARFF file for a two-class problem like:

@relation corpus
@attribute text string
@attribute class {pos,neg}
@data
'long text
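The ARFF layout in the answer above (one string attribute for the raw text, one nominal attribute for the class) can also be generated programmatically. The following is a minimal sketch in plain Java, without the Weka library; the class name `ArffTextSketch`, the helper names, and the sample documents are hypothetical, and the quoting rule (single quotes with backslash escapes) is an assumption about what the ARFF reader accepts.

```java
public class ArffTextSketch {

    // Quote a text value for the @data section.
    // Assumption: backslash-escaping of backslashes, single quotes and
    // newlines inside a single-quoted value is accepted by the ARFF reader.
    static String quote(String s) {
        return "'" + s.replace("\\", "\\\\")
                      .replace("'", "\\'")
                      .replace("\n", "\\n") + "'";
    }

    // Build an ARFF document with a string attribute "text" and a nominal
    // attribute "class". Each row of docs is {text, label}.
    static String toArff(String relation, String[] labels, String[][] docs) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute class {").append(String.join(",", labels)).append("}\n\n");
        sb.append("@data\n");
        for (String[] doc : docs) {
            sb.append(quote(doc[0])).append(',').append(doc[1]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample documents, for illustration only.
        String[][] docs = {
            {"a long positive text", "pos"},
            {"I don't like it", "neg"}
        };
        System.out.print(toArff("corpus", new String[] {"pos", "neg"}, docs));
    }
}
```

Keeping the text as a single string attribute leaves the choice between word presence and word frequency to a later filtering step rather than baking it into the file.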

Load a file in Resources with FileInputStream

Question: I know the safe way to open a file in the resources is:

InputStream is = this.getClass().getResourceAsStream("/path/in/jar/file.name");

The problem is that my file is a model for a decider in the Weka Wrapper package, and the Decider class has only one method:

public void load(File file) throws Exception

load takes the file and opens it as a FileInputStream. Do you see a workaround? I would really like to ship the model by putting it in the resources. I was thinking of creating a temporary file,
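The temporary-file idea mentioned in the question can be sketched with the standard library alone: copy the resource stream to a temp file and hand that `File` to whatever API insists on one. The class name `ResourceToTempFile`, the method name, and the `model-` prefix are hypothetical; the `Decider` call appears only as a comment because its behavior beyond the signature quoted above is not known here.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ResourceToTempFile {

    // Copy a stream (for example one returned by getResourceAsStream) into a
    // temporary file that is deleted when the JVM exits. The stream is closed.
    static File copyToTempFile(InputStream in, String suffix) {
        if (in == null) {
            // getResourceAsStream returns null when the resource is missing.
            throw new IllegalArgumentException("resource stream is null");
        }
        try (InputStream src = in) {
            Path tmp = Files.createTempFile("model-", suffix);
            tmp.toFile().deleteOnExit();
            Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp.toFile();
        } catch (IOException e) {
            // Wrapped so callers are not forced to handle a checked exception.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for: getClass().getResourceAsStream("/path/in/jar/file.name")
        InputStream is = new ByteArrayInputStream("demo".getBytes(StandardCharsets.UTF_8));
        File modelFile = copyToTempFile(is, ".model");
        System.out.println(modelFile.getAbsolutePath());
        // decider.load(modelFile);  // hypothetical call on the Decider from the question
    }
}
```

The cost of this approach is one extra copy of the model on disk per run; `deleteOnExit` cleans it up on normal JVM shutdown only.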

How to cluster an instance with Weka's DBSCAN?

Question: I've been trying to use the DBSCAN clusterer from Weka to cluster instances. From what I understand, I should be using the clusterInstance() method for this, but to my surprise, when I looked at the code of that method, the implementation appears to ignore the parameter:

/**
 * Classifies a given instance.
 *
 * @param instance The instance to be assigned to a cluster
 * @return int The number of the assigned cluster as an integer
 * @throws java.lang.Exception If instance could not be

Get prediction percentage in WEKA using own Java code and a model

Question:

Overview

I know that one can get the percentages of each prediction in a trained WEKA model through the GUI and command line options, as conveniently explained and demonstrated in the documentation article "Making predictions".

Predictions

I know that there are three ways documented to get these predictions:

command line
GUI
Java code/using the WEKA API, which I was able to do in the answer to "Get risk predictions in WEKA using own Java code"
this fourth one requires a generated WEKA .MODEL

How to deal with missing attribute values in C4.5 (J48) decision tree?

Question: What's the best way to handle missing feature attribute values with Weka's C4.5 (J48) decision tree? The problem of missing values occurs during both training and classification. If values are missing from training instances, am I correct in assuming that I place a '?' value for the feature? Suppose that I am able to successfully build the decision tree and then create my own tree code in C++ or Java from Weka's tree structure. During classification time, if I am trying to classify a new

How to implement proximity rules in tm dictionary for counting words?

Question:

Objective

I would like to count the number of times the word "love" appears in a document, but only if it isn't preceded by the word "not", e.g. "I love films" would count as one appearance whilst "I do not love films" would not count as an appearance.

Question

How would one proceed using the tm package?

R Code

Below is some self-contained code which I would like to modify to do the above.

require(tm)
# text vector
my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely
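Setting the tm package aside, the counting rule stated in the objective can be sketched with a regular expression. This is a plain Java illustration, not a tm solution; the class name `NegationAwareCount` is hypothetical, and two assumptions are made: "love" is matched as a whole word (so "lovely" is not counted), and "preceded by" means "not" followed by exactly one whitespace character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegationAwareCount {

    // Whole word "love" that is not directly preceded by the word "not"
    // and a single whitespace character. Matching is case-insensitive.
    private static final Pattern LOVE_NOT_NEGATED =
            Pattern.compile("(?<!\\bnot\\s)\\blove\\b", Pattern.CASE_INSENSITIVE);

    // Count the non-negated occurrences of "love" in one document.
    static int count(String text) {
        Matcher m = LOVE_NOT_NEGATED.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("I love films"));
        System.out.println(count("I do not love films"));
    }
}
```

The fixed-width lookbehind keeps the pattern simple; text with several spaces or a line break run between "not" and "love" would need whitespace normalisation first.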

Finding out wrongly classified instances when using WEKA

Question: I am using the GUI version of WEKA and classifying with Naive Bayes. Can anyone please tell me how to find out which instances are misclassified?

Answer 1:
1. Go to the Classify tab in the Weka Explorer.
2. Click "More options...".
3. Check "Output predictions".
4. Click OK.

Hope that helps.

Answer 2: I faced this very same problem earlier and I handle it just fine now. What I do is the following: Make one string attribute that assigns each instance a unique ID. I have assigned the names of the documents to each of my

Stanford classifier cross validation averaged or aggregate metrics

With Stanford Classifier it is possible to use cross validation by setting the options in the properties file, such as this for 10-fold cross validation: crossValidationFolds=10 printCrossValidationDecisions=true shuffleTrainingData=true shuffleSeed=1 Running this will output, per fold, the various metrics, such as precision, recall, Accuracy/micro-averaged F1 and Macro-averaged F1. Is there an option to get an averaged or otherwise aggregated score of all 10 Accuracy/micro-averaged F1 or all 10 Macro-averaged F1 as part of the output? In Weka, by default the output after 10-fold cross
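Whether Stanford Classifier offers such an option is not answered here. As a fallback, the per-fold numbers it prints can be combined by hand. The sketch below shows two common ways to do that in plain Java: averaging the per-fold scores, and pooling the raw counts across folds. The class name `FoldAggregation` and the sample numbers are hypothetical and do not come from any real run.

```java
public class FoldAggregation {

    // Mean of per-fold scores (for example the Accuracy or the
    // Macro-averaged F1 printed for each fold).
    static double mean(double[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("no fold scores given");
        }
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.length;
    }

    // Pooled accuracy: total correct decisions over total decisions,
    // which weights each fold by its size.
    static double pooledAccuracy(int[] correct, int[] total) {
        if (correct.length != total.length || total.length == 0) {
            throw new IllegalArgumentException("fold arrays must be non-empty and equal in length");
        }
        long c = 0;
        long t = 0;
        for (int i = 0; i < total.length; i++) {
            c += correct[i];
            t += total[i];
        }
        return (double) c / t;
    }

    public static void main(String[] args) {
        // Hypothetical per-fold results, for illustration only.
        double[] foldAccuracy = {0.80, 0.90, 0.85};
        int[] correct = {8, 9, 17};
        int[] total = {10, 10, 20};
        System.out.println("mean of folds: " + mean(foldAccuracy));
        System.out.println("pooled:        " + pooledAccuracy(correct, total));
    }
}
```

The two numbers agree when all folds have the same size and can differ when they do not, so it is worth stating which one is being reported.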

How to change attribute type to String (WEKA - CSV to ARFF)

I'm trying to make an SMS SPAM classifier using the WEKA library. I have a CSV file with "label" and "text" headings. When I use the code below, it creates an ARFF file with two attributes: @attribute label {ham,spam} @attribute text {'Go until jurong point','Ok lar...', etc.} Currently, it seems that the text attribute is formatted as a nominal attribute with each message's text as a value. But I need the text attribute to be a String attribute, not a list of all of the text from all instances. Having the text attribute as a String will allow me to use the StringToWordVector filter for

What is Class Index in WEKA?

Question: I have to use WEKA in my Java code for prediction. Basically I have to study a given piece of code and reuse it.

testdata.setClassIndex(data.numAttributes() - 1);

I am unable to understand what the above line means. What is a class index? testdata and data are Instances objects.

Answer 1: As outlined here, setClassIndex is used to define the attribute that will represent the class (for prediction purposes). Given that the index starts at zero, data.numAttributes() - 1 represents the last attribute of the