How to use WEKA in keyphrase extraction from text arguments

Question

I am working on a project, "key phrase extraction from text arguments". I first cleaned the input and then determined a list of candidate phrases (around 300 in total) using the Stanford Parser (POS tagging). Then I computed the feature values of every phrase. I followed these steps for every document in my dataset. How should I proceed now, i.e. how do I use WEKA to find keyphrases? How should I store the phrases and their feature values (TF×IDF) in WEKA? How do I measure the efficiency of the final project?

Answer 1:

WEKA handles text classification tasks (such as text categorization and clustering) simply and well when the instances are relatively long pieces of text (anything from tweets to documents) and the classes, when available, are non-overlapping tags (e.g. thematic classes like economy/sports/..., spam/legitimate email, positive/negative in sentiment analysis, etc.).

However, WEKA does not directly fit term classification tasks like Part-of-Speech Tagging, Word Sense Disambiguation, Named Entity Recognition or, in your case, keyphrase extraction. To apply WEKA, you need not only your original texts and the manually extracted keyphrases; you also have to decide on the attributes that make those pieces of text actual keyphrases. You have to inspect your examples and decide, for instance, that the parts of speech of the words in a keyphrase and of the surrounding words are important for guessing whether a piece of text is a keyphrase.

I strongly recommend that you take a look at the representation used in the datasets of the CoNLL NER shared tasks (CoNLL 2002 and 2003). Each word in a named entity is treated independently and marked as starting, in the middle of, or at the end of the named entity. Additionally, the features you can use are the actual words, the surrounding words, and their parts of speech.
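Applied to keyphrases, the same per-word labelling idea might look like the following minimal Python sketch. The tag names `B-KP`, `I-KP` and `O` and the function `tag_keyphrases` are assumptions made for illustration; they are not part of the CoNLL format or of WEKA. The sketch also assumes whitespace-tokenised text and case-insensitive exact matching against the manually extracted keyphrases.

```python
def tag_keyphrases(tokens, keyphrases):
    """Label each token as B-KP (first word of a keyphrase), I-KP (later word) or O."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Try longer phrases first so a long match is not blocked by a shorter one.
    for phrase in sorted(keyphrases, key=lambda p: -len(p.split())):
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            # Only tag spans that match exactly and are still unlabelled.
            if lowered[start:start + n] == words and all(tags[i] == "O" for i in span):
                tags[start] = "B-KP"
                for i in span[1:]:
                    tags[i] = "I-KP"
    return tags


tokens = "WEKA supports text classification and clustering".split()
for token, tag in zip(tokens, tag_keyphrases(tokens, ["text classification"])):
    print(token, tag)
```

Each token then becomes one instance whose class is its tag, which is the shape of problem WEKA's classifiers expect.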

For instance, take this excerpt from the CoNLL 2003 NER dataset:

   U.N.         NNP  I-NP  I-ORG 
   official     NN   I-NP  O 
   Ekeus        NNP  I-NP  I-PER 
   heads        VBZ  I-VP  O 
   for          IN   I-PP  O

Here the word "Ekeus" is an NNP, it is inside a noun phrase (I-NP), and it is a named entity of type "person" (I-PER). You can process this format to get an instance file in which you use the POS tags and the actual words in a two-word window:

@attribute word-2 string
@attribute word-1 string
@attribute word string
@attribute word+1 string
@attribute word+2 string
@attribute postag-2 {NNP, NN, ....} % The full list of available POS tags
@attribute postag-1 {NNP, NN, ....}
% ../..
@attribute named-entity-class {O, I-PER, I-ORG, ...} % The full list of possible NE tags

@data
"U.N.","official","Ekeus","heads","for",NNP,NN,NNP,VBZ,IN,I-PER
% ../..

As you can see, you have to decide which attributes you need for each word and build windows over those attributes.
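The windowing step could be sketched in Python roughly as follows. This is a minimal sketch under assumptions: the names `window_rows`, `pick` and `to_arff_row` and the padding value `<PAD>` are hypothetical, and the input is assumed to be already parsed into `(word, POS tag, class)` triples, one list per sentence. If a padding value is used, it would also have to be listed among the nominal POS values in the ARFF header.

```python
PAD = "<PAD>"  # hypothetical filler for window positions outside the sentence


def pick(values, j):
    """Return values[j], or the padding value when j falls outside the sentence."""
    return values[j] if 0 <= j < len(values) else PAD


def window_rows(sentence, size=2):
    """Yield one row per token: window words, then window POS tags, then the class."""
    words = [w for w, _, _ in sentence]
    tags = [p for _, p, _ in sentence]
    for i, (_, _, label) in enumerate(sentence):
        offsets = range(i - size, i + size + 1)
        yield [pick(words, j) for j in offsets] + [pick(tags, j) for j in offsets] + [label]


def to_arff_row(row, n_strings):
    """Quote the first n_strings values (string attributes); leave nominal values bare."""
    quoted = ['"' + v.replace("\\", "\\\\").replace('"', '\\"') + '"' for v in row[:n_strings]]
    return ",".join(quoted + row[n_strings:])


sentence = [
    ("U.N.", "NNP", "I-ORG"),
    ("official", "NN", "O"),
    ("Ekeus", "NNP", "I-PER"),
    ("heads", "VBZ", "O"),
    ("for", "IN", "O"),
]
for row in window_rows(sentence):
    print(to_arff_row(row, n_strings=5))
```

The row printed for "Ekeus" matches the `@data` line shown above; rows near the sentence boundaries contain the padding value instead of a neighbouring word or tag.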

Source: https://stackoverflow.com/questions/20002095/how-to-use-weka-in-keyphrase-extraction-from-text-arguments

Tags

weka