Does anyone have a Categorized XML Corpus Reader for NLTK?

Question

Has anyone written a Categorized XML Corpus reader for NLTK?

I'm working with the Annotated NYTimes corpus, which is an XML corpus. I can read the files with XMLCorpusReader, but I'd like to use some of NLTK's category functionality. There's a nice tutorial for subclassing NLTK readers. I can go ahead and write this myself, but I was hoping to save some time if someone has already done it.

If not I'll post what I've written.

Answer 1:

Here's a Categorized XML Corpus Reader for NLTK. It's based on this tutorial. This lets you use NLTK's category-based features on XML Corpora like the New York Times Annotated Corpus.

Call this file CategorizedXMLCorpusReader.py and import this as:

import imp
CatXMLReader = imp.load_source('CategorizedXMLCorpusReader','PATH_TO_THIS_FILE/CategorizedXMLCorpusReader.py')
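
Note that the imp module is deprecated and was removed in Python 3.12. If imp is not available, something along these lines with importlib should do the same job. This is only a sketch: load_module_from_path is a helper name I made up, not part of NLTK or the standard library.

```python
import importlib.util


def load_module_from_path(module_name, file_path):
    """Load a Python source file as a module, roughly like imp.load_source."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {module_name!r} from {file_path!r}")
    module = importlib.util.module_from_spec(spec)
    # Actually run the file's code inside the new module object.
    spec.loader.exec_module(module)
    return module


# Same placeholder path as above; replace PATH_TO_THIS_FILE with the real directory.
# CatXMLReader = load_module_from_path(
#     'CategorizedXMLCorpusReader',
#     'PATH_TO_THIS_FILE/CategorizedXMLCorpusReader.py')
```

If the file sits in a directory that is already on sys.path, a plain "import CategorizedXMLCorpusReader" is simpler and avoids all of this.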

You can then use this like any other NLTK Reader. For instance,

reader = CatXMLReader.CategorizedXMLCorpusReader('.../nltk_data/corpora/nytimes', file_ids, cat_file='PATH_TO_CATEGORIES_FILE')
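
For the categories file itself, my understanding is that NLTK's cat_file option expects one line per file: the file id, then its categories, all separated by a delimiter that defaults to a single space. Assuming that format, a small helper like the one below can generate the file from a dict. The helper name, file ids and category labels are all made up for illustration.

```python
def write_category_file(path, categories_by_fileid, delimiter=" "):
    """Write one line per file: the file id followed by its categories."""
    with open(path, "w", encoding="utf-8") as handle:
        for fileid, categories in sorted(categories_by_fileid.items()):
            handle.write(delimiter.join([fileid, *categories]) + "\n")


if __name__ == "__main__":
    # Hypothetical file ids and category labels, not taken from the real corpus.
    example = {
        "2007/01/01/0001.xml": ["sports"],
        "2007/01/01/0002.xml": ["politics", "business"],
    }
    write_category_file("cats.txt", example)
```

With the default delimiter, file ids and category names must not contain spaces. As far as I can tell, NLTK looks the categories file up relative to the corpus root, so it is probably safest to keep it inside the corpus directory.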

I'm still figuring NLTK out so any corrections or suggestions are welcome.

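# Categorized XML Corpus Reader

from nltk.corpus.reader import CategorizedCorpusReader, XMLCorpusReader


class CategorizedXMLCorpusReader(CategorizedCorpusReader, XMLCorpusReader):
    def __init__(self, *args, **kwargs):
        # Pass the kwargs dict itself (not **kwargs): CategorizedCorpusReader
        # reads the cat_* options from it and removes them, so only the
        # remaining arguments reach XMLCorpusReader.
        CategorizedCorpusReader.__init__(self, kwargs)
        XMLCorpusReader.__init__(self, *args, **kwargs)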
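    def _resolve(self, fileids, categories):
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            return self.fileids(categories)
        if fileids is None:
            # Neither argument given: fall back to every file in the corpus.
            return self.fileids()
        if isinstance(fileids, str):
            # Wrap a single file id so callers can always iterate over a list.
            return [fileids]
        return fileids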

    # All of the following methods call the corresponding method in XMLCorpusReader
    # with the value returned from _resolve(). We'll start with the plain text methods.
    def raw(self, fileids=None, categories=None):
        return XMLCorpusReader.raw(self, self._resolve(fileids, categories))

    def words(self, fileids=None, categories=None):
        # XMLCorpusReader.words works on one file at a time, so concatenate
        # the tokens from each resolved file.
        words = []
        for fileid in self._resolve(fileids, categories):
            words += XMLCorpusReader.words(self, fileid)
        return words

    # This returns a string of the text of the XML docs without any markup
    def text(self, fileids=None, categories=None):
        fileids = self._resolve(fileids, categories)
        pieces = []
        for fileid in fileids:
            # Element.getiterator() was removed in Python 3.9; iter() replaces it.
            for element in self.xml(fileid).iter():
                if element.text:
                    pieces.append(element.text)
        return "".join(pieces)

    # This returns all text for a specified xml field
    def fieldtext(self, fileids=None, categories=None):
        # NEEDS TO BE WRITTEN
        return

    def sents(self, fileids=None, categories=None):
        # Punkt tokenizes a string, so feed it the plain text, not the word list.
        from nltk.tokenize import PunktSentenceTokenizer
        text = self.text(fileids, categories)
        return PunktSentenceTokenizer().tokenize(text)

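    def paras(self, fileids=None, categories=None):
        # XMLCorpusReader has no paras() to delegate to, and paragraph markup is
        # corpus-specific, so fail explicitly instead of with an AttributeError.
        raise NotImplementedError('paras() needs corpus-specific paragraph markup')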
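
The fieldtext stub is still unwritten. One possible starting point, assuming a "field" means an XML element name, is a small ElementTree helper like this. The name field_text and the sample document are hypothetical, and the sample is not the real NYT schema.

```python
import xml.etree.ElementTree as ET


def field_text(root, tag):
    """Return the text of every element named `tag` under `root`, in document order."""
    pieces = []
    for element in root.iter(tag):
        # itertext() gathers the element's own text plus the text of nested children.
        pieces.append("".join(element.itertext()).strip())
    return pieces


if __name__ == "__main__":
    # Hypothetical document, for illustration only.
    doc = ET.fromstring(
        "<doc><title>Hello</title><body><p>One <b>two</b>.</p><p>Three.</p></body></doc>"
    )
    print(field_text(doc, "p"))
```

Inside the reader this could be called on the root element returned by self.xml(fileid) for each resolved file id, but that would mean adding a tag argument to fieldtext, which I have not tried against the actual corpus.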

Answer 2:

Sorry NAD, but posting it as a new question was the only way I found to discuss this code. I'm also using it and found a small bug when trying to use categories with the words() method. Details here: https://github.com/nltk/nltk/issues/250#issuecomment-5273102

Did you hit this problem before me? Also, have you made any further modifications that might make categories work? My e-mail is on my profile page if you want to talk about it off-SO :-)

Source: https://stackoverflow.com/questions/6849600/does-anyone-have-a-categorized-xml-corpus-reader-for-nltk

Tags

python

xml

nltk

corpus