Stanford Parser and NLTK

Asked by 既然无缘 on 2020-11-22 01:32

Is it possible to use Stanford Parser in NLTK? (I am not talking about Stanford POS.)

18 Answers
  •  孤街浪徒, 2020-11-22 02:36

    As of NLTK v3.3, users should avoid the Stanford NER and POS taggers from nltk.tag, and the Stanford tokenizer/segmenter from nltk.tokenize.

    Instead, use the new nltk.parse.corenlp.CoreNLPParser API.

    Please see https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK


    (To avoid a link-only answer, I've pasted the docs from the NLTK GitHub wiki below.)

    First, update your NLTK:

    pip3 install -U nltk  # Make sure the version is >= 3.3
    
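    You can confirm the installed version from Python (nltk exposes it as nltk.__version__):

```python
import nltk

# Should print 3.3 or newer for the CoreNLPParser API to be available.
print(nltk.__version__)
```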

    Then download the necessary CoreNLP packages:

    cd ~
    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
    unzip stanford-corenlp-full-2018-02-27.zip
    cd stanford-corenlp-full-2018-02-27
    
    # Get the Chinese model 
    wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2018-02-27-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties 
    
    # Get the Arabic model
    wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties 
    
    # Get the French model
    wget http://nlp.stanford.edu/software/stanford-french-corenlp-2018-02-27-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-french.properties 
    
    # Get the German model
    wget http://nlp.stanford.edu/software/stanford-german-corenlp-2018-02-27-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-german.properties 
    
    
    # Get the Spanish model
    wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-02-27-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-spanish.properties 
    

    English

    Still in the stanford-corenlp-full-2018-02-27 directory, start the server:

    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -preload tokenize,ssplit,pos,lemma,ner,parse,depparse \
    -status_port 9000 -port 9000 -timeout 15000 & 
    

    Then in Python:

    >>> from nltk.parse import CoreNLPParser
    
    # Lexical Parser
    >>> parser = CoreNLPParser(url='http://localhost:9000')
    
    # Parse tokenized text.
    >>> list(parser.parse('What is the airspeed of an unladen swallow ?'.split()))
    [Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]
    
    # Parse raw string.
    >>> list(parser.raw_parse('What is the airspeed of an unladen swallow ?'))
    [Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]
    
    # Neural Dependency Parser
    >>> from nltk.parse.corenlp import CoreNLPDependencyParser
    >>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
    >>> parses = dep_parser.parse('What is the airspeed of an unladen swallow ?'.split())
    >>> [[(governor, dep, dependent) for governor, dep, dependent in parse.triples()] for parse in parses]
    [[(('What', 'WP'), 'cop', ('is', 'VBZ')), (('What', 'WP'), 'nsubj', ('airspeed', 'NN')), (('airspeed', 'NN'), 'det', ('the', 'DT')), (('airspeed', 'NN'), 'nmod', ('swallow', 'VB')), (('swallow', 'VB'), 'case', ('of', 'IN')), (('swallow', 'VB'), 'det', ('an', 'DT')), (('swallow', 'VB'), 'amod', ('unladen', 'JJ')), (('What', 'WP'), 'punct', ('?', '.'))]]
    
    
    # Tokenizer
    >>> parser = CoreNLPParser(url='http://localhost:9000')
    >>> list(parser.tokenize('What is the airspeed of an unladen swallow?'))
    ['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']
    
    # POS Tagger
    >>> pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
    >>> list(pos_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))
    [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
    
    # NER Tagger
    >>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
    >>> list(ner_tagger.tag(('Rami Eid is studying at Stony Brook University in NY'.split())))
    [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'STATE_OR_PROVINCE')]
    
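    The constituency parses come back as ordinary nltk.Tree objects, and the dependency triples are plain tuples, so both are easy to post-process without the server. A minimal sketch, rebuilding the tree from the bracketed output shown above and filtering the triples list (the literal values are copied from the transcript; no CoreNLP server is needed for this part):

```python
from nltk import Tree

# Rebuild the constituency tree shown above from its bracketed form.
tree = Tree.fromstring(
    "(ROOT (SBARQ (WHNP (WP What)) (SQ (VBZ is)"
    " (NP (NP (DT the) (NN airspeed))"
    " (PP (IN of) (NP (DT an) (JJ unladen)))"
    " (S (VP (VB swallow))))) (. ?)))"
)
print(tree.leaves())  # the tokens, in order
print(tree.pos())     # (token, POS) pairs

# The dependency triples are (governor, relation, dependent) tuples,
# so standard list comprehensions work for filtering.
triples = [
    (('What', 'WP'), 'cop', ('is', 'VBZ')),
    (('What', 'WP'), 'nsubj', ('airspeed', 'NN')),
    (('airspeed', 'NN'), 'det', ('the', 'DT')),
    (('airspeed', 'NN'), 'nmod', ('swallow', 'VB')),
    (('swallow', 'VB'), 'case', ('of', 'IN')),
    (('swallow', 'VB'), 'det', ('an', 'DT')),
    (('swallow', 'VB'), 'amod', ('unladen', 'JJ')),
    (('What', 'WP'), 'punct', ('?', '.')),
]
subjects = [(gov[0], dep[0]) for gov, rel, dep in triples if rel == 'nsubj']
print(subjects)  # [('What', 'airspeed')]
```

    The same Tree methods (leaves(), pos(), subtrees()) work for the non-English parses below as well.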

    Chinese

    Start the server a little differently, still from the stanford-corenlp-full-2018-02-27 directory:

    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-chinese.properties \
    -preload tokenize,ssplit,pos,lemma,ner,parse \
    -status_port 9001  -port 9001 -timeout 15000
    

    In Python:

    >>> parser = CoreNLPParser('http://localhost:9001')
    >>> list(parser.tokenize(u'我家没有电脑。'))
    ['我家', '没有', '电脑', '。']
    
    >>> list(parser.parse(parser.tokenize(u'我家没有电脑。')))
    [Tree('ROOT', [Tree('IP', [Tree('IP', [Tree('NP', [Tree('NN', ['我家'])]), Tree('VP', [Tree('VE', ['没有']), Tree('NP', [Tree('NN', ['电脑'])])])]), Tree('PU', ['。'])])])]
    

    Arabic

    Start the server:

    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-arabic.properties \
    -preload tokenize,ssplit,pos,parse \
    -status_port 9005  -port 9005 -timeout 15000
    

    In Python:

    >>> from nltk.parse import CoreNLPParser
    >>> parser = CoreNLPParser('http://localhost:9005')
    >>> text = u'انا حامل'
    
    # Parser.
    >>> list(parser.raw_parse(text))
    [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]
    >>> list(parser.parse(parser.tokenize(text)))
    [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]
    
    # Tokenizer / Segmenter.
    >>> list(parser.tokenize(text))
    ['انا', 'حامل']
    
    # POS tagger
    >>> pos_tagger = CoreNLPParser('http://localhost:9005', tagtype='pos')
    >>> list(pos_tagger.tag(parser.tokenize(text)))
    [('انا', 'PRP'), ('حامل', 'NN')]
    
    
    # NER tagger
    >>> ner_tagger = CoreNLPParser('http://localhost:9005', tagtype='ner')
    >>> list(ner_tagger.tag(parser.tokenize(text)))
    [('انا', 'O'), ('حامل', 'O')]
    

    French

    Start the server:

    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-french.properties \
    -preload tokenize,ssplit,pos,parse \
    -status_port 9004  -port 9004 -timeout 15000
    

    In Python:

    >>> parser = CoreNLPParser('http://localhost:9004')
    >>> list(parser.parse('Je suis enceinte'.split()))
    [Tree('ROOT', [Tree('SENT', [Tree('NP', [Tree('PRON', ['Je']), Tree('VERB', ['suis']), Tree('AP', [Tree('ADJ', ['enceinte'])])])])])]
    >>> pos_tagger = CoreNLPParser('http://localhost:9004', tagtype='pos')
    >>> pos_tagger.tag('Je suis enceinte'.split())
    [('Je', 'PRON'), ('suis', 'VERB'), ('enceinte', 'ADJ')]
    

    German

    Start the server:

    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-german.properties \
    -preload tokenize,ssplit,pos,ner,parse \
    -status_port 9002  -port 9002 -timeout 15000
    

    In Python:

    >>> parser = CoreNLPParser('http://localhost:9002')
    >>> list(parser.raw_parse('Ich bin schwanger'))
    [Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]
    >>> list(parser.parse('Ich bin schwanger'.split()))
    [Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]
    
    
    >>> pos_tagger = CoreNLPParser('http://localhost:9002', tagtype='pos')
    >>> pos_tagger.tag('Ich bin schwanger'.split())
    [('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]
    
    >>> ner_tagger = CoreNLPParser('http://localhost:9002', tagtype='ner')
    >>> ner_tagger.tag('Donald Trump besuchte Angela Merkel in Berlin.'.split())
    [('Donald', 'PERSON'), ('Trump', 'PERSON'), ('besuchte', 'O'), ('Angela', 'PERSON'), ('Merkel', 'PERSON'), ('in', 'O'), ('Berlin', 'LOCATION'), ('.', 'O')]
    

    Spanish

    Start the server:

    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-spanish.properties \
    -preload tokenize,ssplit,pos,ner,parse \
    -status_port 9003  -port 9003 -timeout 15000
    

    In Python:

    >>> pos_tagger = CoreNLPParser('http://localhost:9003', tagtype='pos')
    >>> pos_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
    [('Barack', 'PROPN'), ('Obama', 'PROPN'), ('salió', 'VERB'), ('con', 'ADP'), ('Michael', 'PROPN'), ('Jackson', 'PROPN'), ('.', 'PUNCT')]
    >>> ner_tagger = CoreNLPParser('http://localhost:9003', tagtype='ner')
    >>> ner_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
    [('Barack', 'PERSON'), ('Obama', 'PERSON'), ('salió', 'O'), ('con', 'O'), ('Michael', 'PERSON'), ('Jackson', 'PERSON'), ('.', 'O')]
    
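    Across all of these languages, the NER taggers label each token separately, so contiguous tokens sharing a non-'O' tag usually belong to one entity. A small merging sketch in plain Python, using the literal English output from earlier as input (note it will also merge two distinct adjacent entities of the same type, since the tags carry no begin/inside markers):

```python
from itertools import groupby

tagged = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
          ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
          ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'STATE_OR_PROVINCE')]

# Merge runs of identically tagged tokens into (entity, tag) pairs.
entities = [(' '.join(tok for tok, _ in grp), tag)
            for tag, grp in groupby(tagged, key=lambda pair: pair[1])
            if tag != 'O']
print(entities)
# [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'STATE_OR_PROVINCE')]
```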
