Combining a Tokenizer into a Grammar and Parser with NLTK

被撕碎了的回忆 2021-01-31 03:50

I am making my way through the NLTK book and I can't seem to do something that would appear to be a natural first step for building a decent grammar.

My goal is to buil

3 answers
  •  慢半拍i (original poster)
    2021-01-31 04:38

    I know this is a year later but I wanted to add some thoughts.

    I tag a lot of different sentences with parts of speech for a project I'm working on. At first I did as StompChicken suggested: pull the tags out of the (word, tag) tuples and use those tags as the "terminals" (the bottom nodes of the tree once a sentence is completely tagged).
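    A minimal sketch of that tags-as-terminals approach. The hand-written tuples stand in for nltk.pos_tag output so no tagger model is needed, and the three-rule grammar is a toy invented for illustration, not anything from the original project:

```python
import nltk

# Hand-written (word, tag) tuples standing in for nltk.pos_tag output,
# so the sketch runs without downloading a tagger model (assumption).
tagged_sent = [("the", "DT"), ("dog", "NN"), ("chased", "VBD"),
               ("a", "DT"), ("cat", "NN")]

# Toy grammar (illustrative only): the terminals are POS tags, not words.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> 'DT' 'NN'
    VP -> 'VBD' NP
""")
parser = nltk.ChartParser(grammar)

# Pull the tags out of the tuples and parse the tag sequence.
tags = [tag for _, tag in tagged_sent]
trees = list(parser.parse(tags))
for tree in trees:
    print(tree)
```

    The leaves of the resulting tree are tags such as DT and NN; the words themselves never enter the grammar.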

    Ultimately this doesn't suit my goal of marking head nouns in noun phrases: I can't pull the head noun's actual word into the grammar, because the grammar only has the tags.

    So instead I used the set of (word, tag) tuples to build a dictionary keyed by tag, where each value is the list of all words seen with that tag. Then I print this dictionary to the screen or to a grammar.cfg (context-free grammar) file.

    The format I print it in works directly when setting up a parser by loading a grammar file (parser = nltk.load_parser('grammar.cfg')). One of the lines it generates looks like this:

    VBG -> "fencing" | "bonging" | "amounting" | "living" ... over 30 more words...

    So now my grammar has the actual words as terminals and assigns the same tags that nltk.pos_tag does.
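    A rough sketch of how generated lexical lines like that one might be combined with phrase rules and parsed. The hand-tagged tuples and the phrase rules are hypothetical, and the grammar is built in memory with nltk.CFG.fromstring rather than loaded from a file:

```python
from collections import defaultdict

import nltk

# Hand-tagged tuples standing in for nltk.pos_tag output (assumption).
tagged_sent = [("the", "DT"), ("dog", "NN"), ("chased", "VBD"),
               ("a", "DT"), ("cat", "NN")]

# Collect the distinct words seen for each tag.
tag_dict = defaultdict(list)
for word, tag in tagged_sent:
    if word not in tag_dict[tag]:
        tag_dict[tag].append(word)

# One lexical rule per tag, e.g.  NN -> "dog" | "cat"
lexical_rules = "\n".join(
    tag + " -> " + " | ".join('"' + w + '"' for w in words)
    for tag, words in tag_dict.items()
)

# Hypothetical phrase-structure rules; the tags are now nonterminals.
phrase_rules = """
S -> NP VP
NP -> DT NN
VP -> VBD NP
"""

grammar = nltk.CFG.fromstring(phrase_rules + lexical_rules)
parser = nltk.ChartParser(grammar)

# Parse the words themselves; each word sits under its tag in the tree.
trees = list(parser.parse([word for word, _ in tagged_sent]))
for tree in trees:
    print(tree)
```

    Because the words are now the leaves, a head noun can be read straight off the tree. Tags that contain punctuation (for example PRP$) and words that contain a double quote would likely need renaming or escaping before the grammar string parses.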

    Hope this helps anyone else wanting to automate tagging a large corpus and still have the actual words as terminals in their grammar.

    import nltk
    from collections import defaultdict
    
    tag_dict = defaultdict(list)
    
    ...
        """ (Looping through sentences) """
    
            # Tag
            tagged_sent = nltk.pos_tag(tokens)
    
            # Put tags and words into the dictionary
            for word, tag in tagged_sent:
                # defaultdict creates the empty list on first access,
                # so one membership check covers both new and known tags
                if word not in tag_dict[tag]:
                    tag_dict[tag].append(word)
    
    # Printing to screen (Python 3): one rule per tag,
    # e.g.  VBG -> "fencing" | "living"
    for tag, words in tag_dict.items():
        print(tag, "->", " | ".join('"' + word + '"' for word in words))
    
