Searching Natural Language Sentence Structure

广开言路 2021-02-04 19:54

What's the best way to store and search a database of natural language sentence structure trees?

Using OpenNLP's English Treebank Parser, I can get fairly reliable se

3 Answers
  •  攒了一身酷
    2021-02-04 20:28

    I agree with ffriend that you need a different approach, one that builds on existing work on knowledge bases and natural language search. Storing context-free parse trees in a relational database isn't the problem; the hard part is comparing parse trees meaningfully as part of a search. If you only want to take advantage of a little knowledge about grammatical relations, full parse trees are more complicated than you need. If you simplify the parse into dependency triples, the search problem becomes much easier and you get directly at the grammatical relations you were interested in in the first place. For instance, you could use the Stanford dependency parser, which generates a context-free parse and then extracts dependency triples from it. It produces output like this for "This function uploads files to a remote machine":

    det(function-2, This-1)
    nsubj(uploads-3, function-2)
    dobj(uploads-3, files-4)
    det(machine-8, a-6)
    amod(machine-8, remote-7)
    prep_to(uploads-3, machine-8)
    

    In your database, you could store a simplified subset of these triples associated with the function, e.g.:

    upload_file(): subj(uploads, function)
    upload_file(): obj(uploads, file)
    upload_file(): prep(uploads, machine)
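
    As a rough sketch of that reduction step, the Python below parses the textual dependency output shown earlier and keeps only the subject, object, and preposition relations. The regular expression, the `KEEP` mapping, and the function names are illustrative assumptions, not part of any parser's API. It does no lemmatization, so it yields "files" rather than "file".

```python
import re

# Matches lines such as "nsubj(uploads-3, function-2)".
# The pattern is an assumption based on the sample output above.
DEP_LINE = re.compile(r"^\s*([\w:]+)\((.+)-(\d+)'*,\s*(.+)-(\d+)'*\)\s*$")

# Hypothetical mapping from parser relation names to simplified labels.
KEEP = {"nsubj": "subj", "dobj": "obj"}

SAMPLE = """
det(function-2, This-1)
nsubj(uploads-3, function-2)
dobj(uploads-3, files-4)
det(machine-8, a-6)
amod(machine-8, remote-7)
prep_to(uploads-3, machine-8)
"""


def parse_dependencies(text):
    """Turn dependency lines into (relation, head, dependent) tuples."""
    triples = []
    for line in text.splitlines():
        match = DEP_LINE.match(line)
        if match:
            rel, head, _head_idx, dep, _dep_idx = match.groups()
            triples.append((rel, head, dep))
    return triples


def simplify(triples):
    """Keep subject/object/preposition relations and drop the rest."""
    simplified = []
    for rel, head, dep in triples:
        if rel in KEEP:
            simplified.append((KEEP[rel], head, dep))
        elif rel.startswith("prep_"):
            # Collapse prep_to, prep_from, ... into a single "prep" label.
            simplified.append(("prep", head, dep))
    return simplified


simplified_triples = simplify(parse_dependencies(SAMPLE))
for rel, head, dep in simplified_triples:
    print(f"upload_file(): {rel}({head}, {dep})")
```

    Collapsing every `prep_*` relation into one label throws away which preposition was used; whether that loss matters depends on the domain.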
    

    When people search, parse the query the same way and return the function whose stored triples overlap most with the query's triples, or something along those lines. You will probably also want to weight the different dependency relations, allow partial matches, and so on. It also helps to reduce the words in the triples to lemmas, and perhaps to keep POS tags, depending on what you need.
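
    One way to picture the overlap search is the minimal Python sketch below. The weights, the partial-match rule, the `INDEX` contents, and the second function `delete_file()` are all made up for illustration, and the triples are assumed to be lemmatized already.

```python
# Hypothetical per-relation weights; tune these for your own domain.
WEIGHTS = {"subj": 1.0, "obj": 2.0, "prep": 1.5}

# Hypothetical index: function name -> set of (relation, head, dependent).
INDEX = {
    "upload_file()": {
        ("subj", "upload", "function"),
        ("obj", "upload", "file"),
        ("prep", "upload", "machine"),
    },
    "delete_file()": {
        ("subj", "delete", "function"),
        ("obj", "delete", "file"),
    },
}


def score(query, stored, partial=0.5):
    """Sum relation weights over query triples that match stored triples.

    An exact match earns the full weight. A partial match (same relation,
    and either the head or the dependent agrees) earns a fraction of it.
    """
    total = 0.0
    for q in query:
        weight = WEIGHTS.get(q[0], 1.0)
        if q in stored:
            total += weight
        elif any(s[0] == q[0] and (s[1] == q[1] or s[2] == q[2]) for s in stored):
            total += partial * weight
    return total


def search(query, index=INDEX):
    """Return (function, score) pairs, best first, dropping zero scores."""
    ranked = sorted(
        ((score(query, triples), name) for name, triples in index.items()),
        reverse=True,
    )
    return [(name, s) for s, name in ranked if s > 0]


# Triples for a query like "upload a file to a server".
query = {("obj", "upload", "file"), ("prep", "upload", "server")}
results = search(query)
print(results)
```

    Here `upload_file()` ranks first because it matches the object triple exactly and the preposition triple partially, while `delete_file()` only shares the object "file".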

    There are plenty of people who have worked on natural language search (like Powerset), so be sure to search for existing approaches. My proposed approach here is really minimal and I can think of tons of examples where it will have problems, but I think something along these lines could work reasonably well for a restricted domain.
