Extracting “((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun” from Text (Justeson & Katz, 1995)

Question

Is it possible to extract the pattern ((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun, proposed by Justeson and Katz (1995), with the R package openNLP?

That is, I would like to use this linguistic filtering to extract candidate noun phrases.

I do not fully understand what the pattern means.

Could you explain it, or translate this representation into R?

Many thanks.

Maybe we can start the sample code from:

library("openNLP")  

acq <- "This paper describes a novel optical thread plug
gauge (OTPG) for internal thread inspection using machine
vision. The OTPG is composed of a rigid industrial
endoscope, a charge-coupled device camera, and a two
degree-of-freedom motion control unit. A sequence of
partial wall images of an internal thread are retrieved and
reconstructed into a 2D unwrapped image. Then, a digital
image processing and classification procedure is used to
normalize, segment, and determine the quality of the
internal thread." 

acqTag <- tagPOS(acq)     

acqTagSplit = strsplit(acqTag," ")

I was told to open a new question for this. The original question is here.

Answer 1:

Install the packages with:

install.packages("openNLP")
install.packages("openNLPmodels.en")

Afterwards you can run the code above. It POS-tags every word in the text and returns the original text with each word labeled as a noun, verb, etc. For this example the result is as follows:

> acqTag
[1] "This/DT paper/NN describes/VBZ a/DT novel/NN optical/JJ thread/NN plug/NN gauge/NN (OTPG)/NN for/IN internal/JJ thread/NN inspection/NN using/VBG machine/NN vision./NN The/DT OTPG/NNP is/VBZ composed/VBN of/IN a/DT rigid/JJ industrial/JJ endoscope,/NNS a/DT charge-coupled/JJ device/NN camera,/VBD and/CC a/DT two/CD degree-of-freedom/NN motion/NN control/NN unit./NN A/DT sequence/NN of/IN partial/JJ wall/NN images/NNS of/IN an/DT internal/JJ thread/NN are/VBP retrieved/VBN and/CC reconstructed/VBN into/IN a/DT 2D/JJ unwrapped/JJ image./NN Then,/IN a/DT digital/JJ image/NN processing/NN and/CC classification/NN procedure/NN is/VBZ used/VBN to/TO normalize,/JJ segment,/NN and/CC determine/VB the/DT quality/NN of/IN the/DT internal/JJ thread./NN"

Each word is followed by its POS tag, separated by a slash. To separate the tags from the words, first split the text into tokens, as you did in your example:

acqTagSplit = strsplit(acqTag," ")
acqTagSplit
    [[1]]
     [1] "This/DT"              "paper/NN"             "describes/VBZ"       
     [4] "a/DT"                 "novel/NN"             "optical/JJ"          
     [7] "thread/NN"            "plug/NN"              "gauge/NN"            
    [10] "(OTPG)/NN"            "for/IN"               "internal/JJ"         
    [13] "thread/NN"            "inspection/NN"        "using/VBG"           
    [16] "machine/NN"           "vision./NN"           "The/DT"              
    [19] "OTPG/NNP"             "is/VBZ"               "composed/VBN"        
    [22] "of/IN"                "a/DT"                 "rigid/JJ"            
    [25] "industrial/JJ"        "endoscope,/NNS"       "a/DT"                
    [28] "charge-coupled/JJ"    "device/NN"            "camera,/VBD"         
    [31] "and/CC"               "a/DT"                 "two/CD"              
    [34] "degree-of-freedom/NN" "motion/NN"            "control/NN"          
    [37] "unit./NN"             "A/DT"                 "sequence/NN"         
    [40] "of/IN"                "partial/JJ"           "wall/NN"             
    [43] "images/NNS"           "of/IN"                "an/DT"               
    [46] "internal/JJ"          "thread/NN"            "are/VBP"             
    [49] "retrieved/VBN"        "and/CC"               "reconstructed/VBN"   
    [52] "into/IN"              "a/DT"                 "2D/JJ"               
    [55] "unwrapped/JJ"         "image./NN"            "Then,/IN"            
    [58] "a/DT"                 "digital/JJ"           "image/NN"            
    [61] "processing/NN"        "and/CC"               "classification/NN"   
    [64] "procedure/NN"         "is/VBZ"               "used/VBN"            
    [67] "to/TO"                "normalize,/JJ"        "segment,/NN"         
    [70] "and/CC"               "determine/VB"         "the/DT"              
    [73] "quality/NN"           "of/IN"                "the/DT"              
    [76] "internal/JJ"          "thread./NN"

Then split each word from its POS tag:

strsplit(acqTagSplit[[1]], "/")

You get a list with one element per token; each element holds the word first and its tag second. See:

str(strsplit(acqTagSplit[[1]], "/"))
List of 77
 $ : chr [1:2] "This" "DT"
 $ : chr [1:2] "paper" "NN"
 $ : chr [1:2] "describes" "VBZ"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "novel" "NN"
 $ : chr [1:2] "optical" "JJ"
 $ : chr [1:2] "thread" "NN"
 $ : chr [1:2] "plug" "NN"
 $ : chr [1:2] "gauge" "NN"
 $ : chr [1:2] "(OTPG)" "NN"
 $ : chr [1:2] "for" "IN"
 $ : chr [1:2] "internal" "JJ"
 $ : chr [1:2] "thread" "NN"
 $ : chr [1:2] "inspection" "NN"
 $ : chr [1:2] "using" "VBG"
 $ : chr [1:2] "machine" "NN"
 $ : chr [1:2] "vision." "NN"
 $ : chr [1:2] "The" "DT"
 $ : chr [1:2] "OTPG" "NNP"
 $ : chr [1:2] "is" "VBZ"
 $ : chr [1:2] "composed" "VBN"
 $ : chr [1:2] "of" "IN"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "rigid" "JJ"
 $ : chr [1:2] "industrial" "JJ"
 $ : chr [1:2] "endoscope," "NNS"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "charge-coupled" "JJ"
 $ : chr [1:2] "device" "NN"
 $ : chr [1:2] "camera," "VBD"
 $ : chr [1:2] "and" "CC"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "two" "CD"
 $ : chr [1:2] "degree-of-freedom" "NN"
 $ : chr [1:2] "motion" "NN"
 $ : chr [1:2] "control" "NN"
 $ : chr [1:2] "unit." "NN"
 $ : chr [1:2] "A" "DT"
 $ : chr [1:2] "sequence" "NN"
 $ : chr [1:2] "of" "IN"
 $ : chr [1:2] "partial" "JJ"
 $ : chr [1:2] "wall" "NN"
 $ : chr [1:2] "images" "NNS"
 $ : chr [1:2] "of" "IN"
 $ : chr [1:2] "an" "DT"
 $ : chr [1:2] "internal" "JJ"
 $ : chr [1:2] "thread" "NN"
 $ : chr [1:2] "are" "VBP"
 $ : chr [1:2] "retrieved" "VBN"
 $ : chr [1:2] "and" "CC"
 $ : chr [1:2] "reconstructed" "VBN"
 $ : chr [1:2] "into" "IN"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "2D" "JJ"
 $ : chr [1:2] "unwrapped" "JJ"
 $ : chr [1:2] "image." "NN"
 $ : chr [1:2] "Then," "IN"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "digital" "JJ"
 $ : chr [1:2] "image" "NN"
 $ : chr [1:2] "processing" "NN"
 $ : chr [1:2] "and" "CC"
 $ : chr [1:2] "classification" "NN"
 $ : chr [1:2] "procedure" "NN"
 $ : chr [1:2] "is" "VBZ"
 $ : chr [1:2] "used" "VBN"
 $ : chr [1:2] "to" "TO"
 $ : chr [1:2] "normalize," "JJ"
 $ : chr [1:2] "segment," "NN"
 $ : chr [1:2] "and" "CC"
 $ : chr [1:2] "determine" "VB"
 $ : chr [1:2] "the" "DT"
 $ : chr [1:2] "quality" "NN"
 $ : chr [1:2] "of" "IN"
 $ : chr [1:2] "the" "DT"
 $ : chr [1:2] "internal" "JJ"
 $ : chr [1:2] "thread." "NN"
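The output above still has punctuation glued to some words ("vision.", "(OTPG)"), and splitting on every "/" would break a token that itself contains a slash. The following is a minimal base-R sketch of one way to handle both, assuming the "word/TAG" string format shown above; `split_tagged` is a hypothetical helper name, not part of openNLP.

```r
# Hypothetical helper: turn "word/TAG word/TAG ..." into a data frame.
split_tagged <- function(tagged) {
  tokens <- strsplit(tagged, " +")[[1]]
  tokens <- tokens[nzchar(tokens)]
  # Split on the LAST slash only, so a word containing "/" stays intact.
  word <- sub("/[^/]*$", "", tokens)
  tag  <- sub("^.*/", "", tokens)
  # Strip punctuation that stayed attached to the word, e.g. "vision.".
  word <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", word)
  data.frame(word = word, tag = tag, stringsAsFactors = FALSE)
}

# Short excerpt of the tagged text from above.
excerpt <- "for/IN internal/JJ thread/NN inspection/NN using/VBG machine/NN vision./NN"
tagged_df <- split_tagged(excerpt)
print(tagged_df)
```

Note that stripping the punctuation also discards the sentence boundaries, so keep the raw tokens if you need to stop a phrase at the end of a sentence.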

Answer 2:

It seems you need to understand the regular expression ((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun, convert it to a DFA (deterministic finite automaton), and then follow that DFA in R.
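As a rough illustration of "following the DFA in R", here is one possible hand-built automaton for the pattern exactly as written in the question. It rests on two assumptions that are not in the original: the input alphabet is A (adjective), N (noun) and P (preposition), and "Noun-Prep" is read as a noun followed by a preposition. The state names and `dfa_accepts` are made up for this sketch.

```r
# States of a hand-built DFA over the symbols A, N and P. "dead" rejects.
states  <- c("q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "dead")
symbols <- c("A", "N", "P")

# Transition table: rows are current states, columns are input symbols.
# Every transition not set below leads to the dead state.
delta <- matrix("dead", nrow = length(states), ncol = length(symbols),
                dimnames = list(states, symbols))
delta["q0", c("A", "N")] <- "q1"  # first Adj or Noun
delta["q1", "A"] <- "q2"          # inside (Adj|Noun)+, last symbol is Adj
delta["q1", "N"] <- "q3"          # (Adj|Noun) Noun: accepting, P may follow
delta["q2", "A"] <- "q2"
delta["q2", "N"] <- "q4"          # (Adj|Noun)+ Noun: accepting
delta["q3", "A"] <- "q2"
delta["q3", "N"] <- "q4"
delta["q3", "P"] <- "q5"          # the optional Noun-Prep has been read
delta["q4", "A"] <- "q2"
delta["q4", "N"] <- "q4"
delta["q5", c("A", "N")] <- "q6"  # Adj or Noun after the preposition
delta["q6", "N"] <- "q7"          # closing Noun: accepting
accepting <- c("q3", "q4", "q7")

# Run the DFA over a vector of symbols; TRUE if the whole sequence matches.
dfa_accepts <- function(input) {
  state <- "q0"
  for (sym in input) {
    if (!sym %in% symbols) return(FALSE)  # any other tag breaks the phrase
    state <- delta[state, sym]
  }
  state %in% accepting
}

print(dfa_accepts(c("A", "N", "N")))            # e.g. internal thread inspection
print(dfa_accepts(c("N", "N", "P", "A", "N")))  # noun noun prep adj noun
print(dfa_accepts(c("N", "P", "N")))            # rejected by the pattern as written
```

This checks whether one complete tag sequence is a candidate phrase. To scan a whole text you would run it over each window of consecutive tokens, or restart it whenever it reaches the dead state.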

Here you have a description of a regular language through a regular expression. Unlike the common usage of regular expressions in text processing, the "symbols" are not single characters but part-of-speech categories: adjectives, nouns and noun-prepositions. Once you understand the theory (automata theory), you will be able to easily implement the DFA in R (or whatever programming language you choose).

The problem is not R; the problem is that you don't understand the theory.
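One practical shortcut is to encode each tag as a single letter, so that an ordinary character regex engine can run the pattern over the tag sequence. The sketch below is one way to do that in base R, under several assumptions that are not in the original: Penn Treebank tags starting with JJ count as Adj, those starting with NN as Noun, IN as the preposition, and "Noun-Prep" means a noun followed by a preposition. The pattern accepts the same sequences as the one in the question, but with the preposition alternative listed first, because a Perl-style engine takes the first alternative that succeeds. `tag_to_symbol` and `extract_candidates` are hypothetical names.

```r
# Map Penn Treebank tags to one-letter symbols:
# A = adjective, N = noun, P = preposition, x = anything else.
tag_to_symbol <- function(tag) {
  ifelse(grepl("^JJ", tag), "A",
  ifelse(grepl("^NN", tag), "N",
  ifelse(tag == "IN", "P", "x")))
}

# Hypothetical helper: return the word sequences whose tags match the filter.
extract_candidates <- function(words, tags,
                               pattern = "(?:[AN]NP[AN]|[AN]+)N") {
  symbol_string <- paste(tag_to_symbol(tags), collapse = "")
  hits <- gregexpr(pattern, symbol_string, perl = TRUE)[[1]]
  if (hits[1] == -1) return(character(0))
  starts <- as.integer(hits)
  ends   <- starts + attr(hits, "match.length") - 1L
  # One symbol per token, so character positions are token positions.
  mapply(function(s, e) paste(words[s:e], collapse = " "), starts, ends)
}

# Tokens and tags taken from the tagged output in Answer 1.
words <- c("a", "novel", "optical", "thread", "plug", "gauge", "for",
           "internal", "thread", "inspection", "using", "machine", "vision")
tags  <- c("DT", "NN", "JJ", "NN", "NN", "NN", "IN",
           "JJ", "NN", "NN", "VBG", "NN", "NN")
candidates <- extract_candidates(words, tags)
print(candidates)
```

The matches are non-overlapping and found left to right, so shorter phrases nested inside a longer match are not reported separately. The Penn tag IN also covers subordinating conjunctions, so you may want to restrict P to specific words such as "of".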

Source: https://stackoverflow.com/questions/4610974/extracting-adjnounadjnounnoun-prepadjnounnoun-from-text-jus

Tags

text-parsing

linguistics

nlp