Extracting “((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun” from Text (Justeson & Katz, 1995)

蓝咒 submitted on 2019-11-29 08:47:43

Install the packages with:

install.packages("openNLP")
install.packages("openNLPmodels.en")

After that, you can run the code above. It POS-tags every word in the text and returns the original text with each word labelled as noun, verb, etc. In this example the result is as follows:

> acqTag
[1] "This/DT paper/NN describes/VBZ a/DT novel/NN optical/JJ thread/NN plug/NN gauge/NN (OTPG)/NN for/IN internal/JJ thread/NN inspection/NN using/VBG machine/NN vision./NN The/DT OTPG/NNP is/VBZ composed/VBN of/IN a/DT rigid/JJ industrial/JJ endoscope,/NNS a/DT charge-coupled/JJ device/NN camera,/VBD and/CC a/DT two/CD degree-of-freedom/NN motion/NN control/NN unit./NN A/DT sequence/NN of/IN partial/JJ wall/NN images/NNS of/IN an/DT internal/JJ thread/NN are/VBP retrieved/VBN and/CC reconstructed/VBN into/IN a/DT 2D/JJ unwrapped/JJ image./NN Then,/IN a/DT digital/JJ image/NN processing/NN and/CC classification/NN procedure/NN is/VBZ used/VBN to/TO normalize,/JJ segment,/NN and/CC determine/VB the/DT quality/NN of/IN the/DT internal/JJ thread./NN"

After each word, separated by a slash, comes its POS tag. To separate the tags from the words, you can first split the text into tokens, as you did in your example:

acqTagSplit = strsplit(acqTag," ")
acqTagSplit
    [[1]]
     [1] "This/DT"              "paper/NN"             "describes/VBZ"       
     [4] "a/DT"                 "novel/NN"             "optical/JJ"          
     [7] "thread/NN"            "plug/NN"              "gauge/NN"            
    [10] "(OTPG)/NN"            "for/IN"               "internal/JJ"         
    [13] "thread/NN"            "inspection/NN"        "using/VBG"           
    [16] "machine/NN"           "vision./NN"           "The/DT"              
    [19] "OTPG/NNP"             "is/VBZ"               "composed/VBN"        
    [22] "of/IN"                "a/DT"                 "rigid/JJ"            
    [25] "industrial/JJ"        "endoscope,/NNS"       "a/DT"                
    [28] "charge-coupled/JJ"    "device/NN"            "camera,/VBD"         
    [31] "and/CC"               "a/DT"                 "two/CD"              
    [34] "degree-of-freedom/NN" "motion/NN"            "control/NN"          
    [37] "unit./NN"             "A/DT"                 "sequence/NN"         
    [40] "of/IN"                "partial/JJ"           "wall/NN"             
    [43] "images/NNS"           "of/IN"                "an/DT"               
    [46] "internal/JJ"          "thread/NN"            "are/VBP"             
    [49] "retrieved/VBN"        "and/CC"               "reconstructed/VBN"   
    [52] "into/IN"              "a/DT"                 "2D/JJ"               
    [55] "unwrapped/JJ"         "image./NN"            "Then,/IN"            
    [58] "a/DT"                 "digital/JJ"           "image/NN"            
    [61] "processing/NN"        "and/CC"               "classification/NN"   
    [64] "procedure/NN"         "is/VBZ"               "used/VBN"            
    [67] "to/TO"                "normalize,/JJ"        "segment,/NN"         
    [70] "and/CC"               "determine/VB"         "the/DT"              
    [73] "quality/NN"           "of/IN"                "the/DT"              
    [76] "internal/JJ"          "thread./NN"          

Then split each word from its POS tag:

strsplit(acqTagSplit[[1]], "/")

The result is a list with one element per token; each element holds the word first and its tag second. See:

str(strsplit(acqTagSplit[[1]], "/"))
List of 77
 $ : chr [1:2] "This" "DT"
 $ : chr [1:2] "paper" "NN"
 $ : chr [1:2] "describes" "VBZ"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "novel" "NN"
 $ : chr [1:2] "optical" "JJ"
 $ : chr [1:2] "thread" "NN"
 $ : chr [1:2] "plug" "NN"
 $ : chr [1:2] "gauge" "NN"
 $ : chr [1:2] "(OTPG)" "NN"
 $ : chr [1:2] "for" "IN"
 $ : chr [1:2] "internal" "JJ"
 $ : chr [1:2] "thread" "NN"
 $ : chr [1:2] "inspection" "NN"
 $ : chr [1:2] "using" "VBG"
 $ : chr [1:2] "machine" "NN"
 $ : chr [1:2] "vision." "NN"
 $ : chr [1:2] "The" "DT"
 $ : chr [1:2] "OTPG" "NNP"
 $ : chr [1:2] "is" "VBZ"
 $ : chr [1:2] "composed" "VBN"
 $ : chr [1:2] "of" "IN"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "rigid" "JJ"
 $ : chr [1:2] "industrial" "JJ"
 $ : chr [1:2] "endoscope," "NNS"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "charge-coupled" "JJ"
 $ : chr [1:2] "device" "NN"
 $ : chr [1:2] "camera," "VBD"
 $ : chr [1:2] "and" "CC"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "two" "CD"
 $ : chr [1:2] "degree-of-freedom" "NN"
 $ : chr [1:2] "motion" "NN"
 $ : chr [1:2] "control" "NN"
 $ : chr [1:2] "unit." "NN"
 $ : chr [1:2] "A" "DT"
 $ : chr [1:2] "sequence" "NN"
 $ : chr [1:2] "of" "IN"
 $ : chr [1:2] "partial" "JJ"
 $ : chr [1:2] "wall" "NN"
 $ : chr [1:2] "images" "NNS"
 $ : chr [1:2] "of" "IN"
 $ : chr [1:2] "an" "DT"
 $ : chr [1:2] "internal" "JJ"
 $ : chr [1:2] "thread" "NN"
 $ : chr [1:2] "are" "VBP"
 $ : chr [1:2] "retrieved" "VBN"
 $ : chr [1:2] "and" "CC"
 $ : chr [1:2] "reconstructed" "VBN"
 $ : chr [1:2] "into" "IN"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "2D" "JJ"
 $ : chr [1:2] "unwrapped" "JJ"
 $ : chr [1:2] "image." "NN"
 $ : chr [1:2] "Then," "IN"
 $ : chr [1:2] "a" "DT"
 $ : chr [1:2] "digital" "JJ"
 $ : chr [1:2] "image" "NN"
 $ : chr [1:2] "processing" "NN"
 $ : chr [1:2] "and" "CC"
 $ : chr [1:2] "classification" "NN"
 $ : chr [1:2] "procedure" "NN"
 $ : chr [1:2] "is" "VBZ"
 $ : chr [1:2] "used" "VBN"
 $ : chr [1:2] "to" "TO"
 $ : chr [1:2] "normalize," "JJ"
 $ : chr [1:2] "segment," "NN"
 $ : chr [1:2] "and" "CC"
 $ : chr [1:2] "determine" "VB"
 $ : chr [1:2] "the" "DT"
 $ : chr [1:2] "quality" "NN"
 $ : chr [1:2] "of" "IN"
 $ : chr [1:2] "the" "DT"
 $ : chr [1:2] "internal" "JJ"
 $ : chr [1:2] "thread." "NN"
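
A list of word/tag pairs is awkward to work with, so it may help to turn it into a data frame and to collapse the tags into the classes the pattern talks about. The following is only a sketch: the names `tagged`, `tokens` and `class` are made up for illustration, and the mapping (JJ* to adjective, NN* to noun, IN to preposition) is an assumption about how the Penn Treebank tags line up with Adj, Noun and the preposition in Noun-Prep.

```r
# A few tagged tokens copied from the output above
tagged <- c("optical/JJ", "thread/NN", "plug/NN", "gauge/NN", "(OTPG)/NN",
            "for/IN", "internal/JJ", "thread/NN", "inspection/NN", "vision./NN")

# Split on the LAST slash only, so a word that itself contains "/" stays intact
word <- sub("/[^/]*$", "", tagged)
tag  <- sub("^.*/", "", tagged)

# Strip punctuation that the whitespace split left attached to the words
word <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", word)

tokens <- data.frame(word = word, tag = tag, stringsAsFactors = FALSE)

# Assumed mapping: A = adjective (JJ*), N = noun (NN*),
# P = preposition (IN), O = anything else
tokens$class <- ifelse(grepl("^JJ", tokens$tag), "A",
                ifelse(grepl("^NN", tokens$tag), "N",
                ifelse(tokens$tag == "IN", "P", "O")))

print(tokens)
```

With the words, tags and classes side by side in one table, the remaining work is purely about the sequence in the `class` column.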

It seems you need to understand the regular expression ((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun, convert it to a DFA (deterministic finite automaton), and then run that DFA in R.
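
As a rough illustration of what "running the DFA in R" could look like, here is a hand-built transition table for the pattern exactly as it is written above. Several things are assumptions of this sketch rather than anything stated in the question: Noun-Prep is read as a noun followed by a preposition, the tags are mapped as JJ* = A, NN* = N, IN = P, and the names `delta`, `longest_match` and `extract_phrases` are invented for the example.

```r
# States: 1 start, 2 one A/N read, 3 "x N" read (accepting),
# 4 longer A/N run not ending in N, 5 longer A/N run ending in N (accepting),
# 6 "x N P" read, 7 "x N P y" read, 8 "x N P y N" read (accepting).
# 0 is the dead state (no match possible any more).
delta <- matrix(0L, nrow = 8, ncol = 3, dimnames = list(NULL, c("A", "N", "P")))
delta[1, c("A", "N")] <- 2L
delta[2, "A"] <- 4L; delta[2, "N"] <- 3L
delta[3, "A"] <- 4L; delta[3, "N"] <- 5L; delta[3, "P"] <- 6L
delta[4, "A"] <- 4L; delta[4, "N"] <- 5L
delta[5, "A"] <- 4L; delta[5, "N"] <- 5L
delta[6, c("A", "N")] <- 7L
delta[7, "N"] <- 8L
accepting <- c(3L, 5L, 8L)

# Length of the longest accepted run that starts at position i (0 if none)
longest_match <- function(cls, i) {
  state <- 1L
  best <- 0L
  j <- i
  while (j <= length(cls) && cls[j] %in% colnames(delta)) {
    state <- delta[state, cls[j]]
    if (state == 0L) break
    if (state %in% accepting) best <- j - i + 1L
    j <- j + 1L
  }
  best
}

# Scan left to right, keep the longest match at each start, then jump past it
extract_phrases <- function(words, cls) {
  out <- character(0)
  i <- 1L
  while (i <= length(cls)) {
    len <- longest_match(cls, i)
    if (len > 0L) {
      out <- c(out, paste(words[i:(i + len - 1L)], collapse = " "))
      i <- i + len
    } else {
      i <- i + 1L
    }
  }
  out
}

# First sentence of the example, punctuation already stripped
words <- c("This", "paper", "describes", "a", "novel", "optical", "thread",
           "plug", "gauge", "OTPG", "for", "internal", "thread", "inspection",
           "using", "machine", "vision")
tags  <- c("DT", "NN", "VBZ", "DT", "NN", "JJ", "NN", "NN", "NN", "NN", "IN",
           "JJ", "NN", "NN", "VBG", "NN", "NN")
cls <- ifelse(grepl("^JJ", tags), "A",
       ifelse(grepl("^NN", tags), "N",
       ifelse(tags == "IN", "P", "O")))

phrases <- extract_phrases(words, cls)
print(phrases)
```

On this sentence the sketch returns "novel optical thread plug gauge OTPG", "internal thread inspection" and "machine vision". Whether the longest match or every sub-match should be kept is a design choice the pattern itself does not settle.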

Here a regular language is described by a regular expression. Unlike the usual use of regular expressions in text processing, the "symbols" are not single characters but adjectives, nouns and noun prepositions. Once you understand the theory (automata theory), you will be able to implement the DFA easily in R (or whatever programming language you choose).
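
Since each symbol is a tag class rather than a character, one possible shortcut is to encode every token as a single letter and let R's own regex engine do the matching on that string. This is a swapped-in technique, not a hand-built automaton, and it rests on assumptions: JJ* = A, NN* = N, IN = P, everything else = O, and Noun-Prep read as the two symbols N P. The alternation is also reordered (same language as the pattern above, preposition branch first), because with `perl = TRUE` the engine keeps the first branch that succeeds rather than the longest one.

```r
# First sentence of the example, punctuation already stripped
words <- c("This", "paper", "describes", "a", "novel", "optical", "thread",
           "plug", "gauge", "OTPG", "for", "internal", "thread", "inspection",
           "using", "machine", "vision")
tags  <- c("DT", "NN", "VBZ", "DT", "NN", "JJ", "NN", "NN", "NN", "NN", "IN",
           "JJ", "NN", "NN", "VBG", "NN", "NN")

# One character per token, so character positions equal token positions
cls <- ifelse(grepl("^JJ", tags), "A",
       ifelse(grepl("^NN", tags), "N",
       ifelse(tags == "IN", "P", "O")))
symbols <- paste(cls, collapse = "")

# Equivalent to ((A|N)+|((A|N)(NP)?)(A|N))N, with the "x N P y N" case
# listed first so that it wins over the shorter "x N" match
pattern <- "(A|N)NP(A|N)N|(A|N)+N"
m <- gregexpr(pattern, symbols, perl = TRUE)[[1]]

starts <- as.integer(m)
ends   <- starts + attr(m, "match.length") - 1L

# gregexpr() reports -1 when nothing matched
phrases <- if (starts[1] == -1L) character(0) else
  mapply(function(s, e) paste(words[s:e], collapse = " "), starts, ends)

print(phrases)
```

For this sentence the matches are "novel optical thread plug gauge OTPG", "internal thread inspection" and "machine vision". The approach only works while every class fits in one character, which is why the mapping step matters.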

The problem is not R; the problem is that you don't understand the theory.
