R construct document term matrix how to match dictionaries whose values consist of white-space separated phrases

忘了有多久 2021-01-16 11:12

When doing text mining in R, after preprocessing the text data we need to create a document-term matrix for further exploration. But, similar to Chinese, English also has some c

1 answer
  • 2021-01-16 11:57

    It's possible to do this with quanteda, although it requires the construction of a dictionary for each phrase, and then pre-processing the text to convert the phrases into tokens. To become a "token", the phrases need to be joined by something other than whitespace -- here, the "_" character.

    Here are some example texts, including the phrase in the OP. I added two additional texts for the illustration -- below, the first row of the document-feature matrix produces the requested answer.

    library(quanteda)

    txt <- c("We could use machine learning method to calculate the words semantic distance.",
             "Machine learning is the best sort of learning.",
             "The distance between semantic distance and machine learning is machine driven.")
    

    The phrases are passed to tokens_compound() as its pattern argument, which can be a dictionary or a collocations object. Here we will make it a dictionary:

    mydict <- dictionary(list(machine_learning = "machine learning", 
                              semantic_distance = "semantic distance"))
    

    Then we pre-process the text to convert the dictionary phrases to their keys:

    toks <- tokens_compound(tokens(txt), mydict)
    toks
    # tokens from 3 documents.
    # text1 :
    # [1] "We"                "could"             "use"               "machine_learning" 
    # [5] "method"            "to"                "calculate"         "the"              
    # [9] "words"             "semantic_distance" "."                
    # 
    # text2 :
    # [1] "Machine_learning" "is"               "the"              "best"            
    # [5] "sort"             "of"               "learning"         "."               
    # 
    # text3 :
    # [1] "The"               "distance"          "between"           "semantic_distance"
    # [5] "and"               "machine_learning"  "is"                "machine"          
    # [9] "driven"            "."    
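
    To make the compounding step concrete, here is a rough base-R sketch of the same idea. It is not how quanteda implements tokens_compound(): the helper compound_phrases() and the phrases vector are hypothetical names introduced for illustration, it works on raw strings rather than tokens, and it assumes the words of each phrase are separated by single spaces.

```r
# Example texts from the answer above
txt <- c("We could use machine learning method to calculate the words semantic distance.",
         "Machine learning is the best sort of learning.",
         "The distance between semantic distance and machine learning is machine driven.")

# Hypothetical phrase list (assumption: words separated by single spaces)
phrases <- c("machine learning", "semantic distance")

# Hypothetical helper, not part of quanteda: joins each phrase with "_"
compound_phrases <- function(text, phrases) {
  for (p in phrases) {
    words <- strsplit(p, " ", fixed = TRUE)[[1]]
    # Build e.g. "\\b(machine) (learning)\\b" so each word is a capture group
    pattern <- paste0("\\b", paste0("(", words, ")", collapse = " "), "\\b")
    # Build e.g. "\\1_\\2" so the original capitalisation is kept
    replacement <- paste0("\\", seq_along(words), collapse = "_")
    text <- gsub(pattern, replacement, text, ignore.case = TRUE)
  }
  text
}

out <- compound_phrases(txt, phrases)
print(out)
```

    Using back-references rather than a fixed replacement string keeps "Machine" capitalised in the second text, which matches the token output shown above.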
    

    Finally, we can construct the document-feature matrix, keeping all phrases using the default "glob" pattern match for any feature that includes the underscore character:

    mydfm <- dfm_select(dfm(toks), "*_*")
    mydfm
    ## Document-feature matrix of: 3 documents, 2 features.
    ## 3 x 2 sparse Matrix of class "dfm"
    ##        features
    ## docs    machine_learning semantic_distance
    ##   text1                1                 1
    ##   text2                1                 0
    ##   text3                1                 1
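
    For readers who want to see what the selection amounts to, the following base-R sketch rebuilds the same counts by hand from the tokens printed earlier. It is only an illustration under stated assumptions, not quanteda's method: the token list is copied from the output above, tokens are lowercased, and the "*_*" glob is approximated as "contains an underscore". The names toks_list, features and counts are hypothetical.

```r
# Tokens copied from the tokens_compound() output above (assumption: unchanged)
toks_list <- list(
  text1 = c("We", "could", "use", "machine_learning", "method", "to",
            "calculate", "the", "words", "semantic_distance", "."),
  text2 = c("Machine_learning", "is", "the", "best", "sort", "of",
            "learning", "."),
  text3 = c("The", "distance", "between", "semantic_distance", "and",
            "machine_learning", "is", "machine", "driven", ".")
)

# Lowercase the tokens so "Machine_learning" and "machine_learning" count together
toks_list <- lapply(toks_list, tolower)

# Keep only features containing "_" (a stand-in for the "*_*" glob)
all_tokens <- unlist(toks_list, use.names = FALSE)
features <- sort(unique(all_tokens[grepl("_", all_tokens, fixed = TRUE)]))

# Document-by-feature count matrix, filled one cell at a time
counts <- matrix(0L, nrow = length(toks_list), ncol = length(features),
                 dimnames = list(names(toks_list), features))
for (d in names(toks_list)) {
  for (f in features) {
    counts[d, f] <- sum(toks_list[[d]] == f)
  }
}
print(counts)
```

    The resulting matrix has one row per text and one column per compound phrase, with the same counts as the document-feature matrix above.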
    

    (Answer updated for >= v0.9.9)
