How to select only a subset of corpus terms for TermDocumentMatrix creation in tm

无人及你 2021-01-22 09:21

I have a huge corpus, and I'm interested only in the appearance of a handful of terms that I know up front. Is there a way to create a term document matrix from the corpus using th

2 Answers
  •  一个人的身影
    2021-01-22 10:10

    You can modify a corpus to keep only the terms you want by building a custom transformation function. See the tm package vignette and the help page for content_transformer for more information:

    library(tm)
    
    # Create a corpus from the text listed below
    corp = VCorpus(VectorSource(doc))
    
    # Custom function to keep only the terms in "pattern" and remove everything else
    (f <- content_transformer(function(x, pattern) 
      regmatches(x, gregexpr(pattern, x, perl=TRUE, ignore.case=TRUE))))
    

    (FYI, the second line of code just above is adapted from this SO answer.)

    # The pattern we'll search for
    keep = "sleep|dream|die"
    
    # Run the transformation function using the pattern above
    tm_map(corp, f, keep)[[1]]
    

    Here's the result of running the transformation function:

    <<PlainTextDocument (metadata: 7)>>
      c("die", "sleep", "sleep", "die", "sleep", "sleep", "Dream")
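
    To see what the matching step does on its own, here is a minimal base-R sketch that needs no tm. The sample sentence and the names x, loose, keep_whole and strict are made up for illustration. It also shows one thing to watch for: a bare alternation such as "sleep|dream|die" matches inside longer words ("died", "dreaming"). Assuming only whole words are wanted, wrapping the pattern in \\b anchors is one way to tighten it.

    ```r
    # Base-R illustration of the matching step; no tm required.
    # The sample sentence is hypothetical, not taken from the corpus above.
    x <- "To die, to sleep, perchance to Dream; he died dreaming"
    keep <- "sleep|dream|die"

    # gregexpr() locates every match; regmatches() extracts the matched text.
    loose <- regmatches(x, gregexpr(keep, x, perl = TRUE, ignore.case = TRUE))[[1]]

    # Assumption: only whole words should count, so add word-boundary anchors.
    keep_whole <- paste0("\\b(", keep, ")\\b")
    strict <- regmatches(x, gregexpr(keep_whole, x, perl = TRUE, ignore.case = TRUE))[[1]]

    print(loose)   # also picks up "die" from "died" and "dream" from "dreaming"
    print(strict)  # whole-word matches only
    ```

    Matching is case-insensitive, but the extracted text keeps its original case ("Dream"), so you may want to lower-case it before counting.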
    

    Here's the original text I used to create the corpus:

    doc = "To be, or not to be, that is the question—
    Whether 'tis Nobler in the mind to suffer
    The Slings and Arrows of outrageous Fortune,
    Or to take Arms against a Sea of troubles,
    And by opposing, end them? To die, to sleep—
    No more; and by a sleep, to say we end
    The Heart-ache, and the thousand Natural shocks
    That Flesh is heir to? 'Tis a consummation
    Devoutly to be wished. To die, to sleep,
    To sleep, perchance to Dream; Aye, there's the rub"
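
    Since the goal in the question is a term-document matrix, another route worth trying is the dictionary entry of the control list, which tm documents as restricting the tabulated terms to a given character vector. The sketch below is only an illustration under stated assumptions: the two-document corpus and the names docs, corp2, keep_terms and tdm are hypothetical, and removePunctuation is switched on because tokens may otherwise keep trailing commas (e.g. "die,") and fail to match the dictionary.

    ```r
    library(tm)

    # Hypothetical two-document corpus, used only for illustration.
    docs <- c("To die, to sleep, perchance to dream",
              "To sleep no more")
    corp2 <- VCorpus(VectorSource(docs))

    # Terms known up front, written in lower case to match the lower-cased tokens.
    keep_terms <- c("sleep", "dream", "die")

    # 'dictionary' limits the matrix rows to the listed terms;
    # 'tolower' and 'removePunctuation' normalise tokens before they are counted.
    tdm <- TermDocumentMatrix(corp2,
                              control = list(dictionary = keep_terms,
                                             tolower = TRUE,
                                             removePunctuation = TRUE))

    inspect(tdm)
    ```

    Unlike the transformation approach, this leaves the corpus untouched and filters only at tabulation time, which may be convenient when the same corpus is reused for other term lists.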
    
