How to select only a subset of corpus terms for TermDocumentMatrix creation in tm

无人及你 2021-01-22 09:21

I have a huge corpus, and I'm interested only in the appearance of a handful of terms that I know up front. Is there a way to create a term document matrix from the corpus using th

2 Answers
  • 2021-01-22 09:58

    Another way to filter a corpus: first set a metadata attribute, say language, on each document. Loop over the elements of the corpus with an index i, check whatever condition you want, and assign the value. Then filter the corpus on that metadata attribute.

    corpusz[[i]]$meta["language"] <- 'tur'
    
    idx <- meta(corpusz, "language") ==  'tur'
    filtered <- corpusz[idx]
    

    Now filtered contains only the corpus elements we want.
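
    The two steps above can be put together as a small, self-contained sketch. The sample texts and the `is_turkish` vector are hypothetical stand-ins for whatever check you actually run on each document; the sketch also uses tm's `meta()` accessor for the assignment rather than `$meta[...]`, which is a stylistic choice, not something the answer above requires.

    ```r
    library(tm)

    # Hypothetical sample corpus with three short documents
    corpusz <- VCorpus(VectorSource(c("merhaba dunya", "hello world", "iyi geceler")))

    # Hypothetical result of "check whatever you want" for each document
    is_turkish <- c(TRUE, FALSE, TRUE)

    # Tag every document with a language code in its local metadata
    for (i in seq_along(corpusz)) {
      meta(corpusz[[i]], "language") <- if (is_turkish[i]) "tur" else "eng"
    }

    # Collect the per-document tags and keep only the documents tagged 'tur'
    idx <- unlist(meta(corpusz, "language", type = "local")) == "tur"
    filtered <- corpusz[idx]
    ```

    Note that this keeps or drops whole documents; it does not remove individual terms from the documents that remain.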

  • 2021-01-22 10:10

    You can modify a corpus to keep only the terms you want by building a custom transformation function. See the vignette for the tm package and the help page for the content_transformer function for more information:

    library(tm)
    
    # Create a corpus from the text listed below (define doc first)
    corp = VCorpus(VectorSource(doc))
    
    # Custom function to keep only the terms in "pattern" and remove everything else
    (f <- content_transformer(function(x, pattern) 
      regmatches(x, gregexpr(pattern, x, perl=TRUE, ignore.case=TRUE))))
    

    (FYI, the second line of code just above is adapted from this SO answer.)

    # The pattern we'll search for
    keep = "sleep|dream|die"
    
    # Run the transformation function using the pattern above
    tm_map(corp, f, keep)[[1]]
    

    Here's the result of running the transformation function:

    <<PlainTextDocument (metadata: 7)>>
      c("die", "sleep", "sleep", "die", "sleep", "sleep", "Dream")
    

    Here's the original text I used to create the corpus:

    doc = "To be, or not to be, that is the question—
    Whether 'tis Nobler in the mind to suffer
    The Slings and Arrows of outrageous Fortune,
    Or to take Arms against a Sea of troubles,
    And by opposing, end them? To die, to sleep—
    No more; and by a sleep, to say we end
    The Heart-ache, and the thousand Natural shocks
    That Flesh is heir to? 'Tis a consummation
    Devoutly to be wished. To die, to sleep,
    To sleep, perchance to Dream; Aye, there's the rub"
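
    If the end goal is only the term document matrix, the terms can also be restricted when the matrix is built, without transforming the corpus first. The `control` list of `TermDocumentMatrix` accepts a `dictionary` entry, a character vector of the terms to tabulate. Below is a minimal sketch under stated assumptions: the short sample text and the name `keep_terms` are illustrative and not part of the answer above, and `removePunctuation = TRUE` is added so that tokens such as "die," still match the dictionary.

    ```r
    library(tm)

    # Illustrative one-document corpus (not the text used above)
    sample_text <- "To die, to sleep, to sleep, perchance to dream"
    corp_small <- VCorpus(VectorSource(sample_text))

    # The handful of terms known up front (hypothetical name)
    keep_terms <- c("sleep", "dream", "die")

    # Tabulate only the dictionary terms; all other terms are left out
    tdm <- TermDocumentMatrix(
      corp_small,
      control = list(dictionary = keep_terms, removePunctuation = TRUE)
    )

    inspect(tdm)
    ```

    Unlike the regular-expression approach, the dictionary matches whole tokens, so a term such as "sleeping" would not be counted under "sleep" unless the corpus is stemmed first.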
    