Question
How do you print a small sample, or the first line, of a corpus in R using the tm package? I have a very large corpus (> 1 GB) and am doing some text cleaning. I would like to test as I apply each cleaning procedure. Printing just the first line, or the first few lines, of a corpus would be ideal.
# Load Libraries
library(tm)
# Read in Corpus
corp <- SimpleCorpus(DirSource("C:/TextDocument"))
# Remove punctuation (tm_map applies the transformation to every document)
corp <- tm_map(corp, removePunctuation,
               preserve_intra_word_contractions = TRUE,
               preserve_intra_word_dashes = TRUE)
I have tried accessing the corpus several ways:
# Print first line of first element of corpus
corp[[1]][[1]]
# Print first line using 'content' element of corpus
corp[[1]]$content[[1]]
Both of these result in very long run times without the desired output.
The crude corpus in the tm package can be used for example purposes.
data("crude")
Answer 1:
strwrap does this job nicely, since it formats your paragraphs by breaking lines at word boundaries. (See ?strwrap.) Then you can use the head function to see the first 6 lines.
head(strwrap(corp))
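For testing cleaning steps as you go, one option is to run them on a small subset and preview a single document. The following is only a sketch, assuming the crude example corpus mentioned in the question. The names sample_corp, first_doc and preview, the subset size and the wrap width of 60 are illustrative choices, not part of the original answer.

```r
library(tm)

# Built-in example corpus from the tm package
data("crude")

# Work on a small subset so each cleaning step can be checked quickly
# (a subset size of 2 is an arbitrary choice for illustration)
sample_corp <- crude[1:2]
sample_corp <- tm_map(sample_corp, removePunctuation,
                      preserve_intra_word_dashes = TRUE)

# content() returns the text of one document as a character vector
first_doc <- content(sample_corp[[1]])

# Wrap to 60 characters per line and keep only the first three lines
preview <- head(strwrap(first_doc, width = 60), 3)
print(preview)
```

Subsetting before cleaning means the transformation touches only a couple of documents, which should keep the check fast even when the full corpus is large.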
Source: https://stackoverflow.com/questions/49951708/print-first-line-of-one-element-of-corpus-in-r-using-tm-package