Question
I am working with unstructured text (Facebook) data and am pre-processing it (e.g., stripping punctuation, removing stop words, stemming). I need to retain the record (i.e., Facebook post) IDs while pre-processing. I have a solution that works on a subset of the data but fails on the full data set (N = 127K posts). I have tried chunking the data, and that doesn't work either. I think it has something to do with my work-around, which relies on row names. For example, it appears to work with the first ~15K posts, but when I keep subsetting, it fails. I realize my code is less than elegant, so I am happy to learn better or completely different solutions; all I care about is keeping the IDs when I go to a VCorpus and then back again. I'm new to the tm package and the readTabular function in particular. (Note: I ran tolower and removeWords before making the VCorpus because I originally thought that was part of the issue.)
Working code is below (it uses the tm package, plus dplyr for %>% and select):
Sample data
fb = data.frame(RecordContent = c("I'm dating a celebrity! Skip to 2:02 if you, like me, don't care about the game.",
"Photo fails of this morning. Really Joe?",
"This piece has been almost two years in the making. Finally finished! I'm antsy for October to come around... >:)"),
FromRecordId = c(682245468452447, 737891849554475, 453178808037464),
stringsAsFactors = FALSE)
Remove punctuation & make lower case
fb$RC = tolower(gsub("[[:punct:]]", "", fb$RecordContent))
fb$RC2 = removeWords(fb$RC, stopwords("english"))
Step 1: Create special reader function to retain record IDs
myReader = readTabular(mapping=list(content="RC2", id="FromRecordId"))
Step 2: Make my corpus. Read in the data using DataframeSource and the custom reader function where each FB post is a "document"
corpus.test = VCorpus(DataframeSource(fb), readerControl=list(reader=myReader))
Step 3: Clean and stem
corpus.test2 = corpus.test %>%
tm_map(removeNumbers) %>%
tm_map(stripWhitespace) %>%
tm_map(stemDocument, language = "english") %>%
as.VCorpus()
Step 4: Make the corpus back into a character vector. The row names are now the IDs
fb2 = data.frame(unlist(sapply(corpus.test2, `[`, "content")), stringsAsFactors = FALSE)
Step 5: Make new ID variable for later merge, name vars, and prep for merge back onto original dataset
fb2$ID = row.names(fb2)
fb2$RC.ID = gsub(".content", "", fb2$ID, fixed = TRUE)
colnames(fb2)[1] = "RC.stem"
fb3 = select(fb2, RC.ID, RC.stem)
row.names(fb3) = NULL
Answer 1:
I think the IDs are stored and retained by default by the tm package. You can fetch them all at once (in a vectorized manner) with
meta(corpus.test, "id")
$`682245468452447`
[1] "682245468452447"
$`737891849554475`
[1] "737891849554475"
$`453178808037464`
[1] "453178808037464"
I'd recommend reading the documentation of the tm::meta() function, although it is not very good.
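As a rough sketch of how the stored IDs could replace the row-name work-around from the question, the example below reads the IDs from each document's metadata and pairs them with the processed text. It assumes a recent tm release in which, to my knowledge, DataframeSource expects the first two columns to be named doc_id and text, so readTabular is not used. The names fb3, RC.ID, and RC.clean are illustrative, the IDs are stored as character strings as a precaution, and the stop-word and stemming steps are left out for brevity.

```r
library(tm)

# Sample posts from the question. In this sketch the IDs are character strings
# and the columns are named doc_id / text (assumed DataframeSource convention).
fb <- data.frame(
  doc_id = c("682245468452447", "737891849554475", "453178808037464"),
  text = c(
    "I'm dating a celebrity! Skip to 2:02 if you, like me, don't care about the game.",
    "Photo fails of this morning. Really Joe?",
    "This piece has been almost two years in the making. Finally finished! I'm antsy for October to come around... >:)"
  ),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(fb))

# A few cleaning steps; each tm_map call keeps the per-document metadata.
corpus2 <- tm_map(corpus, content_transformer(tolower))
corpus2 <- tm_map(corpus2, removePunctuation)
corpus2 <- tm_map(corpus2, removeNumbers)
corpus2 <- tm_map(corpus2, stripWhitespace)

# Read the IDs from the document metadata rather than from row names.
ids <- unlist(meta(corpus2, "id", type = "local"), use.names = FALSE)

# Collapse each document's content into a single string.
txt <- vapply(
  corpus2,
  function(d) paste(as.character(d), collapse = " "),
  character(1),
  USE.NAMES = FALSE
)

# One row per post, ready to merge back onto the original data by ID.
fb3 <- data.frame(RC.ID = ids, RC.clean = txt, stringsAsFactors = FALSE)
```

Because the IDs and the text are extracted in the same document order, fb3 can be merged back onto the original data with merge(fb, fb3, by.x = "doc_id", by.y = "RC.ID").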
You can also add arbitrary metadata (as key-value pairs) to each collection item in the corpus, as well as collection-level metadata.
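A small sketch of what adding metadata might look like, again assuming a recent tm release where DataframeSource takes doc_id and text columns. The tags platform and project and their values are made up for illustration.

```r
library(tm)

# Two of the sample posts, shortened; doc_id / text naming is an assumption
# about the installed tm version.
docs <- data.frame(
  doc_id = c("682245468452447", "737891849554475"),
  text = c("Photo fails of this morning.", "Finally finished!"),
  stringsAsFactors = FALSE
)
corpus <- VCorpus(DataframeSource(docs))

# Document-level key-value pair on a single document (hypothetical tag).
meta(corpus[[1]], "platform") <- "facebook"

# Corpus-level (collection-level) metadata (hypothetical tag).
meta(corpus, "project", type = "corpus") <- "fb-preprocessing"

# Read the values back.
doc_platform <- meta(corpus[[1]], "platform")
corpus_project <- meta(corpus, "project", type = "corpus")
doc_id_kept <- meta(corpus[[1]], "id")
```

The document ID set at read time sits alongside the custom tag, so adding metadata this way should not disturb it.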
Source: https://stackoverflow.com/questions/43484060/retaining-unique-identifiers-e-g-record-id-when-using-tm-functions-doesnt