R tm: reloading a 'PCorpus' backend filehash database as corpus (e.g. in restarted session/script)

Submitted by 心不动则不痛 on 2019-12-10 15:58:14

Question


Having learned loads from answers on this site (thanks!), it's finally time to ask my own question.

I'm using R (tm and lsa packages) to create, clean and simplify, and then run LSA (latent semantic analysis) on, a corpus of about 15,000 text documents. I'm doing this in R 3.0.0 under Mac OS X 10.6.

For efficiency (and to cope with having too little RAM), I've been trying to use either the 'PCorpus' option in tm (backend database support provided by the 'filehash' package), or the newer 'tm.plugin.dc' option for so-called 'distributed' corpus processing. But I don't really understand how either one works under the bonnet.

An apparent bug using DCorpus with tm_map (not relevant right now) led me to do some of the preprocessing work with the PCorpus option instead. And it takes hours. So I use R CMD BATCH to run a script doing things like:

> # load corpus from predefined directory path,
> # and create backend database to support processing:
> bigCcorp = PCorpus(bigCdir, readerControl = list(load=FALSE), dbControl = list(useDb = TRUE, dbName = "bigCdb", dbType = "DB1"))

> # converting to lower case:
> bigCcorp = tm_map(bigCcorp, tolower)

> # removing stopwords:
> stoppedCcorp = tm_map(bigCcorp, removeWords, stoplist)

Now, supposing my script crashes soon after this point, or I just forget to export the corpus in some other form, and then I restart R. The database is still there on my hard drive, full of nicely tidied-up data. Surely I can reload it back into the new R session, to carry on with the corpus processing, instead of starting all over again?

It feels like a noodle question... but no amount of dbInit() or dbLoad() or variations on the 'PCorpus()' function seem to work. Does anyone know the correct incantation?

I've scoured all the related documentation, and every paper and web forum I can find, but total blank - nobody seems to have done it. Or have I missed it?


Answer 1:


The original question is from 2013. In February 2015 a duplicate, or at least similar, question was answered:

How to reconnect to the PCorpus in the R tm package?. The answer in that post is essential, although pretty minimalist, so I'll try to augment it here.

These are some comments I've just discovered while working on a similar problem:

Note that the dbInit() function is not part of the tm package.
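For the original question's 'DB1' database, reconnecting should look roughly like this. This is a sketch, not something I've run against the asker's data: the database name "bigCdb" is taken from the question, and dbLoad() is assumed to behave as in the SQLite example further down (it binds the stored documents lazily and returns the key names).

```r
# Reconnect to the DB1 filehash database that PCorpus() created.
# dbInit() only opens the existing file; nothing is read into memory yet.
library(filehash)

db <- dbInit("bigCdb", type = "DB1")

# dbLoad() creates active bindings for every key in the calling
# environment and returns the key names; individual documents can
# then be pulled on demand with dbFetch(db, key).
keys <- dbLoad(db)
doc1 <- dbFetch(db, keys[1])
```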

First you need to install the filehash package, which the tm documentation only "suggests" installing. This means it is not a hard dependency of tm.

Supposedly, you can also use the filehashSQLite package with library("filehashSQLite") instead of library("filehash"); both packages have the same interface and work seamlessly together, due to their object-oriented design. So also install "filehashSQLite" (edit 2016: some functions, such as tm::content_transformer(), are not implemented for filehashSQLite).
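Both backends are ordinary CRAN packages, so installation is the usual one-liner:

```r
# Neither package is a hard dependency of tm, so install them explicitly.
install.packages(c("filehash", "filehashSQLite"))
```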

Then this works:

library(filehashSQLite)

# this string becomes the filename and must not contain dots;
# e.g. "mydata.sqlite" is not permitted
s <- "sqldb_pcorpus_mydata"  # replace "mydata" with something more descriptive

if (!file.exists(s)) {
        # csv is a data frame of 900 documents, 18 cols/features
        pc <- PCorpus(DataframeSource(csv),
                      readerControl = list(language = "en"),
                      dbControl = list(dbName = s, dbType = "SQLite"))
        dbCreate(s, "SQLite")
        db <- dbInit(s, "SQLite")
        set.seed(234)
        # add another record, just to show we can:
        # key = "test", value = "hi there"
        dbInsert(db, "test", "hi there")
} else {
        db <- dbInit(s, "SQLite")
        pc <- dbLoad(db)
}



show(pc)
# <<PCorpus>>
# Metadata:  corpus specific: 0, document level (indexed): 0
# Content:  documents: 900
dbFetch(db, "test")
# remove it
rm(db)
rm(pc)

# reload it
db <- dbInit(s, "SQLite")
pc <- dbLoad(db) 

# the corpus entries are now accessible, but not loaded into memory.
# now 900 documents are bound via "Active Bindings", created by makeActiveBinding() from the base package
show(pc)
# [1]   "1"    "2"    "3"    "4"    "5"    "6"    "7"    "8"    "9"    "10"
# ...
# [891] "891"  "892"  "893"  "894"  "895"  "896"  "897"  "898"  "899"  "900"
# [901] "test"

dbFetch(db, "900")
# <<PlainTextDocument>>
#         Metadata:  7
# Content:  chars: 33

dbFetch(db, "test")
#[1] "hi there"
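The "active bindings" mentioned in the comments above can be illustrated with base R alone. This is a self-contained toy example, unrelated to tm or filehash, just to show the mechanism:

```r
# makeActiveBinding() installs a variable whose value is computed
# by a function each time the variable is read. filehash uses this
# mechanism to expose database keys as variables without loading
# all of their values into memory up front.
store <- list(doc900 = "lazy document text")

makeActiveBinding(
  "doc900",
  function() store[["doc900"]],  # runs on every read of `doc900`
  environment()
)

doc900
# [1] "lazy document text"
```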

This is what the database backend looks like. You can see that the documents from the data frame have been encoded somehow, inside the SQLite table.
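To peek at the stored keys without going through tm at all, the generic filehash interface can be used directly. A sketch, assuming the "sqldb_pcorpus_mydata" database created in the code above already exists on disk:

```r
library(filehashSQLite)

# open the existing database file created earlier
db <- dbInit("sqldb_pcorpus_mydata", "SQLite")

dbList(db)             # all keys: "1" ... "900" plus "test"
dbExists(db, "test")   # TRUE if the extra record is present
dbFetch(db, "test")    # retrieves the stored value
```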

This is what my RStudio IDE shows me:

[screenshot of the SQLite table in RStudio omitted]

Source: https://stackoverflow.com/questions/16823739/r-tm-reloading-a-pcorpus-backend-filehash-database-as-corpus-e-g-in-restart
