Question
Calling the read.table() function (on a CSV file), as follows:
download.file(url, destfile = file, mode = "w")
conn <- gzcon(bzfile(file, open = "r"))
try(fileData <- read.table(conn, sep = ",", row.names = NULL), silent = FALSE)
produces the following error:
Error in pushBack(c(lines, lines), file) :
can only push back on text-mode connections
I tried to "wrap" the connection explicitly with tConn <- textConnection(readLines(conn)) (and then, of course, passing tConn instead of conn to read.table()), but it triggered extreme slowness in code execution and, eventually, hanging of the R process (I had to restart R).
UPDATE (That shows once again how useful it is to try to explain your problems to other people!):
As I was writing this, I decided to go back to the documentation and read up again on gzcon(), which I thought not only decompresses a bzip2 file, but also "labels" it as text. Then I realized that this is a ridiculous assumption: I know that there is a text (CSV) file inside the bzip2 archive, but R doesn't. Therefore, my initial attempt to use textConnection() was the right approach, but something is creating a problem. If - and it's a big IF - my logic is correct up to this point, the next question is whether the problem is due to textConnection() or to readLines().
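One way to isolate the two suspects would be to time each step separately (a rough test sketch of mine, assuming the file downloaded above; note that it drops the gzcon() wrapper, since bzfile() already decompresses bzip2 data):
conn <- bzfile(file, open = "r")
system.time(lines <- readLines(conn))        # suspect 1: readLines()
close(conn)
system.time(tConn <- textConnection(lines))  # suspect 2: textConnection()
close(tConn)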
Please advise. Thank you!
P.S. The CSV files that I'm trying to read are in an "almost" CSV format, so I can't use standard R functions for CSV processing.
===
UPDATE 1 (Program Output):
===
trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectAuthors2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 514960 bytes (502 Kb)
opened URL
==================================================
downloaded 502 Kb
trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectDependencies2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 133295 bytes (130 Kb)
opened URL
==================================================
downloaded 130 Kb
trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectDescriptions2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 5404286 bytes (5.2 Mb)
opened URL
==================================================
downloaded 5.2 Mb
===
UPDATE 2 (Program Output):
===
After a very long time, I get the following message, and then the program continues processing the rest of the files:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 8 elements
Then the situation repeats itself: after processing several smaller (less than 1 MB) files, the program "freezes" while processing a larger (> 1 MB) file:
trying URL 'http://flossdata.syr.edu/data/fc/2013/2013-Dec/fcProjectTags2013-Dec.txt.bz2'
Content type 'application/x-bzip2' length 1226391 bytes (1.2 Mb)
opened URL
==================================================
downloaded 1.2 Mb
===
UPDATE 3 (Program Output):
===
After giving the program more time to run, I discovered the following:
*) My assumption that a file size of ~1 MB plays a role in the weird behavior was wrong. This is based on the fact that the program successfully processed files larger than 1 MB and could not process files smaller than 1 MB. Here is an example of output with errors:
trying URL 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectInfo2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 826288 bytes (806 Kb)
opened URL
==================================================
downloaded 806 Kb
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 4 elements
In addition: Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
Here is an example of errors while processing a very small file:
trying URL 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectRequirements2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 3092 bytes
opened URL
==================================================
downloaded 3092 bytes
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 2 did not have 2 elements
From the above examples, it is clear that size is not the factor, but file structure might be.
*) I wrongly reported the maximum file size earlier; it's 54.2 MB compressed. This is the file whose processing not only generates error messages, but actually triggers an unrecoverable error and stops (exits):
trying URL 'http://flossdata.syr.edu/data/gc/2012/2012-Nov/gcProjectInfo2012-Nov.txt.bz2'
Content type 'application/x-bzip2' length 56793796 bytes (54.2 Mb)
opened URL
=================================================
downloaded 54.2 Mb
Error in textConnection(readLines(conn)) :
cannot allocate memory for text connection
*) After the emergency exit, five R processes use 51% of memory each, while after a manual R restart this number stays at 7% (data per the htop report).
Even considering the possibility of a "very bad" text/CSV format (suggested by the "Error in scan()" messages), the behavior of the standard R functions textConnection() and/or readLines() looks very strange, even "suspicious", to me. My understanding is that a good function should handle erroneous input data gracefully, allowing a very limited time or number of retries and then continuing processing if possible, or exiting when further processing is impossible. In this case we can see (via the defect ticket screenshot) that the R process is taxing both the memory and the processor of the virtual machine.
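For illustration, a defensive wrapper along these lines would at least skip malformed files instead of stopping the whole run (the quote = "" and fill = TRUE arguments are guesses at the "almost CSV" format, suggested by the error messages above):
safeRead <- function(path) {
  tryCatch({
    conn <- bzfile(path, open = "r")
    on.exit(close(conn))
    # quote = "" treats stray quote characters literally, and fill = TRUE
    # pads short rows instead of failing with "line N did not have M elements"
    read.table(conn, sep = "\t", header = TRUE,
               quote = "", comment.char = "", fill = TRUE)
  }, error = function(e) {
    message("Skipping ", path, ": ", conditionMessage(e))
    NULL
  })
}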
Answer 1:
When this has happened to me in the past, I have gotten better performance by not using textConnection(). Instead, if I have to do some preprocessing with readLines(), I then write the data to a temporary file and use that file as input to read.table().
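For example, a minimal sketch of that approach (assuming the same file variable as in the question, and tab-delimited content, which isn't guaranteed):
conn <- bzfile(file, open = "r")
lines <- readLines(conn)               # do any line-level preprocessing here
close(conn)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)                 # write the cleaned text to a real file
fileData <- read.table(tmp, sep = "\t", header = TRUE)
unlink(tmp)                            # remove the temporary file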
Answer 2:
You don't have CSV files. I only looked at one of them (yes, I actually had a look in a text editor), but they seem to be tab-delimited.
url <- 'http://flossdata.syr.edu/data/fsf/2012/2012-Nov/fsfProjectRequirements2012-Nov.txt.bz2'
file <- "temp.txt.bz2"
download.file(url, destfile = file, mode = "wb")  # "wb": the .bz2 archive is binary
dat <- bzfile(file, open = "r")
DF <- read.table(dat, header=TRUE, sep="\t")
close(dat)
head(DF)
# proj_num proj_unixname requirement requirement_type date_collected datasource_id
# 1 14 A2ps E-mail Help,Support 2012-11-02 10:57:40 346
# 2 99 Acct E-mail Bug Tracking 2012-11-02 10:57:40 346
# 3 128 Adns VCS Repository Webview Developer 2012-11-02 10:57:40 346
# 4 128 Adns E-mail Help 2012-11-02 10:57:40 346
# 5 196 AmaroK VCS Repository Webview Bug Tracking 2012-11-02 10:57:40 346
# 6 196 AmaroK Mailing List Info/Archive Bug Tracking,Developer 2012-11-02 10:57:40 346
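If you want to verify the delimiter yourself, a quick check is to split the first decompressed line on tabs:
dat <- bzfile(file, open = "r")
strsplit(readLines(dat, n = 1), "\t")  # header fields should split cleanly
close(dat)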
Source: https://stackoverflow.com/questions/21809092/extremely-slow-r-code-and-hanging