Parse multiple XBRL files stored in a zip file

半世苍凉 submitted on 2020-01-12 10:46:33

Question


I have downloaded multiple zip files from a website. Each zip file contains many files with .html and .xml extensions (roughly 100K files per archive).

It is possible to extract the files manually and then parse them. However, I would like to do this within R, if possible.

Example (sorry, the file is rather large), using code from a previous question to download one zip file:

library(XML)

pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
doc <- htmlParse(pth)

myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs][[1]]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles) [[1]]

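# Create temp/ together with the cache directory temp/hmrcCache
dir.create(file.path("temp", "hmrcCache"), recursive = TRUE)
# mode = "wb" keeps the binary zip file intact on Windows
download.file(fileURLS, destfile = file.path("temp", myfiles), mode = "wb")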

I can parse the files using the XBRL package if I extract them manually. This can be done as follows:

library(XBRL)
inst <- file.path("temp", "Prod224_0004_00000121_20130630.html")
out <- xbrlDoAll(inst, cache.dir="temp/hmrcCache", prefix.out=NULL, verbose=TRUE)

I am struggling with how to extract these files from the zip file and parse each one (say, in a loop) using R, without extracting them manually. I made a start, but I don't know how to progress from here. Thanks for any advice.

# Get names of files
lst <- unzip(file.path("temp", myfiles), list=TRUE)
dim(lst) # 118626

# Open a connection to the first file inside the zip (nothing is extracted yet)
nms <- lst$Name[1] # Prod224_0004_00000121_20130630.html
lst2 <- unz(file.path("temp", myfiles), filename=nms)

I am using Windows 8.1

R version 3.1.2 (2014-10-31)

Platform: x86_64-w64-mingw32/x64 (64-bit)


Answer 1:


Using the suggestion from Karsten in the comments, I unzipped the files to a temporary directory, and then parsed each file. I used the snow package to speed things up.

  # Parse one zip file to start
  library(snow)
  fls <- list.files("temp", pattern = "\\.zip$")[[1]]

  # Unzip into the session's temporary directory
  tmp <- tempdir()
  lst <- unzip(file.path("temp", fls), exdir = tmp)

  # Only parse the first 10 records
  inst <- lst[1:10]

  # Start one worker per core and load XBRL on each worker
  cl <- makeCluster(parallel::detectCores())
  clusterCall(cl, function() library(XBRL))

  # Time the parse
  st <- Sys.time()

  out <- parLapply(cl, inst, function(i)
                                  xbrlDoAll(i,
                                            cache.dir = "temp/hmrcCache",
                                            prefix.out = NULL, verbose = TRUE))

  stopCluster(cl)

  Sys.time() - st
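
With more than 100K files in each archive, it may not be necessary to extract everything at once: `unzip()` accepts a `files` argument, so members can be extracted and parsed a batch at a time. The following is only a sketch under assumptions. The helper names `make_batches` and `parse_zip_in_batches` are hypothetical (they are not from the original post), and `parse_fun` stands in for whatever parser is used, for example a wrapper around `xbrlDoAll`.

```r
# Split a vector into consecutive chunks of at most `size` elements.
# (Hypothetical helper, introduced here for illustration only.)
make_batches <- function(x, size) {
  if (length(x) == 0L) return(list())
  unname(split(x, ceiling(seq_along(x) / size)))
}

# Extract and parse the members of `zip_path` one batch at a time.
# `parse_fun` is a placeholder for the real parser; it receives one
# extracted file path and must return its result in memory.
parse_zip_in_batches <- function(zip_path, parse_fun, size = 1000L,
                                 exdir = file.path(tempdir(), "xbrl_batch")) {
  members <- unzip(zip_path, list = TRUE)$Name
  results <- list()
  for (batch in make_batches(members, size)) {
    # Extract only the members named in this batch
    paths <- unzip(zip_path, files = batch, exdir = exdir)
    results <- c(results, lapply(paths, parse_fun))
    # Drop the extracted files before the next batch is unpacked
    unlink(exdir, recursive = TRUE)
  }
  results
}
```

Because each batch is deleted after parsing, disk usage stays bounded by the batch size rather than by the size of the whole archive. The `lapply` call could presumably be swapped for `parLapply` with a cluster, as in the code above.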

(I am not sure that I am using tempdir() correctly, as this seems to save large amounts of data to the Local\Temp directory. I would welcome comments if I have approached this incorrectly, thanks.)
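
On the `tempdir()` concern: `tempdir()` returns the per-session temporary directory, which R removes when the session ends normally, so anything extracted there stays on disk until then. One possible way to keep this in check is to extract into a dedicated scratch directory and delete it as soon as the parsed results are in memory. The sketch below assumes a hypothetical helper named `with_scratch_dir`; it is an illustration, not part of the original answer.

```r
# Run `fun(dir)` with a fresh scratch directory and remove that directory
# afterwards, even if `fun` stops with an error.
# (`with_scratch_dir` is a hypothetical helper name.)
with_scratch_dir <- function(fun) {
  dir <- tempfile(pattern = "xbrl_")  # unique path inside tempdir()
  dir.create(dir)
  on.exit(unlink(dir, recursive = TRUE), add = TRUE)
  fun(dir)
}

# Usage sketch (assumes `zipfile` points at a downloaded archive):
# first_names <- with_scratch_dir(function(dir) {
#   paths <- unzip(zipfile, exdir = dir)
#   basename(paths[1:10])
# })
```

Whatever `fun` returns has to be held in memory, since the extracted files are gone once the call finishes.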



Source: https://stackoverflow.com/questions/29930936/parse-multiple-xbrl-files-stored-in-a-zip-file
