Question
TL;DR: How do I use Sys.glob() within unzip()?
I have multiple .zip files and I want to extract only one file from each archive.
For example, one of the archives contains the following files:
[1] "cmc-20150531.xml" "cmc-20150531.xsd" "cmc-20150531_cal.xml" "cmc-20150531_def.xml" "cmc-20150531_lab.xml"
[6] "cmc-20150531_pre.xml"
I want to extract the first file because it matches a pattern. In order to do that I use the following command:
unzip("zip-archive.zip", files=Sys.glob("[a-z][a-z][a-z][-][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][.][x][m][l]"))
However, the command doesn't work, and I don't know why. R just extracts all files in the archive.
On the other hand, the following command works:
unzip("zip-archive.zip", files="cmc-20150531.xml")
How do I use Sys.glob() within unzip()?
Answer 1:
Sys.glob expands wildcards against files that already exist on disk. So the argument passed to your unzip call depends on which files happen to be in your working directory, not on what is inside the archive.
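To make that concrete, here is a small sketch that runs the glob from the question in an empty temporary directory, and then again after a matching file has been created there. The directory and the variable names are made up for illustration. If the glob expands to nothing, `files` ends up as an empty vector, which would be consistent with the behaviour reported in the question (everything gets extracted), though that last step is an inference rather than something tested here.

```r
# Work in a fresh temporary directory so the result does not depend on local files.
tmp <- tempfile("globdemo")
dir.create(tmp)
old <- setwd(tmp)

pattern <- "[a-z][a-z][a-z][-][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][.][x][m][l]"

# Nothing on disk matches yet, so the glob expands to an empty character vector.
before <- Sys.glob(pattern)

# Once a matching file exists on disk, the same glob finds it.
file.create("cmc-20150531.xml")
after <- Sys.glob(pattern)

setwd(old)  # restore the original working directory
print(before)
print(after)
```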
Perhaps you want to call unzip with list=TRUE to return the list of files in the zip first, and then use some pattern matching to select the files you want.
See ?grep for info on matching strings with patterns. These patterns are "regular expressions" rather than "glob" expansions, but you should be able to work with that.
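If you would rather keep writing simple globs, one option is `glob2rx()` from the utils package, which turns a wildcard pattern using `*` and `?` into a regular expression. The sketch below is an illustration with a hard-coded vector of member names taken from the question. Note that `glob2rx()` does not handle bracket expressions such as `[0-9]`, so the glob is written with `?` instead.

```r
# Member names as they might come back from unzip(..., list = TRUE)$Name
members <- c("cmc-20150531.xml", "cmc-20150531.xsd", "cmc-20150531_cal.xml")

# Convert a simple glob (three letters fixed, eight arbitrary characters) to a regex
rx <- glob2rx("cmc-????????.xml")

# Keep only the member names that match the converted pattern
selected <- grep(rx, members, value = TRUE)
print(selected)
```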
Here's a concrete example:
# what's in the zip?
files = unzip("c.zip", list=TRUE)$Name
files
[1] "l_spatial.dbf" "l_spatial.shp" "l_spatial.shx" "ls_polys_bin.dbf"
[5] "ls_polys_bin.shp" "ls_polys_bin.shx" "rast_jan90.tif"
# what files have "dbf" in them:
files[grepl("dbf",files)]
[1] "l_spatial.dbf" "ls_polys_bin.dbf"
# extract just those:
unzip("c.zip", files=files[grepl("dbf",files)])
The regular expression for your glob
"[a-z][a-z][a-z][-][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][.][x][m][l]"
would be
"^[a-z]{3}-[0-9]{8}\\.xml$"
That matches the start of the string ("^"), three lower-case letters a-z, a dash, eight digits, a literal dot (two backslashes are needed: one because a dot means "any one character" in regexps, and another because R needs a backslash to escape a backslash in a string literal), "xml", and the end of the string ("$").
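As a quick check, the regular expression can be tried against the file names listed in the question before touching any archive. The vector below is typed in by hand for illustration; in practice it would come from unzip(..., list=TRUE)$Name.

```r
# File names copied from the archive listing in the question
members <- c("cmc-20150531.xml", "cmc-20150531.xsd", "cmc-20150531_cal.xml",
             "cmc-20150531_def.xml", "cmc-20150531_lab.xml", "cmc-20150531_pre.xml")

pattern <- "^[a-z]{3}-[0-9]{8}\\.xml$"

# value = TRUE returns the matching names rather than their positions
matched <- grep(pattern, members, value = TRUE)
print(matched)
```

The resulting vector is what you would then pass as the files argument of unzip.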
Answer 2:
As with any other collection, loop over the results from Sys.glob and pass each item to unzip. A for-loop does this.
unzip() takes the path of the zip file as its first argument, and files is the argument that names which files within that zip to extract.
Mind you, I'm more of a full-stack programmer and not so familiar with R, but the concepts are the same, so the code should look something like:
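zips <- Sys.glob("*.zip")  # zip archives in the working directory
for (zip in zips) {
  # list the members and filter by name rather than passing a wildcard to unzip()
  members <- unzip(zip, list = TRUE)$Name
  xml <- grep("\\.xml$", members, value = TRUE)
  if (length(xml) > 0) unzip(zip, files = xml)  # skip archives with no match
}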
As for using a regex in unzip(), that is something to check in the documentation. I can only advise a second for-loop that compares the contents of the zip file to your regex and then performs the extraction. A sketch follows:
zips <- Sys.glob("*.zip")
regex <- "^[a-z]{3}-[0-9]{8}\\.xml$"
for (zip in zips) {
  # list the archive's contents without extracting anything
  content <- unzip(zip, list = TRUE)$Name
  for (name in content) {
    if (grepl(regex, name)) {
      # do something with the found file, e.g. extract only that one
      unzip(zip, files = name)
    }
  }
}
Basically you're just looping through your list of zip files, then through the list of files within each zip file, comparing each filename inside the archive against the regex, and extracting or performing operations on only the matching file.
Source: https://stackoverflow.com/questions/31146263/sys-glob-within-unzip