Question:
How can I get a sample of a given size from a large XML file in R?
Unlike sampling random lines, which is simple, here the structure of the XML file must be preserved so that R can read it into a proper data.frame.
A possible solution is to read the whole file and then sample rows, but is it possible to read only the necessary chunks?
A sample from the file:
<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
<sku>967190</sku>
<productId>98611</productId>
...
<listingId/>
<sellerId/>
<shippingRestrictions/>
</product>
...
The number of lines per "product" varies, and the total number of records is unknown before opening the file.
Answer 1:
Instead of reading the entire file in, it's possible to use event parsing with a closure that handles the nodes you're interested in. To get there, I'll start with a strategy for random sampling from a file (reservoir sampling). Process records one at a time. If the index i of the current record is less than or equal to the number n of records to keep, store it; otherwise, with probability n / i, store it in place of a randomly chosen record kept so far. This could be implemented as
i <- 0L; n <- 10L
select <- function() {
    i <<- i + 1L
    if (i <= n)
        i
    else {
        if (runif(1) < n / i)
            sample(n, 1)
        else
            0
    }
}
which behaves like this:
> i <- 0L; n <- 10L; replicate(20, select())
[1] 1 2 3 4 5 6 7 8 9 10 1 5 7 0 1 9 0 2 1 0
This tells us to keep the first 10 elements, then we replace element 1 with element 11, element 5 with element 12, element 7 with element 13, then drop the 14th element, etc. Replacements become less frequent as i becomes much larger than n.
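As a quick sanity check (a sketch of my own, not part of the original answer), one can run the selector over a stream of 100 records many times and count how often each record survives; with n = 10, every record should be kept in roughly 10% of the runs:
counts <- integer(100)
for (run in seq_len(1000)) {
    i <- 0L                        # reset the stream counter used by select()
    keep <- integer(n)             # slots holding the currently kept record indices
    for (record in seq_len(100)) {
        j <- select()
        if (j)
            keep[[j]] <- record
    }
    counts[keep] <- counts[keep] + 1L
}
summary(counts / 1000)             # values should cluster around n / 100 = 0.1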
We use this as part of a 'product' handler, which pre-allocates space for the results we're interested in; then, each time a 'product' node is encountered, we test whether to select it and, if so, add it to our current results at the appropriate location:
sku <- character(n)
product <- function(p) {
    i <- select()
    if (i)
        sku[[i]] <<- xmlValue(p[["sku"]])
    NULL
}
The 'select' and 'product' handlers are combined with a function (get) that allows us to retrieve the current values, and they're all placed in a closure so that we have a kind of factory pattern that encapsulates the variables n, i, and sku:
sampler <- function(n)
{
    force(n)  # otherwise lazy evaluation could lead to surprises
    i <- 0L
    select <- function() {
        i <<- i + 1L
        if (i <= n) {
            i
        } else {
            if (runif(1) < n / i)
                sample(n, 1)
            else
                0
        }
    }

    sku <- character(n)
    product <- function(p) {
        i <- select()
        if (i)
            sku[[i]] <<- xmlValue(p[["sku"]])
        NULL
    }

    list(product=product, get=function() list(sku=sku))
}
And then we're ready to go. When a list of handlers is supplied, xmlTreeParse returns that list, so the values accumulated in the closure can be retrieved with get():
library(XML)
products <- xmlTreeParse("foo.xml", handlers=sampler(1000))
as.data.frame(products$get())
Once the number of nodes processed, i, gets large relative to n, this will scale linearly with the size of the file, so you can get a sense of whether it performs well enough by starting with subsets of the original file.
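As a rough way to do that (a minimal sketch, not from the original answer; "foo_small.xml" stands for a hypothetical truncated copy of the real file), one can simply time the parse on the smaller file and extrapolate:
library(XML)
h <- sampler(1000)                                # fresh closure for this run
# time the handler-driven parse on the truncated copy (hypothetical file name)
system.time(xmlTreeParse("foo_small.xml", handlers = h))
str(h$get())                                      # peek at the skus collected so far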
Answer 2:
Here's an example based on the XML file you provided.
xml <- '<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
<sku>967190</sku>
<productId>98611</productId>
<listingId/>
<sellerId/>
<shippingRestrictions/>
</product>
<product>
<sku>967191</sku>
<productId>98612</productId>
<listingId/>
<sellerId/>
<shippingRestrictions/>
</product>
<product>
<sku>967192</sku>
<productId>98613</productId>
<listingId/>
<sellerId/>
<shippingRestrictions/>
</product>
</products>
'
library(XML)
# parse the XML string
p <- xmlParse(xml)
# get the 'product' nodes
nodes <- xpathApply(p, '//product')
# return a random sample of nodes
nodes[sample(seq_along(nodes), 2)]
Here's the result:
> nodes[sample(seq_along(nodes), 2)]
[[1]]
<product>
<sku>967191</sku>
<productId>98612</productId>
<listingId/>
<sellerId/>
<shippingRestrictions/>
</product>
[[2]]
<product>
<sku>967190</sku>
<productId>98611</productId>
<listingId/>
<sellerId/>
<shippingRestrictions/>
</product>
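Since the question ultimately wants a data.frame, the sampled nodes can be handed to xmlToDataFrame() from the same XML package, which accepts a list of nodes via its nodes argument. A minimal sketch, reusing the nodes object from above:
sampled <- nodes[sample(seq_along(nodes), 2)]
# build a data.frame with one row per sampled <product>
xmlToDataFrame(nodes = sampled, stringsAsFactors = FALSE)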
Source: https://stackoverflow.com/questions/20719555/random-sampling-from-xml-file-into-data-frame-in-r