Question
I am using the R programming language. I want to learn how to measure and plot the run time of different procedures as the size of the data increases.
I found a previous Stack Overflow post that answers a similar question: Plot the run time of three functions.
It seems that the "microbenchmark" library in R should be able to accomplish this task.
Suppose I simulate the following data:
#load libraries
library(microbenchmark)
library(dplyr)
library(ggplot2)
library(Rtsne)
library(cluster)
library(dbscan)
library(plotly)
#simulate data
var_1 <- rnorm(1000, 1, 4)
var_2 <- rnorm(1000, 10, 5)
var_3 <- sample(LETTERS[1:4], 1000, replace = TRUE, prob = c(0.1, 0.2, 0.65, 0.05))
var_4 <- sample(LETTERS[1:2], 1000, replace = TRUE, prob = c(0.4, 0.6))
#put them into a data frame called "f"
f <- data.frame(var_1, var_2, var_3, var_4)
#declare var_3 and var_4 as factors
f$var_3 <- as.factor(f$var_3)
f$var_4 <- as.factor(f$var_4)
#add id
f$ID <- seq_along(f[,1])
Now, I want to measure the run time of 7 different procedures:
#Procedure 1
gower_dist <- daisy(f[, -5], metric = "gower")
gower_mat <- as.matrix(gower_dist)
#Procedure 2
lof <- lof(gower_dist, k=3)
#Procedure 3
lof <- lof(gower_dist, k=5)
#Procedure 4
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(name = f$ID)
#Procedure 5
tsne_obj <- Rtsne(gower_dist, perplexity = 10, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(name = f$ID)
#Procedure 6
plot <- ggplot(aes(x = X, y = Y), data = tsne_data) + geom_point()
#Procedure 7
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(name = f$ID,
         lof = lof,
         var1 = f$var_1,
         var2 = f$var_2,
         var3 = f$var_3)
p1 <- ggplot(aes(x = X, y = Y, size = lof, key = name, var1 = var1,
                 var2 = var2, var3 = var3), data = tsne_data) +
  geom_point(shape = 1, col = "red") +
  theme_minimal()
ggplotly(p1, tooltip = c("lof", "name", "var1", "var2", "var3"))
Using the "microbenchmark" library, I can find out the time of individual functions:
procedure_1_part_1 <- microbenchmark(daisy(f[, -5], metric = "gower"))
procedure_1_part_2 <- microbenchmark(as.matrix(gower_dist))
I want to make a graph of the run times like this:
https://umap-learn.readthedocs.io/en/latest/benchmarking.html
Question: Can someone please show me how to make this graph and use the microbenchmark statement for multiple functions at once, for different sizes of the data frame "f" (e.g. the first 5, 10, 50, 100, 200, 500, and 1000 rows)?
microbenchmark(cbind(gower_dist <- daisy(f[1:5, -5], metric = "gower"), gower_mat <- as.matrix(gower_dist)))
microbenchmark(cbind(gower_dist <- daisy(f[1:10, -5], metric = "gower"), gower_mat <- as.matrix(gower_dist)))
microbenchmark(cbind(gower_dist <- daisy(f[1:50, -5], metric = "gower"), gower_mat <- as.matrix(gower_dist)))
etc
There does not seem to be a straightforward way to do this in R:
mean(procedure_1_part_1)
[1] NA
Warning message:
In mean.default(procedure_1_part_1) :
argument is not numeric or logical: returning NA
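Note: the NA arises because mean() is called on the whole microbenchmark object, which is a data frame; the raw timings live in its time column, in nanoseconds. A minimal sketch of extracting a mean run time directly:
#mean run time in milliseconds, from the raw nanosecond timings
mean(procedure_1_part_1$time) / 1e6
#or via the summary() method
summary(procedure_1_part_1)$mean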
I could manually run each one of these, copy the results into Excel, and plot them, but this would also take a long time.
tm <- microbenchmark(daisy(f[, -5], metric = "gower"),
                     as.matrix(gower_dist))
tm
Unit: microseconds
                             expr    min     lq     mean  median      uq    max neval cld
 daisy(f[, -5], metric = "gower") 2071.9 2491.4 3144.921 3563.65 3621.00 4727.8   100   b
            as.matrix(gower_dist)  129.3  147.5  194.709  180.80  232.45  414.2   100  a
Is there a quicker way to make a graph?
Thanks
Answer 1:
Here is a solution that benchmarks the first three procedures from the original post and then charts their average run times with ggplot().
Setup
We start the process by executing the code necessary to create the data from the original post.
library(dplyr)
library(ggplot2)
library(Rtsne)
library(cluster)
library(dbscan)
library(plotly)
library(microbenchmark)
#simulate data
var_1 <- rnorm(1000, 1, 4)
var_2 <- rnorm(1000, 10, 5)
var_3 <- sample(LETTERS[1:4], 1000, replace = TRUE, prob = c(0.1, 0.2, 0.65, 0.05))
var_4 <- sample(LETTERS[1:2], 1000, replace = TRUE, prob = c(0.4, 0.6))
#put them into a data frame called "f"
f <- data.frame(var_1, var_2, var_3, var_4, ID = 1:1000)
#declare var_3 and var_4 as factors
f$var_3 <- as.factor(f$var_3)
f$var_4 <- as.factor(f$var_4)
Automation of the benchmarking process by data frame size
First, we create a vector of data frame sizes to drive the benchmarking.
# configure run sizes
sizes <- c(5, 10, 50, 100, 200, 500, 1000)
Next, we take the first procedure and alter it so we can vary the number of observations used from the data frame f. Note that since we need the outputs from this procedure in subsequent steps, we use assign() to write them to the global environment. We also include the number of observations in each object name so we can retrieve the objects by size in later steps.
# Procedure 1
proc1 <- function(size){
  assign(paste0("gower_dist_", size),
         daisy(f[1:size, -5], metric = "gower"),
         envir = .GlobalEnv)
  assign(paste0("gower_mat_", size),
         as.matrix(get(paste0("gower_dist_", size), envir = .GlobalEnv)),
         envir = .GlobalEnv)
}
To run the benchmark by data frame size, we use the sizes vector with lapply() and an anonymous function that executes proc1() repeatedly. We also assign the number of observations to a column called obs so we can use it in the plot.
proc1List <- lapply(sizes, function(x){
  b <- microbenchmark(proc1(x))
  b$obs <- x
  b
})
At this point we have one data frame of benchmark results per size. We combine them into a single data frame with do.call() and rbind().
proc1summary <- do.call(rbind, proc1List)
Next, we use the same process with procedures 2 and 3. Notice how we use get() with paste0() to retrieve the correct gower_dist objects by size.
#Procedure 2
proc2 <- function(size){
  lof <- lof(get(paste0("gower_dist_", size), envir = .GlobalEnv), k = 3)
}
proc2List <- lapply(sizes, function(x){
  b <- microbenchmark(proc2(x))
  b$obs <- x
  b
})
proc2summary <- do.call(rbind, proc2List)
#Procedure 3
proc3 <- function(size){
  lof <- lof(get(paste0("gower_dist_", size), envir = .GlobalEnv), k = 5)
}
Since k must be less than the number of observations, we adjust the sizes vector to start at 10 for procedure 3.
# configure run sizes
sizes <- c(10, 50, 100, 200, 500, 1000)
proc3List <- lapply(sizes, function(x){
  b <- microbenchmark(proc3(x))
  b$obs <- x
  b
})
proc3summary <- do.call(rbind, proc3List)
Having generated run time benchmarks for each of the first three procedures, we bind the summary data, aggregate to mean run times with dplyr::summarise(), and plot with ggplot(). Since microbenchmark() records times in nanoseconds, we multiply by 1e-6 to convert them to milliseconds.
do.call(rbind, list(proc1summary, proc2summary, proc3summary)) %>%
  group_by(expr, obs) %>%
  summarise(time_ms = mean(time) * 1e-6) -> proc_time
The resulting data frame has all the information we need to produce the chart: the procedure used, the number of observations in the original data frame, and the average time in milliseconds.
> head(proc_time)
# A tibble: 6 x 3
# Groups:   expr [1]
  expr       obs time_ms
  <fct>    <dbl>   <dbl>
1 proc1(x)     5   0.612
2 proc1(x)    10   0.957
3 proc1(x)    50   1.32
4 proc1(x)   100   2.53
5 proc1(x)   200   5.78
6 proc1(x)   500  25.9
Finally, we use ggplot() to produce an x-y chart, grouping the lines by procedure.
ggplot(proc_time, aes(obs, time_ms, group = expr)) +
  geom_line(aes(group = expr), color = "grey80") +
  geom_point(aes(color = expr))
...and the output: a line chart of mean run time in milliseconds against the number of observations, with one line per procedure.
Since procedures 2 and 3 vary only slightly (k = 3 vs. k = 5), they are almost indistinguishable in the chart.
Conclusions
With a combination of wrapper functions and lapply(), we can generate the information needed to produce the chart requested in the original post.
The general pattern of modifications is:
- Wrap the original procedure in a function that we can use as the unit of analysis for microbenchmark(), and include a size argument
- Modify the procedure to use size as a variable where necessary
- Modify the procedure to access objects from previous steps, based on the size argument
- Modify the procedure to write its outputs with assign() and size if these are needed for subsequent procedure steps
We leave automation of benchmarking procedures 4 through 7 by data frame size, and their integration into the plot, as an interesting exercise for the original poster; a possible starting point for procedure 4 is sketched below.
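For illustration, here is a minimal, hypothetical sketch (not part of the original answer) of wrapping procedure 4 in the same pattern. It reuses the stored gower_dist_<size> objects and caps the perplexity, since Rtsne() requires 3 * perplexity < nobs - 1:
#Procedure 4 (sketch)
proc4 <- function(size){
  d <- get(paste0("gower_dist_", size), envir = .GlobalEnv)
  perp <- min(30, floor((size - 2) / 3))  # keep 3 * perplexity < size - 1
  tsne_obj <- Rtsne(d, is_distance = TRUE, perplexity = perp)
  tsne_data <- tsne_obj$Y %>%
    data.frame() %>%
    setNames(c("X", "Y")) %>%
    mutate(name = f$ID[1:size])
  assign(paste0("tsne_data_", size), tsne_data, envir = .GlobalEnv)
}
proc4List <- lapply(sizes, function(x){
  b <- microbenchmark(proc4(x), times = 10)  # fewer iterations; t-SNE is slow
  b$obs <- x
  b
})
proc4summary <- do.call(rbind, proc4List)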
Answer 2:
My first answer severely misunderstood your question. I hope this can be of some help.
library(tidyverse)
library(broom)
library(microbenchmark)
# Benchmark your expressions. The following script assumes you name the
# benchmarks as function_n, but this can (and should be) improved on.
res = microbenchmark(
  rnorm_100 = rnorm(100),
  runif_100 = runif(100),
  rnorm_1000 = rnorm(1000),
  runif_1000 = runif(1000)
)
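As the comment notes, the function_n naming convention is fragile. One possible improvement (a sketch, not from the original answer) is to build the named expression list programmatically and pass it to microbenchmark()'s list argument:
ns <- c(100, 1000)
exprs <- unlist(lapply(ns, function(n) {
  setNames(list(bquote(rnorm(.(n))), bquote(runif(.(n)))),
           paste0(c("rnorm_", "runif_"), n))
}), recursive = FALSE)
res <- microbenchmark(list = exprs)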
# We will be using this gist to tidy the frame
# Source: https://gist.github.com/nutterb/e9e6da4525bacac99899168b5d2f07be
tidy.microbenchmark <- function(x, unit, ...){
  summary(x, unit = unit)
}
# Tidy the frame
res_tidy = tidy(res) %>%
  mutate(expr = as.character(expr)) %>%
  separate(expr, c("func", "n"), remove = FALSE) %>%
  mutate(n = as.numeric(n))  # numeric n keeps the x-axis continuous
res_tidy
#> expr func n min lq mean median uq max neval
#> 1 rnorm_100 rnorm 100 8.112 9.3420 10.58302 10.2915 10.9755 44.903 100
#> 2 runif_100 runif 100 4.487 5.1180 6.12284 6.1990 6.5925 10.907 100
#> 3 rnorm_1000 rnorm 1000 34.631 36.3155 37.78117 37.2665 38.4510 62.951 100
#> 4 runif_1000 runif 1000 34.668 36.6330 39.48718 37.7995 39.2905 105.325 100
# Plot the run time for the different expressions by sample size
ggplot(res_tidy, aes(x = n, y = mean, group = func, col = func)) +
  geom_line() +
  geom_point() +
  labs(y = "Runtime", x = "n")
Created on 2020-12-26 by the reprex package (v0.3.0)
Source: https://stackoverflow.com/questions/65458335/r-using-microbenchmark-and-ggplot2-to-plot-runtimes