Question
I am using the R programming language. I want to learn how to measure and plot the run time of different procedures as the size of the data increases.
I found a previous Stack Overflow post that answers a similar question: Plot the run time of three functions.
It seems that the "microbenchmark" library in R should be able to accomplish this task.
Suppose I simulate the following data:
#load libraries
library(microbenchmark)
library(dplyr)
library(ggplot2)
library(Rtsne)
library(cluster)
library(dbscan)
library(plotly)
#simulate data
var_1 <- rnorm(1000, 1, 4)
var_2 <- rnorm(1000, 10, 5)
var_3 <- sample(LETTERS[1:4], 1000, replace = TRUE, prob = c(0.1, 0.2, 0.65, 0.05))
var_4 <- sample(LETTERS[1:2], 1000, replace = TRUE, prob = c(0.4, 0.6))
#put them into a data frame called "f"
f <- data.frame(var_1, var_2, var_3, var_4)
#declare var_3 and var_4 as factors
f$var_3 <- as.factor(f$var_3)
f$var_4 <- as.factor(f$var_4)
#add id
f$ID <- seq_along(f[,1])
Now, I want to measure the run time of 7 different procedures:
#Procedure 1
gower_dist <- daisy(f[, -5], metric = "gower")
gower_mat <- as.matrix(gower_dist)
#Procedure 2
lof <- lof(gower_dist, k=3)
#Procedure 3
lof <- lof(gower_dist, k=5)
#Procedure 4
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(name = f$ID)
#Procedure 5
tsne_obj <- Rtsne(gower_dist, perplexity = 10, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(name = f$ID)
#Procedure 6
plot <- ggplot(aes(x = X, y = Y), data = tsne_data) + geom_point()
#Procedure 7
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(name = f$ID,
         lof = lof,
         var1 = f$var_1,
         var2 = f$var_2,
         var3 = f$var_3)
p1 <- ggplot(aes(x = X, y = Y, size = lof, key = name, var1 = var1,
                 var2 = var2, var3 = var3), data = tsne_data) +
  geom_point(shape = 1, col = "red") +
  theme_minimal()
ggplotly(p1, tooltip = c("lof", "name", "var1", "var2", "var3"))
Using the "microbenchmark" library, I can find out the time of individual functions:
procedure_1_part_1 <- microbenchmark(daisy(f[, -5], metric = "gower"))
procedure_1_part_2 <- microbenchmark(as.matrix(gower_dist))
I want to make a graph of the run times like this:
https://umap-learn.readthedocs.io/en/latest/benchmarking.html
Question: Can someone please show me how to make this graph and use the microbenchmark statement for multiple functions at once, for different sizes of the data frame "f" (e.g. the first 5, 10, 50, 100, 200, 500, and 1000 rows)?
microbenchmark(cbind(gower_dist <- daisy(f[1:5, -5], metric = "gower"), gower_mat <- as.matrix(gower_dist)))
microbenchmark(cbind(gower_dist <- daisy(f[1:10, -5], metric = "gower"), gower_mat <- as.matrix(gower_dist)))
microbenchmark(cbind(gower_dist <- daisy(f[1:50, -5], metric = "gower"), gower_mat <- as.matrix(gower_dist)))
etc
There does not seem to be a straightforward way to do this in R:
mean(procedure_1_part_1)
[1] NA
Warning message:
In mean.default(procedure_1_part_1) :
argument is not numeric or logical: returning NA
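Note: the NA arises because mean() is called on the whole microbenchmark object, which is a data frame; the raw timings live in its time column, in nanoseconds. A minimal sketch of extracting a mean run time directly:
#mean run time in milliseconds, from the raw nanosecond timings
mean(procedure_1_part_1$time) / 1e6
#or via the summary() method
summary(procedure_1_part_1)$mean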
I could manually run each one of these, copy the results into Excel, and plot them, but this would also take a long time.
tm <- microbenchmark(daisy(f[, -5], metric = "gower"),
                     as.matrix(gower_dist))
tm
Unit: microseconds
                             expr    min     lq     mean  median      uq    max neval cld
 daisy(f[, -5], metric = "gower") 2071.9 2491.4 3144.921 3563.65 3621.00 4727.8   100   b
            as.matrix(gower_dist)  129.3  147.5  194.709  180.80  232.45  414.2   100  a
Is there a quicker way to make a graph?
Thanks
Answer 1:
Here is a solution that benchmarks the first three procedures from the original post and then charts their average run times with ggplot().
Setup
We start the process by executing the code necessary to create the data from the original post.
library(dplyr)
library(ggplot2)
library(Rtsne)
library(cluster)
library(dbscan)
library(plotly)
library(microbenchmark)
#simulate data
var_1 <- rnorm(1000, 1, 4)
var_2 <- rnorm(1000, 10, 5)
var_3 <- sample(LETTERS[1:4], 1000, replace = TRUE, prob = c(0.1, 0.2, 0.65, 0.05))
var_4 <- sample(LETTERS[1:2], 1000, replace = TRUE, prob = c(0.4, 0.6))
#put them into a data frame called "f"
f <- data.frame(var_1, var_2, var_3, var_4, ID = 1:1000)
#declare var_3 and var_4 as factors
f$var_3 <- as.factor(f$var_3)
f$var_4 <- as.factor(f$var_4)
Automation of the benchmarking process by data frame size
First, we create a vector of data frame sizes to drive the benchmarking.
# configure run sizes
sizes <- c(5, 10, 50, 100, 200, 500, 1000)
Next, we take the first procedure and alter it so we can vary the number of observations used from the data frame f. Note that since we need the outputs from this procedure in subsequent steps, we use assign() to write them to the global environment. We also include the number of observations in each object name so we can retrieve the objects by size in later steps.
# Procedure 1
proc1 <- function(size){
  assign(paste0("gower_dist_", size),
         daisy(f[1:size, -5], metric = "gower"),
         envir = .GlobalEnv)
  assign(paste0("gower_mat_", size),
         as.matrix(get(paste0("gower_dist_", size), envir = .GlobalEnv)),
         envir = .GlobalEnv)
}
To run the benchmark by data frame size, we use the sizes vector with lapply() and an anonymous function that executes proc1() repeatedly. We also assign the number of observations to a column called obs so we can use it in the plot.
proc1List <- lapply(sizes, function(x){
  b <- microbenchmark(proc1(x))
  b$obs <- x
  b
})
At this point we have one data frame of benchmark results per size. We combine them into a single data frame with do.call() and rbind().
proc1summary <- do.call(rbind, proc1List)
Next, we use the same process with procedures 2 and 3. Notice how we use get() with paste0() to retrieve the correct gower_dist objects by size.
#Procedure 2
proc2 <- function(size){
  lof <- lof(get(paste0("gower_dist_", size), envir = .GlobalEnv), k = 3)
}
proc2List <- lapply(sizes, function(x){
  b <- microbenchmark(proc2(x))
  b$obs <- x
  b
})
proc2summary <- do.call(rbind, proc2List)
#Procedure 3
proc3 <- function(size){
  lof <- lof(get(paste0("gower_dist_", size), envir = .GlobalEnv), k = 5)
}
Since k must be less than the number of observations, we adjust the sizes vector to start at 10 for procedure 3.
# configure run sizes
sizes <- c(10, 50, 100, 200, 500, 1000)
proc3List <- lapply(sizes, function(x){
  b <- microbenchmark(proc3(x))
  b$obs <- x
  b
})
proc3summary <- do.call(rbind, proc3List)
Having generated run time benchmarks for each of the first three procedures, we bind the summary data, aggregate to mean run times with dplyr::summarise(), and plot with ggplot(). Since microbenchmark() records times in nanoseconds, we multiply by 1e-6 to convert them to milliseconds.
do.call(rbind, list(proc1summary, proc2summary, proc3summary)) %>%
  group_by(expr, obs) %>%
  summarise(time_ms = mean(time) * 1e-6) -> proc_time
The resulting data frame has all the information we need to produce the chart: the procedure used, the number of observations in the original data frame, and the average time in milliseconds.
> head(proc_time)
# A tibble: 6 x 3
# Groups:   expr [1]
  expr       obs time_ms
  <fct>    <dbl>   <dbl>
1 proc1(x)     5   0.612
2 proc1(x)    10   0.957
3 proc1(x)    50   1.32
4 proc1(x)   100   2.53
5 proc1(x)   200   5.78
6 proc1(x)   500  25.9
Finally, we use ggplot() to produce an x-y chart, grouping the lines by procedure.
ggplot(proc_time, aes(obs, time_ms, group = expr)) +
  geom_line(aes(group = expr), color = "grey80") +
  geom_point(aes(color = expr))
...and the output: a line chart of mean run time in milliseconds against the number of observations, with one line per procedure.
Since procedures 2 and 3 vary only slightly (k = 3 vs. k = 5), they are almost indistinguishable in the chart.
Conclusions
With a combination of wrapper functions and lapply(), we can generate the information needed to produce the chart requested in the original post.
The general pattern of modifications is:
- Wrap the original procedure in a function that we can use as the unit of analysis for microbenchmark(), and include a size argument
- Modify the procedure to use size as a variable where necessary
- Modify the procedure to access objects from previous steps, based on the size argument
- Modify the procedure to write its outputs with assign() and size if these are needed for subsequent procedure steps
We leave automation of benchmarking procedures 4 through 7 by data frame size, and their integration into the plot, as an interesting exercise for the original poster; a possible starting point for procedure 4 is sketched below.
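For illustration, here is a minimal, hypothetical sketch (not part of the original answer) of wrapping procedure 4 in the same pattern. It reuses the stored gower_dist_<size> objects and caps the perplexity, since Rtsne() requires 3 * perplexity < nobs - 1:
#Procedure 4 (sketch)
proc4 <- function(size){
  d <- get(paste0("gower_dist_", size), envir = .GlobalEnv)
  perp <- min(30, floor((size - 2) / 3))  # keep 3 * perplexity < size - 1
  tsne_obj <- Rtsne(d, is_distance = TRUE, perplexity = perp)
  tsne_data <- tsne_obj$Y %>%
    data.frame() %>%
    setNames(c("X", "Y")) %>%
    mutate(name = f$ID[1:size])
  assign(paste0("tsne_data_", size), tsne_data, envir = .GlobalEnv)
}
proc4List <- lapply(sizes, function(x){
  b <- microbenchmark(proc4(x), times = 10)  # fewer iterations; t-SNE is slow
  b$obs <- x
  b
})
proc4summary <- do.call(rbind, proc4List)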
Answer 2:
My first answer severely misunderstood your question. I hope this can be of some help.
library(tidyverse)
library(broom)
library(microbenchmark)
# Benchmark your expressions. The following script assumes you name the
# benchmarks as function_n, but this can (and should be) improved on.
res = microbenchmark(
  rnorm_100 = rnorm(100),
  runif_100 = runif(100),
  rnorm_1000 = rnorm(1000),
  runif_1000 = runif(1000)
)
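As the comment notes, the function_n naming convention is fragile. One possible improvement (a sketch, not from the original answer) is to build the named expression list programmatically and pass it to microbenchmark()'s list argument:
ns <- c(100, 1000)
exprs <- unlist(lapply(ns, function(n) {
  setNames(list(bquote(rnorm(.(n))), bquote(runif(.(n)))),
           paste0(c("rnorm_", "runif_"), n))
}), recursive = FALSE)
res <- microbenchmark(list = exprs)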
# We will be using this gist to tidy the frame
# Source: https://gist.github.com/nutterb/e9e6da4525bacac99899168b5d2f07be
tidy.microbenchmark <- function(x, unit, ...){
  summary(x, unit = unit)
}
# Tidy the frame
res_tidy = tidy(res) %>%
  mutate(expr = as.character(expr)) %>%
  separate(expr, c("func", "n"), remove = FALSE) %>%
  mutate(n = as.numeric(n))  # numeric n keeps the x-axis continuous
res_tidy
#> expr func n min lq mean median uq max neval
#> 1 rnorm_100 rnorm 100 8.112 9.3420 10.58302 10.2915 10.9755 44.903 100
#> 2 runif_100 runif 100 4.487 5.1180 6.12284 6.1990 6.5925 10.907 100
#> 3 rnorm_1000 rnorm 1000 34.631 36.3155 37.78117 37.2665 38.4510 62.951 100
#> 4 runif_1000 runif 1000 34.668 36.6330 39.48718 37.7995 39.2905 105.325 100
# Plot the run time for the different expressions by sample size
ggplot(res_tidy, aes(x = n, y = mean, group = func, col = func)) +
  geom_line() +
  geom_point() +
  labs(y = "Runtime", x = "n")
Created on 2020-12-26 by the reprex package (v0.3.0)
Source: https://stackoverflow.com/questions/65458335/r-using-microbenchmark-and-ggplot2-to-plot-runtimes