In C# I am used to the concept of a data set and a current record. It would be easy for me to write a complicated calc-price function with conditions on the current record.
Vectorization is one of the most fundamental (and unusual) things you'll need to get used to in R. Many (most?) R operations are vectorized. But a few things aren't, and if(){}else{} is one of the non-vectorized things. It's used for control flow (whether or not to run a code block), not for vector operations.
ifelse() is a separate function that is used for vectors, where the first argument is a "test", and the 2nd and 3rd arguments are the "if yes" and "if no" results. The test is a vector, and the returned value is the appropriate yes/no result for each item in test. The result will be the same length as the test.
So we would write your IsPretty function like this:
IsPretty <- function(PetalWidth){
  return(ifelse(PetalWidth > 0.3, "Y", "N"))
}
df <- iris
df$Pretty = IsPretty(df$Petal.Width)
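The same pattern extends to the more complicated, multi-condition function mentioned in the question. As a rough sketch (the CalcPrice name, the thresholds, and the prices are all made up for illustration), nested ifelse() calls evaluate every condition for the whole column at once:

```r
# Hypothetical pricing rule, invented purely for illustration
CalcPrice <- function(PetalWidth, Species) {
  ifelse(Species == "setosa" & PetalWidth > 0.3, 10,
    ifelse(PetalWidth > 1.5, 7, 5))
}

df <- iris
df$Price <- CalcPrice(df$Petal.Width, df$Species)

length(df$Price) == nrow(df)  # one price per row
table(df$Price)               # how many rows fell into each price
```

Each condition is itself a full-length logical vector, so there is no "current record" anywhere in the function.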
Contrast this with an if(){...}else{...} block, where the test condition is of length one and arbitrary code can be run in the ... parts: it may return a bigger result than the test, or a smaller result, or no result, and it might modify other objects. You can do anything inside if(){}else{}, but the test condition must have length 1.
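When the question really is about the whole vector, one way to get a length-one condition for if() is to collapse the vector first with any() or all(). A small sketch using iris:

```r
widths <- iris$Petal.Width

# any() and all() reduce a logical vector to a single TRUE/FALSE,
# which is what if() expects
if (any(widths > 0.3)) {
  message("At least one flower is pretty")
}

if (all(widths > 0.3)) {
  message("Every flower is pretty")
} else {
  message("Some flowers are not pretty")
}
```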
You could use your IsPretty function one row at a time; it will work fine for any one row. So we could put it in a loop as below, checking one row at a time, giving if() one test at a time, and assigning results one at a time. But R is optimized for vectorization, so this will be noticeably slower, and it is a bad habit.
IsPrettyIf <- function(PetalWidth){
  if (PetalWidth > 0.3) return("Y")
  return("N")
}
for(i in 1:nrow(df)) {
  df$PrettyLoop[i] = IsPrettyIf(df$Petal.Width[i])
}
The benchmark below shows that the vectorized version is roughly 50x faster by mean time (about 80x by median). This is such a simple case and such small data that it doesn't much matter, but on larger data, or with more complex operations, the difference between vectorized and non-vectorized code can be minutes vs. days.
microbenchmark::microbenchmark(
  loop = {
    for(i in 1:nrow(df)) {
      df$PrettyLoop[i] = IsPrettyIf(df$Petal.Width[i])
    }
  },
  vectorized = {
    df$Pretty = IsPretty(df$Petal.Width)
  }
)
Unit: microseconds
       expr    min     lq     mean median      uq     max neval
       loop 3898.9 4365.6 5880.623 5442.3 7041.10 11344.6   100
 vectorized   47.7   59.6  112.288   67.4   83.85  1819.4   100
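If a function genuinely cannot be vectorized, a common middle ground is to let vapply() do the looping. This is only a sketch: it reuses IsPrettyIf from above (redefined here so the snippet runs on its own) and was not part of the benchmark, so no timing is claimed for it.

```r
IsPrettyIf <- function(PetalWidth) {
  if (PetalWidth > 0.3) return("Y")
  return("N")
}

df <- iris
# vapply() calls the scalar function once per element and checks that
# every call returns a single character value
df$PrettyApply <- vapply(df$Petal.Width, IsPrettyIf, character(1))

# Compare against the vectorized ifelse() version
identical(df$PrettyApply, ifelse(df$Petal.Width > 0.3, "Y", "N"))
```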
This is a common bump for R learners: you can find many questions on Stack Overflow where people are using if(){}else{} when they need ifelse(), or vice versa. "Why can't ifelse return vectors?" is a FAQ coming from the opposite side of the problem.
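The opposite mistake is easy to reproduce. Because ifelse() shapes its result after the test, a length-one test yields a length-one result even when the yes/no values are longer vectors. A minimal illustration with made-up values:

```r
yes <- c("a", "b", "c")
no  <- c("x", "y", "z")

# The test has length 1, so only the first element of `yes` comes back
r1 <- ifelse(TRUE, yes, no)
length(r1)

# if/else returns whatever the chosen branch evaluates to, here all of `yes`
r2 <- if (TRUE) yes else no
length(r2)
```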
df <- iris
## The condition has length equal to the number of rows in the data frame
df$Petal.Width > 0.3
#>   [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [13] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## ... truncated
## R warns us that only the first value (which happens to be FALSE) is used
## (R 4.2.0 and later stop with an error here instead of warning)
result = if(df$Petal.Width > 0.3) {"Y"} else {"N"}
#> Warning in if (df$Petal.Width > 0.3) {: the condition has length > 1 and only
#> the first element will be used
## So the result is a single "N"
result
#> [1] "N"
length(result)
#> [1] 1
## R "recycles" inputs that are of insufficient length
## so we get a full column of "N"
df$Pretty = result
head(df)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Pretty
#> 1          5.1         3.5          1.4         0.2  setosa      N
#> 2          4.9         3.0          1.4         0.2  setosa      N
#> 3          4.7         3.2          1.3         0.2  setosa      N
#> 4          4.6         3.1          1.5         0.2  setosa      N
#> 5          5.0         3.6          1.4         0.2  setosa      N
#> 6          5.4         3.9          1.7         0.4  setosa      N
Created on 2020-11-08 by the reprex package (v0.3.0)
For my own notes on Gregor's answer:
IsPrettyIf <- function(row){
  ret = "N"
  if(row$Petal.Width > 0.3) { ret = "Y" }
  return(ret)
}
df <- iris
df$PrettyLoop = "" # add a column and initialize all the cells to be empty
for(i in 1:5) {
  df$PrettyLoop[i] = IsPrettyIf(df[i,])
  cat("Row", i, "is Pretty?", df$PrettyLoop[i], "\n")
}
The bit that trips me up, thinking in spreadsheet terms, is that row$PrettyLoop is like a cell and df$PrettyLoop is like a column.
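That distinction can be checked directly. A short sketch: df[i, ] is a one-row data frame, so $ on it gives a single value, while $ on the full data frame gives the whole column.

```r
df <- iris
row <- df[1, ]            # one-row data frame

length(row$Petal.Width)   # a single "cell"
length(df$Petal.Width)    # the whole column, one value per row

# Indexing the column by position reaches the same cell
identical(row$Petal.Width, df$Petal.Width[1])
```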