When I look at the source of R packages, I see the function sweep used quite often.
Sometimes it's used when a simpler function would have sufficed (e.g.,
sweep() is typically used when you operate on a matrix by row or by column, and the other input of the operation is a different value for each row or column. Whether you operate by row or by column is defined by MARGIN, as for apply(). The values used for what I called "the other input" are defined by STATS.
So, for each row (or column), you take a value from STATS and use it in the operation defined by FUN.
For instance, if you want to add 1 to the 1st row, 2 to the 2nd, etc. of the matrix you defined, you would do:
sweep(M, 1, 1:4, "+")
Frankly, I did not understand the definition in the R documentation either; I just learned by looking up examples.
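As a minimal sketch of the row-wise and column-wise cases, the snippet below assumes a hypothetical 4 x 3 matrix of zeros in place of the M from the question, which is not shown here. The names by_row and by_col are illustrative only.

```r
# Hypothetical stand-in for the question's matrix M (4 rows, 3 columns)
M <- matrix(0, nrow = 4, ncol = 3)

# MARGIN = 1: STATS supplies one value per row (1 for row 1, 2 for row 2, ...)
by_row <- sweep(M, 1, 1:4, "+")

# MARGIN = 2: STATS supplies one value per column
by_col <- sweep(M, 2, c(10, 20, 30), "+")

print(by_row)
print(by_col)
```

Because M is all zeros, every entry of by_row equals its row's STATS value and every entry of by_col equals its column's STATS value, which makes the effect of MARGIN easy to see.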
This question is a bit old, but I recently faced the same problem. A typical use of sweep can be found in the source code for the stats function cov.wt, which computes weighted covariance matrices. I'm looking at the code in R 3.0.1. Here sweep is used to subtract out column means before computing the covariance. On line 19 of the code the centering vector is derived:
center <- if (center)
    colSums(wt * x)
else 0
and on line 54 it is swept out of the matrix:
x <- sqrt(wt) * sweep(x, 2, center, check.margin = FALSE)
The author of the code is using the default value FUN = "-", which confused me for a while.
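To see the default FUN = "-" in action, here is a small sketch of the same centering step outside cov.wt. The data and the equal weights are assumptions chosen for illustration; they are not taken from the R source.

```r
# Illustrative data: 4 observations, 2 variables
x <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)

# Assumed weights: equal, summing to 1
wt <- rep(1 / nrow(x), nrow(x))

# Weighted column means, as in the cov.wt snippet
center <- colSums(wt * x)

# Omitting FUN is the same as passing "-"
centered_default  <- sweep(x, 2, center)
centered_explicit <- sweep(x, 2, center, "-")

print(identical(centered_default, centered_explicit))
print(colSums(wt * centered_default))  # weighted column means after centering
```

After centering, the weighted column means are zero (up to floating-point error), which is what the covariance computation needs.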
sweep() can be great for systematically manipulating a large matrix either column by column, or row by row, as shown below:
> print(size)
     Weight Waist Height
[1,]    130    26    140
[2,]    110    24    155
[3,]    118    25    142
[4,]    112    25    175
[5,]    128    26    170
> sweep(size, 2, c(10, 20, 30), "+")
     Weight Waist Height
[1,]    140    46    170
[2,]    120    44    185
[3,]    128    45    172
[4,]    122    45    205
[5,]    138    46    200
Granted, this example is simple, but by changing the STATS and FUN arguments, other manipulations are possible.
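As a rough sketch of what other STATS and FUN choices might look like, the snippet below rebuilds the size matrix from the printed values and tries two variations. The particular statistics (column maxima and the caps 120, 25, 150) are arbitrary assumptions for illustration.

```r
# Rebuild the size matrix from the values printed in the example
size <- matrix(c(130, 110, 118, 112, 128,
                 26, 24, 25, 25, 26,
                 140, 155, 142, 175, 170),
               ncol = 3,
               dimnames = list(NULL, c("Weight", "Waist", "Height")))

# The original manipulation: add a different constant to each column
shifted <- sweep(size, 2, c(10, 20, 30), "+")

# Divide each column by its own maximum, so the largest entry becomes 1
relative <- sweep(size, 2, apply(size, 2, max), "/")

# FUN can be any vectorised two-argument function; here each column is
# capped at an (arbitrary) per-column upper limit
capped <- sweep(size, 2, c(120, 25, 150), pmin)

print(shifted)
print(relative)
print(capped)
```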
One use is when you're computing weighted sums for an array. Where rowSums or colSums can be assumed to mean 'weights=1', sweep can be used prior to this to give a weighted result. This is particularly useful for arrays with >=3 dimensions.
This comes up, e.g., when calculating a weighted covariance matrix as per @James King's example.
Here's another based on a current project:
set.seed(1)
## 2x2x2 array
a1 <- array(as.integer(rnorm(8, 10, 5)), dim=c(2, 2, 2))
## 'element-wise' sum of matrices
## weights = 1
rowSums(a1, dims=2)
## weights
w1 <- c(3, 4)
## a1[, , 1] * 3; a1[, , 2] * 4
a1 <- sweep(a1, MARGIN=3, STATS=w1, FUN="*")
rowSums(a1, dims=2)
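As a quick sanity check, the weighted sum obtained with sweep can be compared against the same sum written out slice by slice. The names a, w, weighted and manual are illustrative and chosen so that the original array is not overwritten.

```r
set.seed(1)
# Same construction as above: a 2 x 2 x 2 array of integers
a <- array(as.integer(rnorm(8, 10, 5)), dim = c(2, 2, 2))
w <- c(3, 4)

# Scale each slice along the third dimension by its weight, then sum the slices
weighted <- rowSums(sweep(a, MARGIN = 3, STATS = w, FUN = "*"), dims = 2)

# The same weighted sum written out by hand
manual <- a[, , 1] * w[1] + a[, , 2] * w[2]

print(all.equal(weighted, manual))
```

With only two slices the manual form is short, but the sweep version stays the same length however many slices the third dimension has.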
You could use the sweep function to scale and center data, as in the following code. Note that the means and sds are arbitrary here (you may have reference values that you want to standardize the data against):
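# 25 distinct random scores in a 5 x 5 matrix (one column per variable)
df <- matrix(sample.int(150, size = 25, replace = FALSE), 5, 5)
# Column means and standard deviations (could be replaced by reference values)
df_means <- colMeans(df)
df_sds <- apply(df, 2, sd)
# Center, scale, then map to mean 50 and sd 10
df_T <- sweep(sweep(df, 2, df_means, "-"), 2, df_sds, "/") * 10 + 50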
This code converts raw scores to T scores (with mean=50 and sd=10):
> df
     [,1] [,2] [,3] [,4] [,5]
[1,]  109    8   89   69   15
[2,]   85   13   25  150   26
[3,]   30   79   48    1  125
[4,]   56   74   23  140  100
[5,]  136  110  112   12   43
> df_T
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 56.15561 39.03218 57.46965 49.22319 40.28305
[2,] 50.42946 40.15594 41.31905 60.87539 42.56695
[3,] 37.30704 54.98946 47.12317 39.44109 63.12203
[4,] 43.51037 53.86571 40.81435 59.43685 57.93136
[5,] 62.59752 61.95672 63.27377 41.02349 46.09661
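For a reproducible check, the sketch below hard-codes the df values printed above (rather than sampling) and confirms that each column of the result has mean 50 and sd 10. The comparison with scale() is an addition for illustration: when you standardize against the data's own column means and sds, scale() gives the same numbers, whereas sweep also lets you plug in external reference values.

```r
# The matrix printed in the example, entered column by column
df <- matrix(c(109, 85, 30, 56, 136,
               8, 13, 79, 74, 110,
               89, 25, 48, 23, 112,
               69, 150, 1, 140, 12,
               15, 26, 125, 100, 43), 5, 5)

# T scores via two nested sweeps
df_T <- sweep(sweep(df, 2, colMeans(df), "-"), 2, apply(df, 2, sd), "/") * 10 + 50

# Same transformation via scale(), which centers and scales by column
via_scale <- scale(df) * 10 + 50

print(all.equal(df_T, via_scale, check.attributes = FALSE))
print(colMeans(df_T))       # each column mean is 50
print(apply(df_T, 2, sd))   # each column sd is 10
```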