I am trying to find R code to normalise my values using the min and max values of a two-column matrix.
My matrix looks like this: Column one (C1) and C2 I.D not
This is a type of non-parametric normalisation, but I would advise you to use another method: calculate the median and interquartile range, subtract the median and divide by the IQR. This will give you a distribution with median 0 and IQR 1.
# Centre on the median and scale by the interquartile range
m <- median( df$C3, na.rm = TRUE )
iqr <- IQR( df$C3, na.rm = TRUE )
df$C3 <- ( df$C3 - m ) / iqr
The method that you propose is extremely sensitive to outliers. If you really want to do it, this is how:
# rng[1] is the minimum, rng[2] the maximum; the result lies in [0, 1]
rng <- range( df$C3, na.rm = TRUE )
df$C3 <- ( df$C3 - rng[1] ) / ( rng[2] - rng[1] )
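To see the sensitivity to outliers in practice, here is a small sketch with made-up values. The vector `x` and the helper names `minmax()` and `robust()` are illustrative only and do not come from the question. A single extreme value pushes the min-max scaled values of the five ordinary points below 0.05, whereas the median/IQR scaling keeps them evenly spread around 0.

```r
# Hypothetical data: five ordinary values plus one extreme outlier
x <- c(1, 2, 3, 4, 5, 100)

# Min-max scaling, as in the question
minmax <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Median/IQR scaling, as suggested above
robust <- function(x) {
  (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)
}

# Compare how the two scalings treat the non-outlying points
round(minmax(x), 3)
round(robust(x), 3)
```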
Given some example data along the lines you describe:
set.seed(1)
d <- data.frame(C1 = LETTERS[1:4], C2 = letters[1:4],
                C3 = runif(4, min = 0, max = 10),
                C4 = runif(4, min = 0, max = 10))
d
then we can write a simple function to do the normalisation you describe
normalise <- function(x, na.rm = TRUE) {
    ranx <- range(x, na.rm = na.rm)
    (x - ranx[1]) / diff(ranx)
}
This can be applied to the data in a number of ways, but here I use apply():
apply(d[, 3:4], 2, normalise)
which gives
R> apply(d[, 3:4], 2, normalise)
            C3        C4
[1,] 0.0000000 0.0000000
[2,] 0.1658867 0.9377039
[3,] 0.4782093 1.0000000
[4,] 1.0000000 0.6179273
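As one of the other ways alluded to above, the sketch below uses lapply() instead of apply(). apply() returns a matrix, while assigning the result of lapply() back into the selected columns keeps everything in a data frame. The name `d_scaled` is an assumption made for this illustration; the example data and normalise() are repeated so the snippet runs on its own.

```r
# normalise() and the example data, repeated from above
normalise <- function(x, na.rm = TRUE) {
  ranx <- range(x, na.rm = na.rm)
  (x - ranx[1]) / diff(ranx)
}

set.seed(1)
d <- data.frame(C1 = LETTERS[1:4], C2 = letters[1:4],
                C3 = runif(4, min = 0, max = 10),
                C4 = runif(4, min = 0, max = 10))

# Work on a copy so the original values stay available in d
d_scaled <- d

# lapply() returns a list with one element per column, which can be
# assigned straight back into the same columns of the data frame
d_scaled[3:4] <- lapply(d_scaled[3:4], normalise)
d_scaled
```

Note that this overwrites C3 and C4 in the copy rather than adding new columns.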
To add these to the existing data, we could do:
d2 <- data.frame(d, apply(d[, 3:4], 2, normalise))
d2
Which gives:
R> d2
  C1 C2       C3       C4      C3.1      C4.1
1  A  a 2.655087 2.016819 0.0000000 0.0000000
2  B  b 3.721239 8.983897 0.1658867 0.9377039
3  C  c 5.728534 9.446753 0.4782093 1.0000000
4  D  d 9.082078 6.607978 1.0000000 0.6179273
Now you mentioned that your data include NA, and we must handle that. You may have noticed that I set the na.rm argument to TRUE in the normalise() function. This means it will work even in the presence of NA:
d3 <- d
d3[c(1,3), c(3,4)] <- NA ## set some NA
d3
R> d3
  C1 C2       C3       C4
1  A  a       NA       NA
2  B  b 3.721239 8.983897
3  C  c       NA       NA
4  D  d 9.082078 6.607978
With normalise() we still get some output that is of use, using only the non-NA data:
R> apply(d3[, 3:4], 2, normalise)
     C3 C4
[1,] NA NA
[2,]  0  1
[3,] NA NA
[4,]  1  0
If we had not done this in writing normalise(), then the output would look something like this (na.rm = FALSE is the default for range() and other similar functions!):
R> apply(d3[, 3:4], 2, normalise, na.rm = FALSE)
     C3 C4
[1,] NA NA
[2,] NA NA
[3,] NA NA
[4,] NA NA
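One related edge case is worth a brief sketch: if a column is constant, or has only one non-NA value, diff(ranx) is 0 and the division produces NaN. The helper below, given the hypothetical name `normalise_safe()`, is one possible guard and not part of the answer above. It returns 0 for the non-missing entries in that situation; whether 0, 0.5, or NA is the appropriate fill value is a judgement call that depends on your analysis.

```r
# A guarded variant of normalise(); the name and the fill value of 0
# are assumptions made for this illustration
normalise_safe <- function(x, na.rm = TRUE) {
  ranx <- range(x, na.rm = na.rm)
  span <- diff(ranx)
  # span is NA when NAs are present and na.rm = FALSE; in that case
  # fall through and let NA propagate as in normalise()
  if (!is.na(span) && span == 0) {
    return(ifelse(is.na(x), NA_real_, 0))
  }
  (x - ranx[1]) / span
}

normalise_safe(c(5, 5, NA, 5))  # constant column: no NaN in the result
normalise_safe(c(1, 2, NA, 3))  # ordinary column: same as normalise()
```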