I have the following data frame in R
ID Season Year Weekday
1 Winter 2017 Monday
2 Winter 2018 Tuesday
3
We can use match with the unique elements of each column:
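library(dplyr)
# funs() is deprecated as of dplyr 0.8.0; across() (dplyr >= 1.0.0) is the current equivalent
dat %>%
  mutate(across(everything(), ~ match(.x, unique(.x))))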
# ID Season Year Weekday
#1 1 1 1 1
#2 2 1 2 2
#3 3 2 1 1
#4 4 2 2 3
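To see why this works, here is a minimal base-R sketch on a single vector. The vector x is made up for illustration and is not the question's data. unique() keeps values in order of first appearance, and match() returns each element's position in that list, so codes are handed out in the order the values first occur.

```r
# Hypothetical example vector, not taken from the question's data
x <- c("Winter", "Winter", "Summer", "Winter", "Autumn")

# unique() keeps values in order of first appearance
u <- unique(x)

# match() gives, for each element of x, its position in u
codes <- match(x, u)

print(u)
print(codes)
```

Because the codes depend on row order, sorting the data differently would probably change which value gets which integer.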
m <- dat
m[] <- lapply(dat, function(x) as.integer(factor(x, unique(x))))
m
#   ID Season Year Weekday
# 1  1      1    1       1
# 2  2      1    2       2
# 3  3      2    1       1
# 4  4      2    2       3
You can simply use as.numeric() to convert a factor to numeric. Each value is replaced by the integer code of its factor level:
library(dplyr)
### Change factor levels to the levels you specified
otest_xgb$Season <- factor(otest_xgb$Season , levels = c("Winter", "Summer"))
otest_xgb$Year <- factor(otest_xgb$Year , levels = c(2017, 2018))
otest_xgb$Weekday <- factor(otest_xgb$Weekday, levels = c("Monday", "Tuesday", "Wednesday"))
otest_xgb %>%
dplyr::mutate_at(c("Season", "Year", "Weekday"), as.numeric)
# ID Season Year Weekday
# 1 1 1 1 1
# 2 2 1 2 2
# 3 3 2 1 1
# 4 4 2 2 NA
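The NA in the last row appears because as.numeric() returns the position of each value in the levels vector, and a value that is not listed in levels has no position. A small sketch with a made-up vector, assuming only base R:

```r
# Hypothetical vector for illustration
days <- c("Monday", "Tuesday", "Monday", "Saturday")

# "Saturday" is deliberately left out of the levels
f <- factor(days, levels = c("Monday", "Tuesday", "Wednesday"))

# Codes follow the order given in levels; unlisted values become NA
codes <- as.numeric(f)
print(codes)
```

If NA is not wanted, every value that occurs in the column has to be included in levels.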
Once you have converted Season, Year and Weekday to factors, use this code to see the dummy (indicator) coding R assigns to each of them:
contrasts(factor(dat$Season))
contrasts(factor(dat$Year))
contrasts(factor(dat$Weekday))
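Note that contrasts() only displays the coding matrix (one row per level). If actual per-row indicator columns are needed, model.matrix() from the stats package is one way to build them. The sketch below uses a hypothetical data frame df rather than the question's data:

```r
# Hypothetical data frame mirroring the Season column used above
df <- data.frame(
  Season = factor(c("Winter", "Winter", "Summer", "Summer"),
                  levels = c("Winter", "Summer"))
)

# Coding matrix: one row per level, one column per non-reference level
print(contrasts(df$Season))

# Per-row indicator column(s), plus an intercept column
mm <- model.matrix(~ Season, data = df)
print(mm)
```

With the default treatment contrasts, the first level (here Winter) is the reference and gets no column of its own.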
Ordered and nominal factor variables need to be handled separately. Directly converting a factor column to integer or numeric assigns values in the lexicographical (alphabetical) order of the levels.
Here Weekday is conceptually ordinal, Year is an integer, and Season is generally nominal. However, this is subjective and depends on the kind of analysis required.
For example, when you convert directly from factor to integer, Wednesday in the Weekday column gets a higher value than both Saturday and Tuesday:
dat[] <- lapply(dat, function(x)as.integer(factor(x)))
dat
# ID Season Year Weekday
#1 1 2 1 1
#2 2 2 2 3
#3 3 1 1 2 (Saturday)
#4 4 1 2 4 (Wednesday): assigned value greater than that of Saturday
Therefore, you can convert directly from factor to integer for the Season and Year columns only. Note that for the Year column this works because its lexicographical order matches its natural order.
dat[c('Season', 'Year')] <- lapply(dat[c('Season', 'Year')],
function(x) as.integer(factor(x)))
Weekday needs to be converted via an ordered factor with the desired order of levels. The default alphabetical ordering might be harmless for general aggregation, but it can drastically affect results when fitting statistical models.
dat$Weekday <- as.integer(factor(dat$Weekday,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"), ordered = TRUE))
dat
# ID Season Year Weekday
#1 1 2 1 1
#2 2 2 2 2
#3 3 1 1 6 (Saturday)
#4 4 1 2 3 (Wednesday): assigned value less than that of Saturday
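One detail worth checking: the integer codes come from the order of the levels argument, not from ordered = TRUE. The sketch below (vector x is made up for illustration) compares both variants. What ordered = TRUE adds is that comparisons such as < become meaningful on the factor itself.

```r
# Desired weekday order, Monday first
lv <- c("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")

# Hypothetical example values
x <- c("Monday", "Tuesday", "Saturday", "Wednesday")

# Integer codes with and without ordered = TRUE
plain <- as.integer(factor(x, levels = lv))
ord   <- as.integer(factor(x, levels = lv, ordered = TRUE))

print(plain)
print(identical(plain, ord))

# Comparisons only work on the ordered factor
f_ord <- factor(x, levels = lv, ordered = TRUE)
print(f_ord[4] < f_ord[3])  # is Wednesday before Saturday?
```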
Data Used:
dat <- read.table(text=" ID Season Year Weekday
1 Winter 2017 Monday
2 Winter 2018 Tuesday
3 Summer 2017 Saturday
4 Summer 2018 Wednesday", header = TRUE)