I have trouble generating the following dummy-variables in R:
I'm analyzing yearly time series data (time period 1948-2009). I have two questions:
If you want to get K dummy variables, instead of K-1, try:
dummies = table(1:length(year),as.factor(year))
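To make the table() trick concrete, here is a minimal sketch on a made-up year vector (the values are only for illustration, not from the original data). Each row gets exactly one 1, in the column of its own year, so you end up with one column per distinct year:

```r
# Hypothetical input: a short vector of years (illustrative values only)
year <- c(1956, 1957, 1957, 1958)

# Cross-tabulate row index against year: one row per observation,
# one column per distinct year, a 1 where they match
dummies <- table(seq_along(year), as.factor(year))

# Drop the "table" class to get a plain integer matrix
dummies <- unclass(dummies)

colnames(dummies)  # one column name per distinct year
```

seq_along(year) is used here instead of 1:length(year) only because it also behaves sensibly for an empty vector.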
I read this on the Kaggle forum:
#Generate example dataframe with character column
example <- as.data.frame(c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
names(example) <- "strcol"
#For every unique value in the string column, create a new 1/0 column
#This is what Factors do "under-the-hood" automatically when passed to function requiring numeric data
for (level in unique(example$strcol)) {
  example[paste("dummy", level, sep = "_")] <- ifelse(example$strcol == level, 1, 0)
}
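If you would rather avoid the explicit loop, the same columns can probably be built in one go with sapply. This is only a sketch of an alternative, not what the Kaggle post did; the name lv is just a helper I introduced:

```r
# Same example data as above, kept as character strings
example <- data.frame(
  strcol = c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"),
  stringsAsFactors = FALSE
)

# lv is a helper name (my own): the distinct values, in order of appearance
lv <- unique(example$strcol)

# One integer 0/1 column per distinct value
dummies <- sapply(lv, function(level) as.integer(example$strcol == level))
colnames(dummies) <- paste("dummy", lv, sep = "_")

# Attach the dummy columns to the original data frame
example <- cbind(example, dummies)
```

The result has the same dummy_A, dummy_B, ... columns as the loop version, stored as integers rather than doubles.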
Convert your data to a data.table, then use assignment by reference (:=) together with row filtering:
library(data.table)
dt <- as.data.table(your.dataframe.or.whatever)
dt[, is.1957 := 0]
dt[year == 1957, is.1957 := 1]
Proof-of-concept toy example:
library(data.table)
dt <- as.data.table(cbind(c(1, 1, 1), c(2, 2, 3)))
dt[, is.3 := 0]
dt[V2 == 3, is.3 := 1]
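If you need one indicator per year rather than a single one, the same by-reference idea can be looped with data.table's set(). A rough sketch, assuming a column called year and made-up values (both are my assumptions, not from the question):

```r
library(data.table)

# Hypothetical data: a single 'year' column with illustrative values
dt <- data.table(year = c(1956, 1957, 1957, 1958))

# Add one 0/1 column per distinct year, modifying dt in place
for (y in unique(dt$year)) {
  set(dt, j = paste0("is.", y), value = as.integer(dt$year == y))
}

names(dt)
```

set() avoids the overhead of repeated [.data.table calls inside a loop, which may matter if there are many distinct years.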
I use the following function (for data.table):
# For a data.table object and a factor variable var.name, this function creates dummy variables named "var.name: (level1)"
factorToDummy <- function(dtable, var.name){
  # Validate input: a data.table containing a factor column called var.name
  stopifnot(is.data.table(dtable))
  stopifnot(var.name %in% names(dtable))
  stopifnot(is.factor(dtable[, get(var.name)]))
  # Build one column name per factor level
  new.names <- dtable[, paste0(var.name, ": ", levels(get(var.name)))]
  # Add the logical dummy columns by reference
  dtable[, (new.names) := transpose(lapply(get(var.name), FUN = function(x){x == levels(get(var.name))})) ]
  cat(paste("\nAdded dummy variables: ", paste0(new.names, collapse = ", ")))
}
Usage:
data <- data.table(data)
data[, x:= droplevels(x)]
factorToDummy(data, "x")
This one-liner in base R:
model.matrix( ~ iris$Species - 1)
gives
iris$Speciessetosa iris$Speciesversicolor iris$Speciesvirginica
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
5 1 0 0
6 1 0 0
7 1 0 0
8 1 0 0
9 1 0 0
10 1 0 0
11 1 0 0
12 1 0 0
13 1 0 0
14 1 0 0
15 1 0 0
16 1 0 0
17 1 0 0
18 1 0 0
19 1 0 0
20 1 0 0
21 1 0 0
22 1 0 0
23 1 0 0
24 1 0 0
25 1 0 0
26 1 0 0
27 1 0 0
28 1 0 0
29 1 0 0
30 1 0 0
31 1 0 0
32 1 0 0
33 1 0 0
34 1 0 0
35 1 0 0
36 1 0 0
37 1 0 0
38 1 0 0
39 1 0 0
40 1 0 0
41 1 0 0
42 1 0 0
43 1 0 0
44 1 0 0
45 1 0 0
46 1 0 0
47 1 0 0
48 1 0 0
49 1 0 0
50 1 0 0
51 0 1 0
52 0 1 0
53 0 1 0
54 0 1 0
55 0 1 0
56 0 1 0
57 0 1 0
58 0 1 0
59 0 1 0
60 0 1 0
61 0 1 0
62 0 1 0
63 0 1 0
64 0 1 0
65 0 1 0
66 0 1 0
67 0 1 0
68 0 1 0
69 0 1 0
70 0 1 0
71 0 1 0
72 0 1 0
73 0 1 0
74 0 1 0
75 0 1 0
76 0 1 0
77 0 1 0
78 0 1 0
79 0 1 0
80 0 1 0
81 0 1 0
82 0 1 0
83 0 1 0
84 0 1 0
85 0 1 0
86 0 1 0
87 0 1 0
88 0 1 0
89 0 1 0
90 0 1 0
91 0 1 0
92 0 1 0
93 0 1 0
94 0 1 0
95 0 1 0
96 0 1 0
97 0 1 0
98 0 1 0
99 0 1 0
100 0 1 0
101 0 0 1
102 0 0 1
103 0 0 1
104 0 0 1
105 0 0 1
106 0 0 1
107 0 0 1
108 0 0 1
109 0 0 1
110 0 0 1
111 0 0 1
112 0 0 1
113 0 0 1
114 0 0 1
115 0 0 1
116 0 0 1
117 0 0 1
118 0 0 1
119 0 0 1
120 0 0 1
121 0 0 1
122 0 0 1
123 0 0 1
124 0 0 1
125 0 0 1
126 0 0 1
127 0 0 1
128 0 0 1
129 0 0 1
130 0 0 1
131 0 0 1
132 0 0 1
133 0 0 1
134 0 0 1
135 0 0 1
136 0 0 1
137 0 0 1
138 0 0 1
139 0 0 1
140 0 0 1
141 0 0 1
142 0 0 1
143 0 0 1
144 0 0 1
145 0 0 1
146 0 0 1
147 0 0 1
148 0 0 1
149 0 0 1
150 0 0 1
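The column names above come out as iris$Speciessetosa and so on because the formula refers to iris$Species directly. A small variation, passing the data frame through the data argument and then stripping the variable-name prefix, gives tidier names. Treat this as a sketch; mm is just a name I picked:

```r
# Same idea, but let model.matrix look up Species inside the data frame
mm <- model.matrix(~ Species - 1, data = iris)

# Column names are now "Speciessetosa", etc.; strip the "Species" prefix
colnames(mm) <- sub("^Species", "", colnames(mm))

head(mm, 3)
```

Dropping the "- 1" from the formula keeps the intercept and leaves K-1 dummy columns, which is usually what you want inside a regression.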
The simplest way to produce these dummy variables is something like the following:
> print(year)
[1] 1956 1957 1957 1958 1958 1959
> dummy <- as.numeric(year == 1957)
> print(dummy)
[1] 0 1 1 0 0 0
> dummy2 <- as.numeric(year >= 1957)
> print(dummy2)
[1] 0 1 1 1 1 1
More generally, you can use ifelse to choose between two values depending on a condition. So if instead of a 0-1 dummy variable, for some reason you wanted to use, say, 4 and 7, you could use ifelse(year == 1957, 4, 7).
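Applied to the full period from the question, that might look like the sketch below. I am assuming year is simply the sequence 1948 to 2009 with one observation per year, and the variable names are my own:

```r
# Assumed: one observation per year over the period in the question
year <- 1948:2009

# 1 only in 1957, 0 elsewhere
is_1957 <- as.numeric(year == 1957)

# 1 from 1957 onwards, 0 before
from_1957 <- as.numeric(year >= 1957)

# Arbitrary two-value coding via ifelse: 4 in 1957, 7 otherwise
coded <- ifelse(year == 1957, 4, 7)

# Collect everything in one data frame for inspection
df <- data.frame(year, is_1957, from_1957, coded)
```

Both dummies are plain numeric vectors, so they can go straight into lm() or similar alongside the rest of the series.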