In Python, scikit-learn has a handy class called LabelEncoder that maps categorical levels (strings) to an integer representation.
Is there anything in R to do this?
Here is an easy and neat solution:
The superml package (https://www.rdocumentation.org/packages/superml/versions/0.5.3) provides a LabelEncoder class (https://www.rdocumentation.org/packages/superml/versions/0.5.3/topics/LabelEncoder):
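install.packages("superml")
library(superml)
# sample_dat is assumed to be an existing data frame with a categorical column named `column`
lbl <- LabelEncoder$new()
# Learn the mapping from levels to integers
lbl$fit(sample_dat$column)
# fit_transform() fits and encodes in one step, so the separate fit() above is optional
sample_dat$column <- lbl$fit_transform(sample_dat$column)
# Recover the original labels from the integer codes
decode_names <- lbl$inverse_transform(sample_dat$column)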
Create your vector of data:
colors <- c("red", "red", "blue", "green")
Create a factor:
factors <- factor(colors)
Convert the factor to numbers:
as.numeric(factors)
Output (note that the codes follow the alphabetical order of the levels):
# [1] 3 3 1 2
You can also set a custom numbering system by passing your own levels (note that the output now follows the "rainbow color order" that I defined):
rainbow <- c("red","orange","yellow","green","blue","purple")
ordered <- factor(colors, levels = rainbow)
as.numeric(ordered)
# [1] 1 1 5 4
See ?factor.
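To get closer to the fit / transform / inverse_transform workflow of scikit-learn's LabelEncoder, the factor approach above can be wrapped in a few small helpers. This is only a sketch in base R: the function names label_fit, label_transform and label_inverse are hypothetical, not part of any package, and the codes are 1-based (R style) rather than 0-based as in scikit-learn.

```r
# Hypothetical helpers (not from any package) built on base R factors.

# "fit": remember the sorted unique levels
label_fit <- function(x) sort(unique(as.character(x)))

# "transform": map values to integer codes using the stored levels
label_transform <- function(x, levels) as.integer(factor(x, levels = levels))

# "inverse_transform": map integer codes back to the original labels
label_inverse <- function(codes, levels) levels[codes]

colors <- c("red", "red", "blue", "green")
lv <- label_fit(colors)
codes <- label_transform(colors, lv)
codes
# [1] 3 3 1 2
label_inverse(codes, lv)
# [1] "red"   "red"   "blue"  "green"

# Values that were not seen during "fit" become NA
label_transform(c("green", "pink"), lv)
# [1]  2 NA
```

Storing the levels separately means the same mapping can be reused on new data, which as.numeric(factor(x)) alone does not guarantee.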
It's hard to believe that no one has mentioned caret's dummyVars function.
This is a widely searched question, and people don't want to write their own methods or copy and paste other users' methods; they want a package, and caret is the closest thing to sklearn in R.
EDIT: I now realize that what the user actually wants is to turn strings into integer codes, which is just as.numeric(as.factor(x)), but I'm going to leave this here because one-hot encoding is usually the more appropriate way to encode categorical data that has no natural order.
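For comparison, here is a minimal sketch of both encodings on a toy data frame. The data frame df and its color column are made up for illustration, and the example assumes the caret package is installed.

```r
library(caret)  # assumes caret is installed

# Toy data, invented for this example
df <- data.frame(color = factor(c("red", "red", "blue", "green")))

# Integer codes (what the question asks for)
as.numeric(as.factor(df$color))
# [1] 3 3 1 2

# One-hot encoding: one indicator column per level
dv <- dummyVars(~ color, data = df)
onehot <- predict(dv, newdata = df)
onehot
```

Each row of onehot contains a single 1 in the column for that row's level, so no artificial ordering between the colors is introduced.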