dummy-variable

Handling unknown values for label encoding

╄→гoц情女王★ posted on 2019-12-03 02:15:42
How can I handle unknown values for label encoding in scikit-learn? The label encoder simply raises an exception saying that new labels were detected. What I want is to encode categorical variables with a one-hot encoder. However, scikit-learn does not support strings for that, so I used a label encoder on each column. My problem is that unknown labels show up in the cross-validation step of the pipeline. The basic one-hot encoder has an option to ignore such cases. Applying pandas.get_dummies or cat.codes a priori is not sufficient, as the pipeline should work with real-life, fresh incoming data
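One possible way to get the ignore-unknown behaviour the question asks for, sketched under the assumption that scikit-learn 0.20 or later is installed (from that version on, OneHotEncoder accepts string columns directly, so a per-column label encoder is not required). The colour values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: a single categorical column of strings.
train = np.array([["red"], ["green"], ["blue"]], dtype=object)
# Hypothetical fresh data with a category ("purple") never seen during fit.
fresh = np.array([["green"], ["purple"]], dtype=object)

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Learned categories are sorted: blue, green, red.
encoded = encoder.transform(fresh).toarray()
print(encoder.categories_)
print(encoded)
```

Because the encoder is fitted once and then reused, it can be placed as a step in a scikit-learn Pipeline, so each cross-validation fold is encoded with the categories learned from its own training split.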

create a sparse matrix; given the indices of non-zero elements for creation of dummy variables of a categorical column of a large dataset

懵懂的女人 posted on 2019-12-02 13:41:51
Question: I'm trying to use a sparse matrix to generate dummy variables for a set of data with 5.8 million rows and two categorical columns. The structure of the data is:

mydata: data.table of 5,800,000 rows and two categorical (in integer format) variables Var1 and Var2
nlevel(Var1): 210,000 (levels include all numbers between 1 and 210,000)
nlevel(Var2): 500 (levels include all numbers between 1 and 500)

Here's an example of mydata:

Var_1  Var_2
1      4
1      2
2      7
5      9
5      500
. . .
200    6
200    2
200    80
. . .

I

Python Pandas: create a new column for each different value of a source column (with boolean output as column values)

雨燕双飞 posted on 2019-12-02 05:59:14
Question: I am trying to split a source column of a dataframe into several columns based on its content, and then fill these newly generated columns with a boolean 1 or 0 in the following way:

Original dataframe:

ID  source_column
A   value 1
B   NaN
C   value 2
D   value 3
E   value 2

Generating the following output:

ID  source_column  value 1  value 2  value 3
A   value 1        1        0        0
B   NaN            0        0        0
C   value 2        0        1        0
D   value 3        0        0        1
E   value 2        0        1        0

I thought about manually creating each different column, and then with a function for
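A minimal sketch of one way to produce that layout with pandas.get_dummies, which by default leaves rows with a missing value as all zeros. The dataframe is rebuilt from the example in the question; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the example dataframe from the question.
df = pd.DataFrame({
    "ID": ["A", "B", "C", "D", "E"],
    "source_column": ["value 1", np.nan, "value 2", "value 3", "value 2"],
})

# One indicator column per distinct non-null value; the NaN row gets zeros.
# astype(int) gives 1/0 regardless of the pandas version's default dtype.
dummies = pd.get_dummies(df["source_column"]).astype(int)

# Attach the indicator columns next to the original columns.
result = pd.concat([df, dummies], axis=1)
print(result)
```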

create a sparse matrix; given the indices of non-zero elements for creation of dummy variables of a categorical column of a large dataset

為{幸葍}努か posted on 2019-12-02 05:58:18
I'm trying to use a sparse matrix to generate dummy variables for a set of data with 5.8 million rows and two categorical columns. The structure of the data is:

mydata: data.table of 5,800,000 rows and two categorical (in integer format) variables Var1 and Var2
nlevel(Var1): 210,000 (levels include all numbers between 1 and 210,000)
nlevel(Var2): 500 (levels include all numbers between 1 and 500)

Here's an example of mydata:

Var_1  Var_2
1      4
1      2
2      7
5      9
5      500
. . .
200    6
200    2
200    80
. . .

I'm using a sparse Matrix (sparse_Mx) to create the dummy variable matrix which would be of the form:

R: create dummy variables based on a categorical variable *of lists* [duplicate]

荒凉一梦 posted on 2019-12-01 15:24:56
Question: This question already has answers here: How can I split a character string into column vectors with a 1/0 value flag? (7 answers) Closed 7 months ago.

I have a data frame with a categorical variable holding lists of strings, with variable length (it is important because otherwise this question would be a duplicate of this or this), e.g.:

    df <- data.frame(x = 1:5)
    df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
    df

      x       y
    1 1       A
    2 2    A, B
    3 3       C
    4 4 B, D, C
    5 5       E

And the desired form is

Speed up this loop to create dummy columns with data.table and set in R [duplicate]

一世执手 posted on 2019-12-01 06:31:30
This question already has an answer here: Creating dummy variables in R data.table (1 answer)

I have a data table and I want to create a new column for each unique day, and then assign a 1 in each row where the day matches the column name. I have done this using a for loop, but I was wondering if there was any way to optimise it using data.table and set. Here is an example:

    dt <- data.table(Week_Day = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
    Day <- unique(dt$Week_Day)
    for (i in 1:length(Day)) {
      if (Day[i] != "Sunday") {
        dt[, Day[i] := ifelse(Week_Day == Day[i

Panda's get_dummies vs. Sklearn's OneHotEncoder() :: What are the pros and cons?

纵饮孤独 posted on 2019-11-29 19:06:00
I'm learning different methods to convert categorical variables to numeric for machine-learning classifiers. I came across the pd.get_dummies method and sklearn.preprocessing.OneHotEncoder() and I wanted to see how they differed in terms of performance and usage. I found a tutorial on how to use OneHotEncoder() at https://xgdgsc.wordpress.com/2015/03/20/note-on-using-onehotencoder-in-scikit-learn-to-work-on-categorical-features/ since the sklearn documentation wasn't too helpful on this feature. I have a feeling I'm not doing it correctly... but can someone explain the pros and cons of using pd

Converting pandas column of comma-separated strings into dummy variables

元气小坏坏 posted on 2019-11-28 10:23:06
In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column, however, has multiple values separated by commas:

0    'a'
1    'a,b,c'
2    'a,b,d'
3    'd'
4    'c,d'

Ultimately, I'd want to have binary columns for each possible discrete value; in other words, the final column count equals the number of unique values in the original column. I imagine I'd have to use split() to get each separate value, but I'm not sure what to do afterwards. Any hint much appreciated!

Edit: Additional twist. The column has null values. And in response to a comment, the following is the desired output.
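One approach that may fit here is pandas' Series.str.get_dummies, which splits each string on a separator and builds one 0/1 column per distinct token. The sketch below uses the five values from the question plus one null row added as an assumption to cover the "null values" twist; the desired output itself is not shown in the excerpt, so the all-zero treatment of nulls is a guess at what is wanted.

```python
import pandas as pd

# The five example values from the question, plus a hypothetical null row.
s = pd.Series(["a", "a,b,c", "a,b,d", "d", "c,d", None])

# Replace nulls with an empty string so they produce no tokens, then split
# on commas and build one indicator column per distinct value.
dummies = s.fillna("").str.get_dummies(sep=",")
print(dummies)
```

The resulting columns are a, b, c and d, so the column count equals the number of unique values, and the null row comes out as all zeros.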

Keep same dummy variable in training and testing data

佐手、 posted on 2019-11-27 17:52:07
I am building a prediction model in Python with two separate training and testing sets. The training data contains numeric categorical variables, e.g., zip code [91521, 23151, 12355, ...], and also string categorical variables, e.g., city ['Chicago', 'New York', 'Los Angeles', ...]. To train the model, I first use 'pd.get_dummies' to get dummy variables for these variables, and then fit the model with the transformed training data. I apply the same transformation to my test data and predict the result using the trained model. However, I got the error 'ValueError: Number of features of the
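A feature-count mismatch of this kind typically arises when the test set contains a different set of categories than the training set, so pd.get_dummies produces different columns. One common workaround, sketched here with made-up city values, is to reindex the test dummies to the training columns: categories missing from the test set are filled with 0, and categories unseen in training are dropped.

```python
import pandas as pd

# Hypothetical training and test sets with different city values.
train = pd.DataFrame({"city": ["Chicago", "New York", "Los Angeles"]})
test = pd.DataFrame({"city": ["Chicago", "Boston"]})

# Dummy columns derived from the training data define the feature layout.
X_train = pd.get_dummies(train, columns=["city"])

# Align the test dummies to that layout: add missing columns as 0 and
# drop columns (here city_Boston) that the model never saw.
X_test = pd.get_dummies(test, columns=["city"]).reindex(
    columns=X_train.columns, fill_value=0
)
print(X_test)
```

Dropping unseen categories loses information about them, so whether this is acceptable depends on the model; it is a design choice rather than the only option.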

Panda's get_dummies vs. Sklearn's OneHotEncoder() :: What are the pros and cons?

南笙酒味 posted on 2019-11-27 09:09:47
Question: I'm learning different methods to convert categorical variables to numeric for machine-learning classifiers. I came across the pd.get_dummies method and sklearn.preprocessing.OneHotEncoder() and I wanted to see how they differed in terms of performance and usage. I found a tutorial on how to use OneHotEncoder() at https://xgdgsc.wordpress.com/2015/03/20/note-on-using-onehotencoder-in-scikit-learn-to-work-on-categorical-features/ since the sklearn documentation wasn't too helpful on this