dummy-variable

Converting pandas column of comma-separated strings into dummy variables

余生长醉 submitted on 2019-11-27 03:33:23
Question: In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column, however, has multiple values separated by commas:

    0    'a'
    1    'a,b,c'
    2    'a,b,d'
    3    'd'
    4    'c,d'

Ultimately, I want a binary column for each possible discrete value; in other words, the final column count equals the number of unique values in the original column. I imagine I'd have to use split() to get each separate value, but I'm not sure what to do afterwards. Any hint much appreciated! Edit:
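
One possible approach, sketched here under assumptions: pandas has a string method, Series.str.get_dummies, that splits each value on a separator and builds one indicator column per distinct token. The column name "tags" below is made up for illustration; the five values are the ones from the question.

```python
import pandas as pd

# Sample data from the question; the column name "tags" is hypothetical.
df = pd.DataFrame({"tags": ["a", "a,b,c", "a,b,d", "d", "c,d"]})

# Split each string on "," and create one 0/1 column per distinct token.
dummies = df["tags"].str.get_dummies(sep=",")

print(dummies)
```

This gives one column per unique value (a, b, c, d here), so there is no need to call split() by hand first.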

How to create dummy variable columns for thousands of categories in Google BigQuery?

最后都变了- submitted on 2019-11-26 23:08:33
I have a simple table with 2 columns, UserID and Category, and each UserID can repeat with a few categories, like so:

    UserID  Category
    ------  --------
    1       A
    1       B
    2       C
    3       A
    3       C
    3       B

I want to "dummify" this table, i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):

    UserID  A   B   C
    ------  --  --  --
    1       1   1   0
    2       0   0   1
    3       1   1   1

My problem is that I have THOUSANDS of categories (not just 3 as in this example), so this cannot be accomplished efficiently with a CASE WHEN statement. So my

Creating dummy variables in R data.table

痴心易碎 submitted on 2019-11-26 22:09:04
I am working with an extremely large dataset in R. I have been operating with data frames and have decided to switch to data.tables to help speed up operations. I am having trouble understanding the J operations; in particular, I'm trying to generate dummy variables but I can't figure out how to code conditional operations within data.tables[]. MWE:

    test <- data.table("index"=rep(letters[1:10],100),"var1"=rnorm(1000,0,1))

What I would like to do is to add columns a through j as dummy variables such that column a would have a value of 1 when index == "a" and 0 otherwise. In the data.frame

Pandas: Get Dummies

社会主义新天地 submitted on 2019-11-26 17:57:32
I have the following dataframe:

       amount catcode        cid  cycle        date di  feccandid type
    0    1000   E1600  N00029285   2014  2014-05-15  D  H8TX22107  24K
    1    5000   G4600  N00026722   2014  2013-10-22  D  H4TX28046  24K
    2       4   C2100  N00030676   2014  2014-03-26  D  H0MO07113  24Z

I want to make dummy variables for the values in column type. There are about 15. I have tried this:

    pd.get_dummies(df['type'])

And it returns this:

                24A  24C  24E  24F  24K  24N  24P  24R  24Z
    date
    2014-05-15    0    0    0    0    1    0    0    0    0
    2013-10-22    0    0    0    0    1    0    0    0    0
    2014-03-26    0    0    0    0    0    0    0    0    1

What I would like is to have a dummy variable column for each unique value in Type
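
The excerpt is cut off before the desired output, so the following is only a guess at what was wanted: dummy columns kept alongside the other columns of the frame. A minimal sketch, assuming a cut-down version of the dataframe (only amount, date and type are reproduced) and using the columns argument of pd.get_dummies:

```python
import pandas as pd

# Reduced version of the dataframe from the question (subset of columns).
df = pd.DataFrame({
    "amount": [1000, 5000, 4],
    "date": ["2014-05-15", "2013-10-22", "2014-03-26"],
    "type": ["24K", "24K", "24Z"],
})

# Replace "type" with one indicator column per value seen in the data;
# the remaining columns are kept. dtype=int yields 0/1 instead of booleans.
result = pd.get_dummies(df, columns=["type"], dtype=int)

print(result)
```

Note that only values actually present in the column get an indicator, so this three-row sample produces type_24K and type_24Z rather than all of the roughly 15 types mentioned.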

Dummy variables when not all categories are present

◇◆丶佛笑我妖孽 submitted on 2019-11-26 16:08:09
I have a set of dataframes where one of the columns contains a categorical variable. I'd like to convert it to several dummy variables, in which case I'd normally use get_dummies. What happens is that get_dummies looks at the data available in each dataframe to find out how many categories there are, and thus creates the appropriate number of dummy variables. However, in the problem I'm working on right now, I actually know in advance what the possible categories are. But when looking at each dataframe individually, not all categories necessarily appear. My question is: is there a way to pass to
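
One way this might be handled, as a sketch rather than a definitive answer: convert the column to a pandas Categorical with the full, known list of categories before calling get_dummies, so that categories missing from a given dataframe still get an all-zero column. The column name "cat" and the list known_categories are invented for the example.

```python
import pandas as pd

# Hypothetical full list of categories, known in advance.
known_categories = ["a", "b", "c"]

# This particular dataframe happens to contain no "b".
df = pd.DataFrame({"cat": ["a", "a", "c"]})

# Declare the full category set so unobserved categories are retained.
df["cat"] = pd.Categorical(df["cat"], categories=known_categories)

# get_dummies emits one column per declared category, observed or not.
dummies = pd.get_dummies(df["cat"], dtype=int)

print(dummies)
```

An alternative with the same effect would be to call get_dummies on the raw column and then reindex the result with columns=known_categories and fill_value=0.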

How to force R to use a specified factor level as reference in a regression?

徘徊边缘 submitted on 2019-11-26 14:56:32
How can I tell R to use a certain level as the reference if I use binary explanatory variables in a regression? It's just using some level by default.

    lm(x ~ y + as.factor(b))

with b in {0, 1, 2, 3, 4}. Let's say I want to use 3 instead of the zero that R uses.

See the relevel() function. Here is an example:

    set.seed(123)
    x <- rnorm(100)
    DF <- data.frame(x = x, y = 4 + (1.5*x) + rnorm(100, sd = 2), b = gl(5, 20))
    head(DF)
    str(DF)
    m1 <- lm(y ~ x + b, data = DF)
    summary(m1)

Now alter the factor b in DF by use of the relevel() function:

    DF <- within(DF, b <- relevel(b, ref = 3))
    m2 <- lm(y ~ x +

Split a string column into several dummy variables

一曲冷凌霜 submitted on 2019-11-26 14:39:39
Question: As a relatively inexperienced user of the data.table package in R, I've been trying to process one text column into a large number of indicator columns (dummy variables), with a 1 in each column indicating that a particular sub-string was found within the string column. For example, I want to process this:

    ID  String
    1   a$b
    2   b$c
    3   c

into this:

    ID  String  a  b  c
    1   a$b     1  1  0
    2   b$c     0  1  1
    3   c       0  0  1

I have figured out how to do the processing, but it takes longer to run than I would like, and I

Pandas: Get Dummies

可紊 submitted on 2019-11-26 09:53:47
Question: I have the following dataframe:

       amount catcode        cid  cycle        date di  feccandid type
    0    1000   E1600  N00029285   2014  2014-05-15  D  H8TX22107  24K
    1    5000   G4600  N00026722   2014  2013-10-22  D  H4TX28046  24K
    2       4   C2100  N00030676   2014  2014-03-26  D  H0MO07113  24Z

I want to make dummy variables for the values in column type. There are about 15. I have tried this:

    pd.get_dummies(df['type'])

And it returns this:

                24A  24C  24E  24F  24K  24N  24P  24R  24Z
    date
    2014-05-15    0    0    0    0    1    0    0    0    0
    2013-10-22    0    0    0    0    1    0    0    0    0
    2014-03

How to create dummy variable columns for thousands of categories in Google BigQuery?

放肆的年华 submitted on 2019-11-26 08:34:16
Question: I have a simple table with 2 columns, UserID and Category, and each UserID can repeat with a few categories, like so:

    UserID  Category
    ------  --------
    1       A
    1       B
    2       C
    3       A
    3       C
    3       B

I want to "dummify" this table, i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):

    UserID  A   B   C
    ------  --  --  --
    1       1   1   0
    2       0   0   1
    3       1   1   1

My problem is that I have THOUSANDS of categories (not

Creating dummy variables in R data.table

做~自己de王妃 submitted on 2019-11-26 08:11:33
Question: I am working with an extremely large dataset in R. I have been operating with data frames and have decided to switch to data.tables to help speed up operations. I am having trouble understanding the J operations; in particular, I'm trying to generate dummy variables but I can't figure out how to code conditional operations within data.tables[]. MWE:

    test <- data.table("index"=rep(letters[1:10],100),"var1"=rnorm(1000,0,1))

What I would like to do is to add columns a through j as