categorical-data

Python equivalent of daisy() in the cluster package of R

微笑、不失礼 submitted on 2019-12-20 09:38:17
Question: I have a dataset that contains both categorical (nominal and ordinal) and numerical attributes. I want to calculate the (dis)similarity matrix across my observations using these mixed attributes. Using the daisy() function of the cluster package in R, I can easily get a dissimilarity matrix as follows:

    if (!require("cluster")) { install.packages("cluster"); require("cluster") }
    data(flower)
    as.matrix(daisy(flower, metric = "gower"))

This uses the Gower metric to deal with the nominal variables
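For readers looking for the idea behind the Gower metric rather than a drop-in replacement, here is a minimal pure-Python sketch. It assumes only numeric and nominal columns with no missing values: numeric differences are scaled by the column range, nominal columns contribute 0 or 1, and the contributions are averaged. The function name gower_matrix and the toy rows are hypothetical, and daisy()'s handling of ordinal variables, asymmetric binaries, and NA values is not reproduced.

```python
def gower_matrix(rows, numeric_cols):
    """Pairwise Gower dissimilarity for records mixing numeric and nominal values.

    rows: list of equal-length tuples.
    numeric_cols: set of column indexes treated as numeric; the rest are nominal.
    """
    n = len(rows)
    n_cols = len(rows[0])

    # Range of each numeric column, used to scale differences into [0, 1].
    ranges = {}
    for c in numeric_cols:
        values = [row[c] for row in rows]
        ranges[c] = max(values) - min(values)

    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for c in range(n_cols):
                if c in numeric_cols:
                    if ranges[c]:  # a constant column contributes nothing
                        total += abs(rows[i][c] - rows[j][c]) / ranges[c]
                else:
                    # Nominal column: simple mismatch indicator.
                    total += 0.0 if rows[i][c] == rows[j][c] else 1.0
            dist[i][j] = dist[j][i] = total / n_cols
    return dist


# Toy data (hypothetical): one numeric column, one nominal column.
rows = [(1.0, "red"), (3.0, "red"), (5.0, "blue")]
matrix = gower_matrix(rows, numeric_cols={0})
```

With these three rows the numeric range is 4, so rows 0 and 1 differ by (2/4 + 0) / 2 = 0.25.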

R: Expanding an R factor into dummy columns for every factor level

南笙酒味 submitted on 2019-12-20 03:08:13
Question: I have a fairly large data frame in R with two columns. I am trying to create dummy variables from the Code column (a factor with 858 levels). The problem is that RStudio always crashes when I try to do that.

    > str(d)
    'data.frame': 649226 obs. of 2 variables:
     $ User: int 210 210 210 210 269 317 317 317 317 326 ...
     $ Code : Factor w/ 858 levels "AA02","AA03",..: 164 494 538 626 464 496 435 464 475 163 ...

The User column is not unique, meaning that there can be several rows

Creating categorical variables from mutually exclusive dummy variables

让人想犯罪 __ submitted on 2019-12-19 05:45:38
Question: My question elaborates on a previously answered question about combining multiple dummy variables into a single categorical variable. In that earlier question, the categorical variable was created from dummy variables that were NOT mutually exclusive. In my case, the dummy variables are mutually exclusive because they represent crossed experimental conditions in a 2x2 between-subjects factorial design (that also has a within-subjects component which I'm not addressing here
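The mutually exclusive case can be sketched as follows, in Python rather than the asker's R: each record has exactly one dummy set to 1, and the name of that dummy becomes the category. The column names (cond_a1b1 and so on) and the helper dummies_to_category are hypothetical, since the question does not show the actual data.

```python
def dummies_to_category(record, dummy_names):
    """Return the name of the single dummy equal to 1 in this record.

    Raises ValueError when the dummies are not mutually exclusive
    (zero or several are set), which would indicate a data problem.
    """
    active = [name for name in dummy_names if record[name] == 1]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active dummy, found {active}")
    return active[0]


# Hypothetical 2x2 design: four condition dummies, one active per subject.
dummy_names = ["cond_a1b1", "cond_a1b2", "cond_a2b1", "cond_a2b2"]
records = [
    {"cond_a1b1": 1, "cond_a1b2": 0, "cond_a2b1": 0, "cond_a2b2": 0},
    {"cond_a1b1": 0, "cond_a1b2": 0, "cond_a2b1": 1, "cond_a2b2": 0},
]
conditions = [dummies_to_category(r, dummy_names) for r in records]
```

Raising on zero or multiple active dummies is a design choice here: it surfaces rows that violate the mutual-exclusivity assumption instead of silently picking one.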

R: creating a categorical variable from a numerical variable and custom/open-ended/single-valued intervals

為{幸葍}努か submitted on 2019-12-19 00:24:32
Question: I often find myself trying to create a categorical variable from a numerical variable plus a user-provided set of ranges. For instance, say that I have a data.frame with a numeric variable df$V and would like to create a new variable df$VCAT such that:

    df$VCAT = 0 if df$V is equal to 0
    df$VCAT = 1 if df$V is between 0 and 10 (i.e. (0,10))
    df$VCAT = 2 if df$V is equal to 10 (i.e. [10,10])
    df$VCAT = 3 if df$V is between 10 and 20 (i.e. (10,20))
    df$VCAT = 4 if df$V is greater than or equal to 20 (i.e
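The rule set above can be written down as explicit intervals with open or closed endpoints. The following is a minimal sketch in Python rather than R; the Interval class, the RULES list, and the categorize function are hypothetical names. The fifth rule is assumed to be [20, inf) based on the "greater than or equal to 20" wording, and values matching no rule (for example negatives) map to None.

```python
import math
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    low: float
    high: float
    low_closed: bool
    high_closed: bool

    def contains(self, v: float) -> bool:
        # Use >= / <= on a closed endpoint and > / < on an open one.
        above = v >= self.low if self.low_closed else v > self.low
        below = v <= self.high if self.high_closed else v < self.high
        return above and below


# The position in the list is the category code.
RULES = [
    Interval(0, 0, True, True),            # 0: V == 0
    Interval(0, 10, False, False),         # 1: (0, 10)
    Interval(10, 10, True, True),          # 2: [10, 10]
    Interval(10, 20, False, False),        # 3: (10, 20)
    Interval(20, math.inf, True, False),   # 4: [20, inf), assumed
]


def categorize(v: float, rules=RULES) -> Optional[int]:
    """Return the code of the first interval containing v, or None."""
    for code, interval in enumerate(rules):
        if interval.contains(v):
            return code
    return None


codes = [categorize(v) for v in [0, 5, 10, 15, 20, 100, -1]]
```

Keeping the endpoints' open or closed status in the data, rather than in the comparison code, lets single-valued and open-ended intervals share one code path.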

How can I ensure that a partition has representative observations from each level of a factor?

放肆的年华 submitted on 2019-12-17 19:26:00
Question: I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable?

    test.df <- data.frame(a = sample(c(0,1),100, rep = T), b = factor(sample
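One common way to guarantee coverage is to split within each factor level instead of across the whole dataset. Below is a minimal Python sketch of that idea, not the asker's partition() function (which is cut off above): the name stratified_partition, the 0.7 default, and the toy rows are assumptions, and the guarantee is applied to the training set only.

```python
import random
from collections import defaultdict


def stratified_partition(rows, factor_key, train_fraction=0.7, seed=None):
    """Split row indexes into train/test so every factor level is in train.

    rows: list of dicts; factor_key: name of the factor column.
    Returns two sorted lists of row indexes.
    """
    rng = random.Random(seed)

    # Group row indexes by factor level.
    by_level = defaultdict(list)
    for index, row in enumerate(rows):
        by_level[row[factor_key]].append(index)

    train, test = [], []
    for indexes in by_level.values():
        rng.shuffle(indexes)
        # Keep at least one observation of each level for training.
        n_train = max(1, round(len(indexes) * train_fraction))
        train.extend(indexes[:n_train])
        test.extend(indexes[n_train:])
    return sorted(train), sorted(test)


# Toy data (hypothetical): level "c" occurs only once.
rows = [{"b": level} for level in "a" * 10 + "b" * 3 + "c"]
train_idx, test_idx = stratified_partition(rows, "b", seed=42)
```

A level with a single observation ends up in the training set only, so the test set may still lack some levels; whether that is acceptable depends on how the validation step is written.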

Generating Multiple Plots in ggplot by Factor

为君一笑 submitted on 2019-12-17 18:39:50
Question: I have a data set for which I want to generate multiple plots based on one of the columns. That is, I want to use ggplot to make a separate plot for each level of that factor. Here's some quick sample data:

    Variety = as.factor(c("a","b","a","b","a","b","a","b","a","b"))
    Var1 = runif(10)
    Var2 = runif(10)
    mydata = as.data.frame(cbind(Variety,Var1,Var2))

I'd like to generate two separate plots of Var1 over Var2, one for Variety A, a second for Variety B, preferably in a single command

Issue with OneHotEncoder for categorical features

强颜欢笑 submitted on 2019-12-17 15:53:43
Question: I want to encode 3 categorical features out of the 10 features in my dataset. I use the preprocessing module from sklearn to do so, as follows:

    from sklearn import preprocessing
    cat_features = ['color', 'director_name', 'actor_2_name']
    enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
    enc.fit(dataset.values)

However, I couldn't proceed because I am getting this error:

    array = np.array(array, dtype=dtype, order=order, copy=copy)
    ValueError: could not convert string to float:

Create dummies from column with multiple values in pandas

帅比萌擦擦* submitted on 2019-12-17 15:34:11
Question: I am looking for a pythonic way to handle the following problem. The pandas.get_dummies() method is great for creating dummies from a categorical column of a dataframe. For example, if the column has values in ['A', 'B'], get_dummies() creates 2 dummy variables and assigns 0 or 1 accordingly. Now, I need to handle this situation: a single column, let's call it 'label', has values like ['A', 'B', 'C', 'D', 'A*C', 'C*D']. get_dummies() creates 6 dummies, but I only want 4 of them, so that a
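The underlying operation is to split each label on the separator and set one indicator per component. Here is a dependency-free sketch of that logic; the function name multi_value_dummies is hypothetical, and it assumes the question (cut off above) wants a combined label such as 'A*C' to set both the A and C dummies. In pandas itself, Series.str.get_dummies(sep='*') performs this kind of split-and-encode and is probably closer to what the asker needs.

```python
def multi_value_dummies(labels, sep="*"):
    """Build 0/1 indicator rows from labels that may combine several values.

    Returns (columns, table): the sorted distinct components and, for each
    label, a row with 1 where that component is present.
    """
    parts = [label.split(sep) for label in labels]
    columns = sorted({component for row in parts for component in row})
    table = [[1 if col in row else 0 for col in columns] for row in parts]
    return columns, table


labels = ["A", "B", "C", "D", "A*C", "C*D"]
columns, table = multi_value_dummies(labels)
```

For the six labels in the question this yields four columns, with the row for 'A*C' marking both A and C.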