categorical-data | 易学教程

Python equivalent of daisy() in the cluster package of R

Question: I have a dataset that contains both categorical (nominal and ordinal) and numerical attributes. I want to calculate the (dis)similarity matrix across my observations using these mixed attributes. Using the daisy() function of the cluster package in R, I can easily get a dissimilarity matrix as follows:

    if (!require("cluster")) { install.packages("cluster"); require("cluster") }
    data(flower)
    as.matrix(daisy(flower, metric = "gower"))

This uses the gower metric to deal with the nominal variables
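As a rough illustration of what a Python counterpart could compute, here is a minimal pure-Python sketch of the Gower dissimilarity for nominal and numeric columns. The function name `gower_matrix` and the sample rows are hypothetical. This is not a drop-in replacement for daisy(): it assumes no missing values and equal weights, and it does not implement the special handling daisy() applies to ordinal or asymmetric binary variables.

```python
def gower_matrix(rows, numeric_cols):
    """Pairwise Gower dissimilarity for rows of mixed values (no missing data).

    rows: list of equal-length tuples.
    numeric_cols: set of column positions to treat as numeric; every other
    column is treated as nominal.
    """
    n_cols = len(rows[0])

    # Range of each numeric column, used to scale differences into [0, 1].
    ranges = {}
    for k in numeric_cols:
        values = [row[k] for row in rows]
        ranges[k] = max(values) - min(values)

    def pair(a, b):
        total = 0.0
        for k in range(n_cols):
            if k in ranges:
                # Numeric: range-normalised absolute difference.
                total += abs(a[k] - b[k]) / ranges[k] if ranges[k] else 0.0
            else:
                # Nominal: simple matching (0 if equal, 1 otherwise).
                total += 0.0 if a[k] == b[k] else 1.0
        return total / n_cols  # equal weight for every column

    return [[pair(a, b) for b in rows] for a in rows]


# Hypothetical sample data: two nominal columns and one numeric column.
rows = [
    ("red", "small", 1.0),
    ("red", "large", 3.0),
    ("blue", "large", 5.0),
]
dist = gower_matrix(rows, numeric_cols={2})
```

Each entry of `dist` is the average per-column dissimilarity, so it always lies between 0 and 1.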

R: Expanding an R factor into dummy columns for every factor level

Question: I have quite a big data frame in R with two columns. I am trying to create dummy variables from the Code column (a factor with 858 levels). The problem is that RStudio always crashes when I try to do that.

    > str(d)
    'data.frame': 649226 obs. of 2 variables:
     $ User: int 210 210 210 210 269 317 317 317 317 326 ...
     $ Code : Factor w/ 858 levels "AA02","AA03",..: 164 494 538 626 464 496 435 464 475 163 ...

The User column is not unique, meaning that there can be several rows

Creating categorical variables from mutually exclusive dummy variables

Question: My question elaborates on a previously answered question about combining multiple dummy variables into a single categorical variable. In that earlier question, the categorical variable was created from dummy variables that were NOT mutually exclusive. In my case, the dummy variables are mutually exclusive because they represent crossed experimental conditions in a 2X2 between-subjects factorial design (that also has a within-subjects component which I'm not addressing here

R: creating a categorical variable from a numerical variable and custom/open-ended/single-valued intervals

Question: I often find myself trying to create a categorical variable from a numerical variable plus a user-provided set of ranges. For instance, say that I have a data.frame with a numeric variable df$V and would like to create a new variable df$VCAT such that:

    df$VCAT = 0 if df$V is equal to 0
    df$VCAT = 1 if df$V is between 0 and 10 (i.e. (0,10))
    df$VCAT = 2 if df$V is equal to 10 (i.e. [10,10])
    df$VCAT = 3 if df$V is between 10 and 20 (i.e. (10,20))
    df$VCAT = 4 if df$V is greater than or equal to 20 (i.e

How can I ensure that a partition has representative observations from each level of a factor?

Question: I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable?

    test.df <- data.frame(a = sample(c(0,1),100, rep = T), b = factor(sample
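The question's partition() function is written in R and is cut off in this excerpt, so the following is only a language-neutral illustration of the underlying idea, sketched in Python: split within each factor level separately, and always send at least one observation per level to the training set. The name `partition_indices` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def partition_indices(levels, train_fraction=0.7, seed=0):
    """Split row indices into train/test so every level appears in train.

    levels: sequence holding the factor level of each row.
    """
    rng = random.Random(seed)

    # Group row indices by factor level.
    by_level = defaultdict(list)
    for idx, level in enumerate(levels):
        by_level[level].append(idx)

    train, test = [], []
    for indices in by_level.values():
        rng.shuffle(indices)
        # Keep at least one observation of this level for training.
        n_train = max(1, round(len(indices) * train_fraction))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)


# Hypothetical factor column with one rare level ("c").
levels = ["a"] * 10 + ["b"] * 5 + ["c"]
train_idx, test_idx = partition_indices(levels)
```

A level with a single observation ends up only in the training set, which avoids the "unseen level" error at validation time but means that level is never tested.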

Generating Multiple Plots in ggplot by Factor

Question: I have a data set for which I want to generate multiple plots based on one of the columns. That is, I want to be able to use ggplot to make a separate plot for each level of that factor. Here's some quick sample data:

    Variety = as.factor(c("a","b","a","b","a","b","a","b","a","b"))
    Var1 = runif(10)
    Var2 = runif(10)
    mydata = as.data.frame(cbind(Variety,Var1,Var2))

I'd like to generate two separate plots of Var1 over Var2, one for Variety A, a second for Variety B, preferably in a single command

Issue with OneHotEncoder for categorical features

Question: I want to encode 3 categorical features out of the 10 features in my dataset. I use preprocessing from sklearn to do so, as follows:

    from sklearn import preprocessing
    cat_features = ['color', 'director_name', 'actor_2_name']
    enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
    enc.fit(dataset.values)

However, I couldn't proceed because I am getting this error:

    array = np.array(array, dtype=dtype, order=order, copy=copy)
    ValueError: could not convert string to float:
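The error message suggests the encoder was trying to convert string columns to floats. One possible way around this, sketched here under the assumption that a scikit-learn release providing `sklearn.compose.ColumnTransformer` is available (the `categorical_features` argument was removed from `OneHotEncoder` in later releases), is to select the categorical columns by name and pass the rest through unchanged. The small `dataset` frame below is a hypothetical stand-in for the asker's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the question's `dataset` (only 4 of its 10 columns).
dataset = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "director_name": ["A", "B", "A"],
    "actor_2_name": ["X", "Y", "Z"],
    "duration": [120, 95, 110],
})
cat_features = ["color", "director_name", "actor_2_name"]

# One-hot encode only the named columns; leave the other columns as they are.
encoder = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_features),
    ],
    remainder="passthrough",
)

# Pass the DataFrame itself (not dataset.values) so columns can be picked by name.
encoded = encoder.fit_transform(dataset)
```

Here `encoded` has one column per distinct category in the three encoded features, followed by the untouched `duration` column.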

Create dummies from column with multiple values in pandas

Question: I am looking for a pythonic way to handle the following problem. The pandas.get_dummies() method is great for creating dummies from a categorical column of a dataframe. For example, if the column has values in ['A', 'B'], get_dummies() creates 2 dummy variables and assigns 0 or 1 accordingly. Now, I need to handle this situation: a single column, let's call it 'label', has values like ['A', 'B', 'C', 'D', 'A*C', 'C*D']. get_dummies() creates 6 dummies, but I only want 4 of them, so that a
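The excerpt is cut off, so the following assumes the goal is for a combined label such as 'A*C' to set both the A and the C dummy. Under that assumption, a minimal sketch uses the pandas string method `Series.str.get_dummies`, which splits each value on a separator before building the indicator columns. The frame `df` is a hypothetical example built from the labels listed in the question.

```python
import pandas as pd

# Hypothetical frame using the labels listed in the question.
df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# Split each label on "*" and create one indicator column per distinct part.
dummies = df["label"].str.get_dummies(sep="*")
```

The result has four columns (A, B, C, D), and the row for 'A*C' has a 1 in both A and C.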