categorical-data | 易学教程

How to deal with this logic in pandas

Question: I have a data frame like the following:

      country  flag
    0 China    red
    1 Russia   green
    2 China    yellow
    3 Britain  yellow
    4 Russia   green
    ......................

In df['country'] you can see many different country names. I want to encode the first country that appears as 1, the second as 2, and so on. The flag column follows the same logic, so the result is:

      country  flag
    0 1        1
    1 2        2
    2 1        3
    3 3        3
    4 2        2

But I don't know how to achieve this logic in Python. Thank you. Moreover, when I get the result data frame, I want to have an
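
One possible approach, as a minimal sketch: pandas.factorize numbers values in order of first appearance starting from 0, so adding 1 gives the encoding described above. The sample frame below contains only the five rows visible in the question.

```python
import pandas as pd

# The five rows shown in the question.
df = pd.DataFrame({
    "country": ["China", "Russia", "China", "Britain", "Russia"],
    "flag": ["red", "green", "yellow", "yellow", "green"],
})

# factorize() returns (codes, uniques); codes start at 0 in order of
# first appearance, so shift them by 1 to start at 1.
encoded = df.apply(lambda col: pd.factorize(col)[0] + 1)

print(encoded)
# country -> 1, 2, 1, 3, 2
# flag    -> 1, 2, 3, 3, 2
```

The second element returned by factorize holds the unique values in the same order, which can be kept if the codes need to be mapped back to names later.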

Scikit-learn LabelEncoder: IndexError: arrays used as indices must be of integer (or boolean) type

Question: I am trying to preprocess the Adult data set in order to do classification. I handle the categorical attributes with scikit-learn:

    from sklearn.preprocessing import LabelEncoder
    labelencoder = LabelEncoder()
    X[:,0] = labelencoder.fit_transform(X[:,0])
    labelencoder.classes_

Output:

    array(['Federal-gov', 'Local-gov', 'Private', 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'], dtype=object)

New content of X:

    X[:3]
    array([[5, 'Bachelors', 'Under-Graduate', 'Never-married', 'Adm-clerical',
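
The excerpt is cut off before the error appears, so the following does not diagnose it. It is only a sketch of one common pattern: fit a separate LabelEncoder per categorical column and keep each fitted encoder so its codes can be decoded later. The tiny array and the `encoders` dict are illustrative assumptions, not the asker's data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for two categorical columns of the data set.
X = np.array([
    ["State-gov", "Bachelors"],
    ["Private", "HS-grad"],
    ["Private", "Bachelors"],
], dtype=object)

encoders = {}          # column index -> fitted LabelEncoder (assumed helper)
X_encoded = X.copy()   # leave the original strings untouched

for col in range(X.shape[1]):
    enc = LabelEncoder()
    # Each column gets its own encoder; classes_ is sorted alphabetically.
    X_encoded[:, col] = enc.fit_transform(X[:, col])
    encoders[col] = enc

# Decode with the encoder that was fitted on that same column.
decoded = encoders[0].inverse_transform(X_encoded[:, 0].astype(int))
print(X_encoded[:, 0], decoded)
```

Casting the codes to int before calling inverse_transform keeps the index array numeric even though the enclosing array has dtype object.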

Use cut to create 24 categories for a time variable

Question: Here I import the data and do some manipulations on it (this is likely not where the issue/fix lies). The first two lines set the parameters for my cut.

    lab_var_num <- (0:24)
    times_var <- c(0,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,2200,2300,2400,2500)
    all_files_ls <- read_csv("~/Desktop/bioinformatic_work/log_parse_files/sorted_by_habitat/all_trap/all_files_la_selva_log.csv")
    #Eliminate bad data and capture in separate dataframe-

Extract Formula from lm including Categorical Variables (R)

Question: I have an lm object and want to extract its formula together with the coefficients. The object includes categorical variables such as month, as well as interactions between these categorical variables and numeric ones. Another user helped with some code that works for everything except the categorical variables; when I add a categorical variable (e.g. d here), it breaks down and gives the error "Error in parse(text = x) : :1:785: unexpected numeric constant":

    a = c(1, 2, 5, 13, 40, 29, 82, 22, 34, 54, 12, 31

Handle unseen categorical string Spark CountVectorizer

Question: I have seen that StringIndexer has problems with unseen labels (see here). My questions are: Does CountVectorizer have the same limitation? How does it treat a string that is not in the vocabulary? Moreover, is the vocabulary size affected by the input data, or is it fixed according to the vocabulary size parameter? Last, from an ML point of view, assuming a simple classifier such as Logistic Regression, shouldn't an unseen category be encoded as a row of zeros so as to be treated as "unknown" so to get some sort

pandas cut(): how to convert nans? Or to convert the output to non-categorical?

Question: I am using pandas.cut() on dataframe columns that contain NaNs. I need to run groupby on the output of pandas.cut(), so I need to convert the NaNs to something else (in the output, not in the input data), otherwise groupby will stupidly and infuriatingly ignore them. I understand that cut() now outputs categorical data, but I cannot find a way to add a category to the output. I have tried add_categories(), which runs with no warnings or errors, but doesn't work because the categories are not added and,
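
A minimal sketch of one way this can be done, assuming the binned column is a Series: Series.cat.add_categories returns a new object rather than modifying in place, so its result has to be assigned back before fillna can use the new label. The bin edges, the labels, and the "missing" label are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 5.0, np.nan, 12.0, np.nan])

# cut() leaves NaN inputs as NaN in the categorical output.
binned = pd.cut(values, bins=[0, 4, 8, 16], labels=["low", "mid", "high"])

# add_categories() returns a new Series, so assign the result back;
# only then is "missing" a legal value for fillna().
binned = binned.cat.add_categories("missing").fillna("missing")

# The former NaN rows now form their own group.
counts = values.groupby(binned, observed=False).size()
print(counts)
# low 1, mid 1, high 1, missing 2
```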

Reveal k-modes cluster features

Question: I'm performing a cluster analysis on categorical data, hence using the k-modes approach. My data is shaped like a preference survey: How do you like hair and eyes? The respondent can pick an answer from a fixed (multiple-choice) set of 4 possibilities. I therefore get the dummies, apply k-modes, attach the clusters back to the initial df and then plot them in 2D with PCA. My code looks like:

    import numpy as np
    import pandas as pd
    from kmodes import kmodes
    df_dummy = pd.get_dummies(df) #transform

In pandas crosstab, how to calculate weighted averages? And how to add row and column totals?

Question: I have a pandas dataframe with two categorical variables (in my example, city and colour), a column with percentages, and one with weights. I want to do a crosstab of city and colour, showing, for each combination of the two, the weighted average of perc. I have managed to do it with the code below, where I first create a column with weights x perc, then one crosstab with the sum of (weights x perc), another crosstab with the sum of weights, then finally divide the first by the second. It
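
The asker's own code is not included in the excerpt. The sketch below only illustrates the approach described above (sum of weights x perc divided by sum of weights), with margins=True added so both crosstabs carry "All" totals before the division. The sample data and the column names `weight` and `wp` are assumptions.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question's wording.
df = pd.DataFrame({
    "city": ["London", "London", "Paris", "Paris", "Paris"],
    "colour": ["red", "blue", "red", "red", "blue"],
    "perc": [0.10, 0.20, 0.30, 0.50, 0.40],
    "weight": [1, 3, 2, 2, 5],
})
df["wp"] = df["perc"] * df["weight"]  # weights x perc

# Numerator: sum of weights x perc per cell, plus "All" row/column totals.
num = pd.crosstab(df["city"], df["colour"], values=df["wp"],
                  aggfunc="sum", margins=True)
# Denominator: sum of weights per cell, with the same margins.
den = pd.crosstab(df["city"], df["colour"], values=df["weight"],
                  aggfunc="sum", margins=True)

# Cell-wise division gives the weighted average, including in the margins.
weighted_avg = num / den
print(weighted_avg)
```

Because the margins are sums in both tables, the "All" entries of the quotient are weighted averages over the whole row or column rather than averages of the cell averages.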

How to make an average of a variable assigned to individuals within a category?

Question: I have a big data set which could be represented something like this:

    plot 1 2 3 3 3 4 4 5 5 5 5 6 7
    fate S M S S M S S S M S S M M

where plot is a location, and fate is either "survivorship" or "mortality" (a plant lives or dies). The plot number of a plant corresponds to the fate under it. Thus in plot 5 there are 4 plants: 3 of them survive, 1 dies. I want to figure out a way to make R calculate the fraction of individuals that survive in each plot for all of these. It is proving very