categorical-data

How to deal with this logic in pandas

∥☆過路亽.° submitted on 2020-05-09 06:05:27
Question: I have a data frame like the one below.
      country    flag
    0 China      red
    1 Russia     green
    2 China      yellow
    3 Britain    yellow
    4 Russia     green
    ......................
In df['country'] there are many different country names. I want to encode the first country to appear as 1, the second as 2, and so on. The flag column follows the same logic, so the result is:
      country    flag
    0 1          1
    1 2          2
    2 1          3
    3 3          3
    4 2          2
But I don't know how to achieve this logic in Python. Thank you. Moreover when I get the result data frame, I want to have an
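One possible approach to the first-appearance numbering described above is `pandas.factorize`, which assigns integer codes in order of first occurrence (starting at 0, so 1 is added here). This is only a sketch built on the five sample rows from the question; the variable name `encoded` is introduced for illustration.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "country": ["China", "Russia", "China", "Britain", "Russia"],
    "flag": ["red", "green", "yellow", "yellow", "green"],
})

# factorize() returns (codes, uniques); codes follow order of first
# appearance and start at 0, so add 1 to start numbering at 1.
encoded = pd.DataFrame(
    {name: pd.factorize(df[name])[0] + 1 for name in df.columns},
    index=df.index,
)

print(encoded)
```

The second element returned by `pd.factorize` holds the unique values in the same order, so it can serve as a lookup from code back to label if that is needed.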

Scikit-learn LabelEncoder: IndexError: arrays used as indices must be of integer (or boolean) type

安稳与你 submitted on 2020-02-03 01:53:06
Question: I am trying to preprocess the adult data set for classification. I handle the categorical attributes with scikit-learn.
    from sklearn.preprocessing import LabelEncoder
    labelencoder = LabelEncoder()
    X[:,0] = labelencoder.fit_transform(X[:,0])
    labelencoder.classes_
output:
    array(['Federal-gov', 'Local-gov', 'Private', 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'], dtype=object)
new content:
    X[:3]
    array([[5, 'Bachelors', 'Under-Graduate', 'Never-married', 'Adm-clerical',

Use cut to create 24 categories for a time variable

爷,独闯天下 submitted on 2020-01-30 10:58:46
Question: Here I import the data and do some manipulations on it (this is likely not where the issue or fix lies). The first two lines set the parameters for my cut.
    lab_var_num <- (0:24)
    times_var <- c(0,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,2200,2300,2400,2500)
    all_files_ls <- read_csv("~/Desktop/bioinformatic_work/log_parse_files/sorted_by_habitat/all_trap/all_files_la_selva_log.csv")
    #Eliminate bad data and capture in separate dataframe-
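For comparison, the same binning idea can be sketched in pandas rather than R. The sketch assumes the time variable is stored as HHMM integers (0 to 2359), which the excerpt does not confirm; the sample values and the name `hour_bin` are hypothetical. Note that 24 categories correspond to 25 edges, 0 through 2400.

```python
import pandas as pd

# Hypothetical HHMM-style times; the real data is not shown in the question.
times = pd.Series([0, 59, 100, 1230, 2359])

# 25 edges (0, 100, ..., 2400) define exactly 24 bins.
edges = list(range(0, 2500, 100))
labels = list(range(24))

# right=False makes each bin closed on the left: [0, 100), [100, 200), ...
hour_bin = pd.cut(times, bins=edges, labels=labels, right=False)

print(hour_bin.tolist())
```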

Extract Formula from lm including Categorical Variables (R)

倖福魔咒の submitted on 2020-01-25 09:51:09
Question: I have an lm object and want to extract its formula with the coefficients filled in. The object includes categorical variables such as month, as well as interactions between these categorical variables and numeric ones. Another user helped with some code that works for everything except the categorical variables; when I add a categorical variable (e.g. d here) it breaks down and gives the error "Error in parse(text = x) : :1:785: unexpected numeric constant": a = c(1, 2, 5, 13, 40, 29, 82, 22, 34, 54, 12, 31

Extract Formula from lm including Categorical Variables (R)

♀尐吖头ヾ submitted on 2020-01-25 09:51:05
Question: I have an lm object and want to extract its formula with the coefficients filled in. The object includes categorical variables such as month, as well as interactions between these categorical variables and numeric ones. Another user helped with some code that works for everything except the categorical variables; when I add a categorical variable (e.g. d here) it breaks down and gives the error "Error in parse(text = x) : :1:785: unexpected numeric constant": a = c(1, 2, 5, 13, 40, 29, 82, 22, 34, 54, 12, 31

Handle unseen categorical string Spark CountVectorizer

只谈情不闲聊 submitted on 2020-01-24 16:32:21
Question: I have seen that StringIndexer has problems with unseen labels (see here). My questions are: Does CountVectorizer have the same limitation? How does it treat a string not in the vocabulary? Moreover, is the vocabulary size affected by the input data, or is it fixed according to the vocabulary size parameter? Last, from an ML point of view, assuming a simple classifier such as Logistic Regression, shouldn't an unseen category be encoded as a row of zeros so as to be treated as "unknown" so to get some sort

pandas cut(): how to convert nans? Or to convert the output to non-categorical?

房东的猫 submitted on 2020-01-24 07:57:48
Question: I am using pandas.cut() on dataframe columns that contain NaNs. I need to run groupby on the output of pandas.cut(), so I need to convert the NaNs to something else (in the output, not in the input data), otherwise groupby will stupidly and infuriatingly ignore them. I understand that cut() now outputs categorical data, but I cannot find a way to add a category to the output. I have tried add_categories(), which runs with no warnings or errors, but doesn't work because the categories are not added and,
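A minimal sketch of one way to keep the NaN rows visible to groupby: add an extra category to the output of `pd.cut` and fill the missing values with it. The column name `x`, the bin edges, the string labels, and the `"missing"` category are all assumptions made for illustration. `Series.cat.add_categories` returns a new Series rather than modifying the original, so its result has to be assigned.

```python
import numpy as np
import pandas as pd

# Hypothetical input column containing NaNs.
df = pd.DataFrame({"x": [1.0, 5.0, np.nan, 9.0, np.nan]})

# NaN inputs come out of cut() as NaN in the categorical result.
binned = pd.cut(df["x"], bins=[0, 4, 8, 12], labels=["low", "mid", "high"])

# add_categories() returns a new Series; assign the result, then fill
# the NaNs with the newly added category.
df["bin"] = binned.cat.add_categories("missing").fillna("missing")

# The former NaN rows now form their own group.
counts = df.groupby("bin", observed=False).size()
print(counts)
```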

Reveal k-modes cluster features

五迷三道 submitted on 2020-01-22 06:00:46
Question: I'm performing a cluster analysis on categorical data, hence using the k-modes approach. My data comes from a preference survey: How do you like hair and eyes? The respondent can pick an answer from a fixed (multiple-choice) set of 4 possibilities. I therefore get the dummies, apply k-modes, attach the clusters back to the initial df and then plot them in 2D with PCA. My code looks like:
    import numpy as np
    import pandas as pd
    from kmodes import kmodes
    df_dummy = pd.get_dummies(df) #transform

In pandas crosstab, how to calculate weighted averages? And how to add row and column totals?

℡╲_俬逩灬. submitted on 2020-01-15 09:14:27
Question: I have a pandas dataframe with two categorical variables (in my example, city and colour), a column with percentages, and one with weights. I want to do a crosstab of city and colour, showing, for each combination of the two, the weighted average of perc. I have managed to do it with the code below, where I first create a column with weights x perc, then one crosstab with the sum of (weights x perc), another crosstab with the sum of weights, then finally divide the first by the second. It
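The two-crosstab approach described above can be sketched as follows. The column names city, colour, perc and weights come from the question; the sample values and the helper name `weighted_crosstab` are made up for illustration. Passing `margins=True` to both crosstabs adds an "All" row and column, so after the division those margins hold the weighted averages over each row, each column, and the whole table.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "A"],
    "colour": ["red", "blue", "red", "red", "red"],
    "perc": [0.10, 0.20, 0.30, 0.50, 0.40],
    "weights": [1.0, 2.0, 1.0, 3.0, 3.0],
})

def weighted_crosstab(frame, row, col, value, weight):
    """Weighted average of `value` per (row, col) cell, with margins."""
    weighted = frame[value] * frame[weight]
    # Sum of weight * value per cell, plus "All" totals.
    numerator = pd.crosstab(frame[row], frame[col], values=weighted,
                            aggfunc="sum", margins=True)
    # Sum of weights per cell, plus "All" totals.
    denominator = pd.crosstab(frame[row], frame[col], values=frame[weight],
                              aggfunc="sum", margins=True)
    return numerator / denominator

result = weighted_crosstab(df, "city", "colour", "perc", "weights")
print(result)
```

Cells with no observations (city B with colour blue in this sample) come out as NaN.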

How to make an average of a variable assigned to individuals within a category?

妖精的绣舞 submitted on 2020-01-15 07:18:11
Question: I have a big data set which could be represented something like this:
    plot 1 2 3 3 3 4 4 5 5 5 5 6 7
    fate S M S S M S S S M S S M M
where plot is a location, and fate is either "survivorship" or "mortality" (a plant lives or dies). The plot number of a plant corresponds to the fate under it. Thus in plot 5 there are 4 plants: 3 of them survive, 1 dies. I want to figure out a way to make R calculate the fraction of individuals that survive in each plot for all of these. It is proving very
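The question asks about R; as an illustration of the same per-group calculation, here is a sketch in pandas using the 13 plants listed above. The mean of a boolean "survived" indicator within each plot is the fraction of survivors. The name `survival` is introduced for illustration.

```python
import pandas as pd

# The plot/fate pairs exactly as listed in the question.
df = pd.DataFrame({
    "plot": [1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 7],
    "fate": list("SMSSMSSSMSSMM"),
})

# True where the plant survived; the group mean of a boolean column
# is the fraction of True values in that group.
survival = (df["fate"] == "S").groupby(df["plot"]).mean()

print(survival)
```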