categorical-data

Handling different Factor Levels in Train and Test data

寵の児 submitted on 2020-08-24 14:56:58
Question: I have a training data set of 20 columns, all of which are factors that I must use to train a model. I have also been given a test data set to which I have to apply the model for predictions and submit the results. During initial data exploration, out of curiosity I checked the factor levels of the training and test data, since we are dealing entirely with categorical variables. To my dismay, most of the variables have different levels in the training and test data sets. For example:
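The question is about R, where the usual idiom is to rebuild the test factor on the training levels, e.g. `test$x <- factor(test$x, levels = levels(train$x))`, which maps unseen levels to NA. For readers working in Python, a minimal pandas sketch of the same idea (the `city` column is a made-up example): lock the test column to the training categories so unseen values surface as NaN.

```python
import pandas as pd

train = pd.DataFrame({"city": ["NY", "LA", "NY", "SF"]})
test = pd.DataFrame({"city": ["LA", "Boston"]})  # "Boston" never seen in training

# Fix the test column to the training categories; unseen values become NaN,
# which can then be imputed or mapped to an explicit "unknown" level.
train["city"] = train["city"].astype("category")
test["city"] = pd.Categorical(test["city"],
                              categories=train["city"].cat.categories)

print(test["city"].isna().sum())  # 1 -- the unseen "Boston"
```

The key point in either language: the encoder's vocabulary must come from the training data only, and the test data must be coerced onto it rather than the other way around.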

read.csv doesn't seem to detect factors in R 4.0.0

落花浮王杯 submitted on 2020-07-25 06:28:16
Question: I've recently updated to R 4.0.0 from R 3.5.1. The behaviour of read.csv seems to have changed: when I load .csv files in R 4.0.0, factors are not automatically detected and are instead read in as characters. I'm also still running 3.5.1 on my machine, and when loading the same files in 3.5.1 with the same code, factors are recognised as factors. This is somewhat suboptimal. Any suggestions? I'm running Windows 10 Pro and create .csv files in Excel 2013. Answer 1: As Ronak Shah said in a …
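For context, R 4.0.0 changed the default of `stringsAsFactors` from TRUE to FALSE, so `read.csv(file, stringsAsFactors = TRUE)` restores the old behaviour. pandas has always behaved the R 4.0.0 way — `read_csv` never auto-detects categoricals — and the explicit opt-in looks like this (a minimal sketch with made-up inline data):

```python
import pandas as pd
from io import StringIO

csv_data = StringIO("sex,age\nmale,30\nfemale,25\nmale,41\n")

# Like R 4.0.0's read.csv, read_csv loads strings as plain text columns;
# ask for the categorical ("factor") dtype explicitly per column.
df = pd.read_csv(csv_data, dtype={"sex": "category"})

print(df["sex"].dtype)  # category
```

In both ecosystems the design choice is the same: silent type conversion caused enough surprises that the default is now "no conversion unless asked".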

How to reverse Label Encoder from sklearn for multiple columns?

╄→гoц情女王★ submitted on 2020-07-03 05:17:52
Question: I would like to use the inverse_transform function of LabelEncoder on multiple columns. This is the code I use when applying LabelEncoder to more than one column of a dataframe:

```python
class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # array of column names to encode

    def fit(self, X, y=None):
        return self  # not relevant here

    def transform(self, X):
        '''Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms …'''
```
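The snippet above is cut off in the source, but the question it raises — inverting the encoding for several columns — can be answered by keeping one fitted LabelEncoder per column. A minimal sketch, not the asker's original class (the `encoders_` dict and the example column names are assumptions of this sketch):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    """Encode several columns, keeping one LabelEncoder per column
    so that inverse_transform can restore the original labels."""

    def __init__(self, columns=None):
        self.columns = columns  # column names to encode; None = all columns
        self.encoders_ = {}

    def fit(self, X, y=None):
        cols = self.columns if self.columns is not None else X.columns
        for col in cols:
            self.encoders_[col] = LabelEncoder().fit(X[col])
        return self

    def transform(self, X):
        out = X.copy()
        for col, enc in self.encoders_.items():
            out[col] = enc.transform(out[col])
        return out

    def inverse_transform(self, X):
        out = X.copy()
        for col, enc in self.encoders_.items():
            out[col] = enc.inverse_transform(out[col])
        return out

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["S", "M", "S"]})
mcle = MultiColumnLabelEncoder(columns=["color", "size"])
encoded = mcle.fit(df).transform(df)
restored = mcle.inverse_transform(encoded)
print(restored.equals(df))  # True
```

The design point is that `fit` must store the per-column encoders rather than discarding them, since each LabelEncoder remembers the class-to-integer mapping needed for the round trip.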

Categorical features correlation

霸气de小男生 submitted on 2020-05-24 15:53:33
Question: I have some categorical features in my data along with continuous ones. Is it a good idea, or an absolutely bad one, to one-hot encode the categorical features in order to compute their correlation with the labels, alongside the other continuous features? Answer 1: There is a way to calculate a correlation coefficient without one-hot encoding the categorical variable. Cramér's V is one statistic for measuring the association between categorical variables. It can be calculated as follows; the following link is helpful. Using pandas, …
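The answer names Cramér's V but the computation is cut off in the source. A dependency-free sketch of the bias-uncorrected form, V = sqrt(chi² / (n · (min(r, c) − 1))), computed straight from the contingency table:

```python
from collections import Counter
from math import sqrt

def cramers_v(x, y):
    """Cramér's V between two categorical sequences (no one-hot needed).
    Builds the contingency table, computes the chi-squared statistic,
    then normalises it to the [0, 1] range."""
    n = len(x)
    joint = Counter(zip(x, y))          # observed cell counts
    px, py = Counter(x), Counter(y)     # marginal counts
    chi2 = 0.0
    for a, na in px.items():
        for b, nb in py.items():
            expected = na * nb / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (min(len(px), len(py)) - 1)))

print(cramers_v(["a", "a", "b", "b"], ["u", "u", "v", "v"]))  # 1.0 (perfect association)
print(cramers_v(["a", "a", "b", "b"], ["u", "v", "u", "v"]))  # 0.0 (independent)
```

In practice one would use `pandas.crosstab` plus `scipy.stats.chi2_contingency` for the chi-squared step; the hand-rolled loop above just makes the formula explicit.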

Pandas get dummies() for numeric categorical data

你。 submitted on 2020-05-09 06:42:08
Question: I have 2 columns: Sex (categorical values of type string, 'male' and 'female') and Class (categorical values of type integer, 1 to 10). When I execute pd.get_dummies() on these two columns, only 'Sex' is encoded into 2 columns; 'Class' is not converted by the get_dummies function. I want 'Class' to be converted into 10 dummy columns as well, similar to one-hot encoding. Is this expected behaviour? Is there a workaround? Answer 1: You can convert the values to strings: df1 = pd.get …
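This is indeed expected: get_dummies encodes object/category columns by default and leaves plain integer columns alone. Besides casting the integers to strings as the truncated answer suggests, the columns to encode can be named explicitly via the `columns=` parameter. A minimal sketch with a toy two-class version of the data:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "male"],
                   "Class": [1, 2, 1]})

# Without columns=, only "Sex" would be dummy-encoded; listing both
# forces the integer "Class" column to be one-hot encoded too.
dummies = pd.get_dummies(df, columns=["Sex", "Class"])

print(sorted(dummies.columns))
# ['Class_1', 'Class_2', 'Sex_female', 'Sex_male']
```

Casting first (`df["Class"].astype(str)` or `.astype("category")`) achieves the same result and is useful when the encoding call itself cannot be changed.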