categorical-data

Pandas ordered categorical data on exam grades 'D',…,'A+'

北城余情 posted on 2019-12-24 00:45:41
Question: I have the following data in pandas. I was surprised that the output was D+ A when I was expecting A+ D. Can someone explain, please?
    df = pd.DataFrame(['A+','A','A-','B+','B','B-','C+','C','C-','D+','D'],
                      index=['excellent','excellent','excellent','good','good','good','ok','ok','ok','poor','poor'])
    df.rename(columns={0: 'Grades'}, inplace=True)
    grades = df['Grades'].astype('category',
                                 categories=['D','D+','C-','C','C+','B-','B','B+','A-','A','A+'],
                                 ordered=True)
    print(max(grades), min(grades))
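One likely explanation, sketched here under the assumption of a recent pandas release: Python's built-in max() and min() iterate over the Series and compare the grade labels as plain strings, so the declared category order is ignored ('D+' sorts last and 'A' sorts first lexicographically). The Series methods .max() and .min() do respect the order. The sketch builds the dtype with CategoricalDtype because recent pandas releases no longer accept categories= and ordered= as keyword arguments to astype.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Grade order from the question, lowest to highest
order = ['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
grade_dtype = CategoricalDtype(categories=order, ordered=True)

df = pd.DataFrame(
    {'Grades': ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D']},
    index=['excellent'] * 3 + ['good'] * 3 + ['ok'] * 3 + ['poor'] * 2,
)
grades = df['Grades'].astype(grade_dtype)

# Built-ins compare the labels as ordinary strings
builtin_max, builtin_min = max(grades), min(grades)

# Series methods use the ordered categories
cat_max, cat_min = grades.max(), grades.min()

print(builtin_max, builtin_min)  # D+ A
print(cat_max, cat_min)          # A+ D
```

Sorting with grades.sort_values() follows the category order in the same way.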

How to use formula in R to exclude main effect but retain interaction

北城以北 posted on 2019-12-23 09:37:56
Question: I do not want the main effect because it is collinear with a finer factor fixed effect, so it is annoying to have these NA. In this example:
    lm(y ~ x * z)
I want the interaction of x (numeric) and z (factor), but not the main effect of z.
Answer 1: Introduction. The R documentation for ?formula says:
    The ‘*’ operator denotes factor crossing: ‘a * b’ interpreted as ‘a + b + a : b’
So it sounds like dropping the main effect is straightforward, by just doing one of the following:
    a + a:b  ## main effect on `b

Categorical and ordinal feature data representation in regression analysis?

孤街醉人 posted on 2019-12-23 05:11:05
Question: I am trying to fully understand the difference between categorical and ordinal data when doing regression analysis. For now, what is clear:
Categorical feature and data example: Color: red, white, black
Why categorical: red < white < black is logically incorrect
Ordinal feature and data example: Condition: old, renovated, new
Why ordinal: old < renovated < new is logically correct
Categorical-to-numeric and ordinal-to-numeric encoding methods:
One-Hot encoding for categorical data
Arbitrary
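The two encodings named above can be sketched with pandas as follows. The toy rows and the integer ranks 0, 1, 2 are assumptions made for illustration; only the feature names and levels come from the question.

```python
import pandas as pd

# Toy rows (made up) using the question's two features
df = pd.DataFrame({
    'Color': ['red', 'white', 'black', 'white'],
    'Condition': ['old', 'new', 'renovated', 'old'],
})

# Nominal feature: one indicator column per level, no order implied
color_dummies = pd.get_dummies(df['Color'], prefix='Color', dtype=int)

# Ordinal feature: integers that preserve old < renovated < new.
# The spacing between ranks is an arbitrary choice.
condition_rank = {'old': 0, 'renovated': 1, 'new': 2}
condition_encoded = df['Condition'].map(condition_rank)

encoded = pd.concat([color_dummies, condition_encoded], axis=1)
print(encoded)
```

For a linear regression with an intercept, one indicator column is usually dropped (get_dummies has a drop_first option) to avoid perfect collinearity.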

Convert text to int64 categorical in Pandas

末鹿安然 posted on 2019-12-22 18:47:35
Question: I have some artist names in data['artist'] that I would like to convert to a categorical column via:
    x = data['artist'].astype('category').cat.codes
    x.dtype
Returns:
    dtype('int32')
I am getting negative numbers, which suggests some sort of overflow situation. So I'd like to use np.int64 instead, but I can't find documentation on how to accomplish this.
    x = data['artist'].astype('category').cat.codes.astype(np.int64)
    x.dtype
Gives dtype('int64'), but it is clear that the int32 gets converted to
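A small sketch of what may be happening, using made-up artist names: pandas assigns the code -1 to missing values, so negative codes more likely point to NaN entries than to overflow. The integer width of the codes is chosen from the number of categories, which is why a tiny example yields int8 while the question's data yields int32.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing artist name
artists = pd.Series(['Adele', None, 'Beck', 'Adele'])

codes = artists.astype('category').cat.codes
print(codes.tolist())  # [0, -1, 1, 0]; -1 marks the missing value
print(codes.dtype)     # smallest signed integer type that fits the categories

# Widening the dtype keeps the same values, including the -1 sentinel
codes64 = codes.astype(np.int64)
print(codes64.dtype)
```

Counting (codes == -1).sum() against artists.isna().sum() is a quick way to check whether the negative values are just missing names.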

Matplotlib: how to plot a line with categorical data on the x-axis?

99封情书 posted on 2019-12-22 08:13:07
Question: I am trying to plot a few lines (not a bar plot, as in this case). My y values are floats, whereas my x values are categorical data. How can I do this in matplotlib?
My values:
    data1=[5.65,7.61,8.17,7.60,9.54]
    data2=[7.61,16.17,16.18,19.54,19.81]
    data3=[29.55,30.24,31.51,36.40,35.47]
My categories:
    x_axis=['A','B','C','D','E']
The code I am using, which does not give me what I want:
    import matplotlib.pyplot as plt
    fig=plt.figure() #Creates a new figure
    ax1=fig.add_subplot(111) #Plot with: 1 row,
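One common approach, given here as a minimal sketch rather than the asker's final code: plot against integer positions and then relabel the ticks with the category names. The Agg backend is an assumption so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen plots
import matplotlib.pyplot as plt

data1 = [5.65, 7.61, 8.17, 7.60, 9.54]
data2 = [7.61, 16.17, 16.18, 19.54, 19.81]
data3 = [29.55, 30.24, 31.51, 36.40, 35.47]
x_axis = ['A', 'B', 'C', 'D', 'E']

# One integer position per category
positions = list(range(len(x_axis)))

fig, ax1 = plt.subplots()
for label, series in [('data1', data1), ('data2', data2), ('data3', data3)]:
    ax1.plot(positions, series, marker='o', label=label)

# Replace the numeric tick labels with the category names
ax1.set_xticks(positions)
ax1.set_xticklabels(x_axis)
ax1.legend()
```

Recent matplotlib versions also accept the list of strings directly as the x values, for example ax1.plot(x_axis, data1). Call plt.show() or fig.savefig(...) to view the result.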

How to encode categorical features in sklearn?

流过昼夜 posted on 2019-12-21 05:29:10
Question: I have a dataset with 41 features (columns 0 to 40), of which 7 are categorical. This categorical set is divided into two subsets:
A subset of string type (column-features 1, 2, 3)
A subset of int type, in binary form 0 or 1 (column-features 6, 11, 20, 21)
Furthermore, column-features 1, 2 and 3 (of string type) have cardinality 3, 66 and 11 respectively. In this context I have to encode them to use a support vector machine algorithm. This is the code that I have:
    import numpy as np
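A possible way to encode only the string columns, sketched with scikit-learn's ColumnTransformer and OneHotEncoder. The column names and values below are hypothetical stand-ins for the question's dataset; the binary and numeric columns are passed through unchanged.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the dataset: two string features,
# one binary feature, one numeric feature
X = pd.DataFrame({
    'protocol': ['tcp', 'udp', 'icmp', 'tcp'],  # string, cardinality 3
    'flag': ['SF', 'REJ', 'SF', 'SF'],          # string, cardinality 2
    'logged_in': [1, 0, 0, 1],                  # already 0/1, left as-is
    'duration': [0.5, 2.0, 0.1, 3.2],           # numeric, left as-is
})

encoder = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['protocol', 'flag']),
    ],
    remainder='passthrough',  # keep the remaining columns untouched
    sparse_threshold=0,       # always return a dense array
)

X_encoded = encoder.fit_transform(X)
print(X_encoded.shape)  # (4, 7): 3 + 2 indicator columns plus 2 passthrough columns
```

Because support vector machines are sensitive to feature scale, the numeric columns would typically also be scaled before fitting; that step is left out here.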

how to check for correlation among continuous and categorical variables in python?

南笙酒味 posted on 2019-12-21 04:32:39
Question: I have a dataset including categorical (binary) variables and continuous variables. I'm trying to apply a linear regression model to predict a continuous variable. Can someone please let me know how to check for correlation between the categorical variables and the continuous target variable?
Current code:
    import pandas as pd
    df_hosp = pd.read_csv(r'C:\Users\LAPPY-2\Desktop\LengthOfStay.csv')
    data = df_hosp[['lengthofstay', 'male', 'female', 'dialysisrenalendstage', 'asthma', \
                    'irondef',
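For binary variables coded as 0/1, one option is the point-biserial correlation, which is numerically the same as the Pearson correlation between the 0/1 column and the continuous target. The sketch below uses a few column names from the question with made-up values, since the real CSV is not available.

```python
import pandas as pd

# Made-up values; only the column names come from the question
df_hosp = pd.DataFrame({
    'lengthofstay': [3, 7, 4, 8, 2, 6],
    'male':         [0, 1, 0, 1, 0, 1],
    'asthma':       [0, 0, 1, 1, 0, 0],
})

# Pearson correlation of each binary column with the continuous target
binary_cols = ['male', 'asthma']
correlations = df_hosp[binary_cols].corrwith(df_hosp['lengthofstay'])
print(correlations)
```

A group comparison such as df_hosp.groupby('male')['lengthofstay'].mean() gives a complementary view of the same relationship.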

Any way to get mappings of a label encoder in Python pandas?

假装没事ソ posted on 2019-12-20 16:56:51
Question: I am converting strings to categorical values in my dataset using the following piece of code.
    data['weekday'] = pd.Categorical.from_array(data.weekday).labels
For example:
    index  weekday
    0      Sunday
    1      Sunday
    2      Wednesday
    3      Monday
    4      Monday
    5      Thursday
    6      Tuesday
After encoding the weekday, my dataset appears like this:
    index  weekday
    0      3
    1      3
    2      6
    3      1
    4      1
    5      4
    6      5
Is there any way I can know that Sunday has been mapped to 3, Wednesday to 6 and so on?
Answer 1: The best way of doing this can be to use label
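A pandas-only sketch of recovering the mapping (not necessarily what the truncated answer goes on to propose): the position of each label in the categorical's categories is its code. The sketch uses pd.Categorical(...).codes, the current spelling of the older from_array(...).labels call, and assumes the categories are the seven weekday names in alphabetical order, which reproduces the codes shown in the question.

```python
import pandas as pd

data = pd.DataFrame({
    'weekday': ['Sunday', 'Sunday', 'Wednesday', 'Monday',
                'Monday', 'Thursday', 'Tuesday'],
})

# Assumption: the full dataset contains all seven days, sorted alphabetically
all_days = sorted(['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                   'Friday', 'Saturday', 'Sunday'])
weekday_cat = pd.Categorical(data['weekday'], categories=all_days)

data['weekday_code'] = weekday_cat.codes

# Code -> label and label -> code lookups
code_to_label = dict(enumerate(weekday_cat.categories))
label_to_code = {label: code for code, label in code_to_label.items()}

print(data)
print(label_to_code)
```

Here label_to_code['Sunday'] is 3 and label_to_code['Wednesday'] is 6, matching the encoded column in the question.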

In gbm multinomial dist, how to use predict to get categorical output? [duplicate]

☆樱花仙子☆ posted on 2019-12-20 12:31:09
Question: This question already has answers here: GBM multinomial distribution, how to use predict() to get predicted class? (2 answers). Closed 4 years ago.
My response is a categorical variable (some letters), so I used distribution='multinomial' when making the model, and now I want to predict the response and obtain the output in terms of these letters, instead of a matrix of probabilities. However, predict(model, newdata, type='response') gives probabilities, same as the result of type=
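The general idea for turning a probability matrix into class labels is to pick, for each row, the class with the largest probability. The sketch below shows that step in NumPy rather than R; it is not gbm's API, and the labels and probabilities are made up.

```python
import numpy as np

# Hypothetical class labels, one per column of the probability matrix
class_labels = np.array(['a', 'b', 'c'])

# Hypothetical predicted probabilities: one row per observation
probabilities = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
])

# Index of the most probable class in each row, mapped back to its label
predicted = class_labels[np.argmax(probabilities, axis=1)]
print(predicted.tolist())  # ['a', 'c', 'b']
```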