categorical-data | 易学教程

Pandas ordered categorical data on exam grades 'D',…,'A+'


Question: I have the following data in pandas, and I was surprised that the output was:

    D+ A

I was expecting:

    A+ D

Can someone explain, please?

    df = pd.DataFrame(['A+','A','A-','B+','B','B-','C+','C','C-','D+','D'],
                      index=['excellent','excellent','excellent','good','good','good','ok','ok','ok','poor','poor'])
    df.rename(columns={0: 'Grades'}, inplace=True)
    grades = df['Grades'].astype('category',
                                 categories=['D','D+','C-','C','C+','B-','B','B+','A-','A','A+'],
                                 ordered=True)
    print(max(grades), min(grades))
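A minimal sketch of one likely explanation, not taken from the original thread: Python's built-in max() and min() iterate over the Series and compare the raw strings lexicographically, so they ignore the category order, whereas the Series methods .max() and .min() respect it. The sketch builds the dtype with CategoricalDtype because newer pandas releases no longer accept categories= and ordered= as keyword arguments to astype.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Grade order from lowest to highest, copied from the question.
order = ['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']

df = pd.DataFrame(
    {'Grades': ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D']},
    index=['excellent'] * 3 + ['good'] * 3 + ['ok'] * 3 + ['poor'] * 2,
)

# Build an ordered categorical dtype and apply it to the column.
grade_dtype = CategoricalDtype(categories=order, ordered=True)
grades = df['Grades'].astype(grade_dtype)

# Built-ins compare the underlying strings, so the category order is ignored.
builtin_result = (max(grades), min(grades))

# The Series methods use the declared category order.
ordered_result = (grades.max(), grades.min())

print(builtin_result)
print(ordered_result)
```

With the built-ins, 'D+' sorts last and 'A' sorts first as plain strings, which matches the surprising output in the question.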

How to use formula in R to exclude main effect but retain interaction


Question: I do not want the main effect because it is collinear with a finer factor fixed effect, so the resulting NA coefficients are a nuisance. In this example:

    lm(y ~ x * z)

I want the interaction of x (numeric) and z (factor), but not the main effect of z.

Answer 1: Introduction. The R documentation for ?formula says: The ‘*’ operator denotes factor crossing: ‘a * b’ interpreted as ‘a + b + a:b’. So it sounds as if dropping a main effect is straightforward, by doing one of the following: a + a:b ## main effect on `b
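As a rough illustration in Python rather than R, the sketch below builds by hand the kind of design matrix that "interaction without the factor's main effect" implies: an intercept, x, and x multiplied by an indicator for each non-reference level of z. The data, the level names and the slopes are all made up for the example, and this is not the method from the truncated answer above.

```python
import numpy as np

# Hypothetical toy data: numeric x, three-level factor z, noise-free y.
x = np.arange(12, dtype=float)
z = np.array(['a', 'b', 'c'] * 4)
slope_by_level = {'a': 1.0, 'b': 2.0, 'c': 3.0}
y = 0.5 + np.array([slope_by_level[level] for level in z]) * x

# Columns: intercept, x, and x times an indicator for each non-reference level.
# There is no standalone indicator column for z, so z has no main effect.
design = np.column_stack([
    np.ones_like(x),
    x,
    x * (z == 'b'),
    x * (z == 'c'),
])

# Ordinary least squares fit of y on the hand-built design matrix.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coef)
```

The fitted coefficients are the common intercept, the slope for the reference level 'a', and the slope differences for 'b' and 'c'.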

Categorical and ordinal feature data representation in regression analysis?


Question: I am trying to fully understand the difference between categorical and ordinal data when doing regression analysis. For now, this much is clear:

Categorical feature and data example:
    Color: red, white, black
    Why categorical: red < white < black is logically incorrect

Ordinal feature and data example:
    Condition: old, renovated, new
    Why ordinal: old < renovated < new is logically correct

Categorical-to-numeric and ordinal-to-numeric encoding methods:
    One-Hot encoding for categorical data
    Arbitrary
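The two encodings named in the question can be sketched with pandas as follows. The DataFrame, its column names and the integer ranks chosen for Condition are assumptions made for illustration; the ranks only encode the order old < renovated < new and say nothing about the spacing between levels.

```python
import pandas as pd

# Hypothetical toy data using the feature values from the question.
homes = pd.DataFrame({
    'Color': ['red', 'white', 'black', 'red'],
    'Condition': ['old', 'new', 'renovated', 'old'],
})

# Nominal feature: one indicator column per color, no order implied.
color_onehot = pd.get_dummies(homes['Color'], prefix='Color').astype(int)

# Ordinal feature: map each level to an integer rank that preserves the order.
condition_rank = {'old': 0, 'renovated': 1, 'new': 2}
condition_encoded = homes['Condition'].map(condition_rank)

encoded = pd.concat(
    [color_onehot, condition_encoded.rename('Condition_rank')], axis=1
)
print(encoded)
```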

Convert text to int64 categorical in Pandas


Question: I have some artist names in data['artist'] that I would like to convert to a categorical column via:

    x = data['artist'].astype('category').cat.codes
    x.dtype

This returns:

    dtype('int32')

I am getting negative numbers, which suggests some sort of overflow. So I'd like to use np.int64 instead, but I can't find documentation on how to accomplish this.

    x = data['artist'].astype('category').cat.codes.astype(np.int64)
    x.dtype

This gives dtype('int64'), but it is clear that the int32 gets converted to
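One thing worth checking before suspecting overflow: pandas uses the code -1 to mark missing values in a categorical, so negative codes can simply mean the column contains NaN or None. The sketch below uses a made-up artist column to show this; whether it explains the asker's data is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical artist column with one missing value.
artists = pd.Series(['Adele', 'Bowie', None, 'Adele'])

# Integer codes for each category; missing entries get the sentinel -1.
codes = artists.astype('category').cat.codes

# Widening the dtype afterwards keeps the same values, including -1.
codes64 = codes.astype(np.int64)

# Rows whose code is -1 are the missing artists.
missing_mask = codes == -1

print(codes.tolist())
print(codes64.dtype)
```

pandas picks the smallest signed integer dtype that can hold the number of categories, so the codes dtype grows with the category count on its own.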

Matplotlib: how to plot a line with categorical data on the x-axis?


Question: I am trying to plot a few lines (not a bar plot, as in this case). My y values are floats, whereas my x values are categorical data. How can I do this in matplotlib?

My values:

    data1 = [5.65, 7.61, 8.17, 7.60, 9.54]
    data2 = [7.61, 16.17, 16.18, 19.54, 19.81]
    data3 = [29.55, 30.24, 31.51, 36.40, 35.47]

My categories:

    x_axis = ['A', 'B', 'C', 'D', 'E']

The code I am using, which does not give me what I want:

    import matplotlib.pyplot as plt
    fig = plt.figure()  # Creates a new figure
    ax1 = fig.add_subplot(111)  # Plot with: 1 row,
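One common approach, sketched here with the data from the question, is to plot against integer positions and then label the ticks with the category names. The output file name and the use of the non-interactive Agg backend are assumptions made so that the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt

data1 = [5.65, 7.61, 8.17, 7.60, 9.54]
data2 = [7.61, 16.17, 16.18, 19.54, 19.81]
data3 = [29.55, 30.24, 31.51, 36.40, 35.47]
x_axis = ['A', 'B', 'C', 'D', 'E']

# Use integer positions on the x-axis, one per category.
positions = list(range(len(x_axis)))

fig, ax = plt.subplots()
for series, label in [(data1, 'data1'), (data2, 'data2'), (data3, 'data3')]:
    ax.plot(positions, series, marker='o', label=label)

# Replace the numeric tick labels with the category names.
ax.set_xticks(positions)
ax.set_xticklabels(x_axis)
ax.legend()

fig.savefig('categorical_lines.png')  # hypothetical output file name
```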

How to encode categorical features in sklearn?


Question: I have a dataset with 41 features (columns 0 to 40), of which 7 are categorical. This categorical set is divided into two subsets:

    A subset of string type (column-features 1, 2, 3)
    A subset of int type, in binary form 0 or 1 (column-features 6, 11, 20, 21)

Furthermore, column-features 1, 2 and 3 (of string type) have cardinality 3, 66 and 11 respectively. In this context I have to encode them in order to use a support vector machine algorithm. This is the code that I have:

    import numpy as np
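A possible setup, sketched on a small made-up table rather than the asker's 41-column dataset: one-hot encode only the string columns with OneHotEncoder inside a ColumnTransformer and pass the binary and numeric columns through unchanged. The column names and values below are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: two string features, one binary flag, one numeric.
toy = pd.DataFrame({
    'cat_a': ['x', 'y', 'z', 'x'],
    'cat_b': ['p', 'q', 'p', 'p'],
    'bin_flag': [0, 1, 1, 0],
    'num': [0.5, 1.5, 2.5, 3.5],
})

# One-hot encode the string columns; leave everything else as it is.
# sparse_threshold=0 asks for a dense array as the combined output.
encoder = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['cat_a', 'cat_b']),
    ],
    remainder='passthrough',
    sparse_threshold=0,
)

encoded = encoder.fit_transform(toy)
print(encoded.shape)
```

The binary 0/1 columns need no further encoding, so they are passed through; the encoded array can then be handed to an SVM estimator, usually after scaling the numeric columns.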

How to check for correlation among continuous and categorical variables in Python?


Question: I have a dataset that includes binary categorical variables and continuous variables. I'm trying to apply a linear regression model to predict a continuous variable. Can someone please let me know how to check for correlation between the categorical variables and the continuous target variable?

Current code:

    import pandas as pd
    df_hosp = pd.read_csv(r'C:\Users\LAPPY-2\Desktop\LengthOfStay.csv')
    data = df_hosp[['lengthofstay', 'male', 'female', 'dialysisrenalendstage', 'asthma', \
                    'irondef',
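For binary variables coded as 0/1, the Pearson correlation with a continuous variable is the point-biserial correlation, so pandas' corrwith is one simple way to get it. The sketch below uses a tiny made-up table that borrows three column names from the question; the values are not from the asker's file.

```python
import pandas as pd

# Hypothetical toy data: continuous target plus two 0/1 indicator columns.
toy = pd.DataFrame({
    'lengthofstay': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'male': [0, 0, 0, 1, 1, 1],
    'asthma': [0, 1, 0, 1, 0, 1],
})

# Pearson correlation of each binary column with the continuous target.
# For 0/1 columns this equals the point-biserial correlation.
correlations = toy.drop(columns='lengthofstay').corrwith(toy['lengthofstay'])
print(correlations)
```

A correlation near zero only rules out a linear association; comparing group means or plotting the target by group is a useful complement.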

Any way to get mappings of a label encoder in Python pandas?


Question: I am converting strings to categorical values in my dataset using the following piece of code.

    data['weekday'] = pd.Categorical.from_array(data.weekday).labels

For example:

    index  weekday
    0      Sunday
    1      Sunday
    2      Wednesday
    3      Monday
    4      Monday
    5      Thursday
    6      Tuesday

After encoding the weekday, my dataset appears like this:

    index  weekday
    0      3
    1      3
    2      6
    3      1
    4      1
    5      4
    6      5

Is there any way I can know that Sunday has been mapped to 3, Wednesday to 6 and so on?

Answer 1: The best way of doing this can be to use label
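As a pandas-only alternative to the approach in the truncated answer, the mapping can be read off the categorical itself: the integer code of each label is its position in .categories. The sketch uses pd.Categorical(...) and .codes, since Categorical.from_array and .labels are not available in current pandas. The Friday and Saturday rows are added as an assumption so that the codes line up with the ones shown in the question.

```python
import pandas as pd

# First seven rows come from the question; Friday and Saturday are assumed
# to occur elsewhere in the asker's data.
weekdays = pd.Series([
    'Sunday', 'Sunday', 'Wednesday', 'Monday', 'Monday', 'Thursday', 'Tuesday',
    'Friday', 'Saturday',
])

cat = pd.Categorical(weekdays)

# Integer code assigned to each row.
codes = cat.codes

# Categories are sorted, and each label's code is its index in that list.
label_to_code = {label: code for code, label in enumerate(cat.categories)}

print(label_to_code)
```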

In gbm multinomial dist, how to use predict to get categorical output? [duplicate]


Question: This question already has answers here: GBM multinomial distribution, how to use predict() to get predicted class? (2 answers). Closed 4 years ago.

My response is a categorical variable (some letters), so I used distribution='multinomial' when building the model. Now I want to predict the response and obtain the output in terms of these letters, instead of a matrix of probabilities. However, predict(model, newdata, type='response') gives probabilities, the same as the result of type=
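The general idea for turning a probability matrix into class labels is to pick, for each row, the column with the highest probability. The sketch below shows that step in Python with NumPy rather than R; the class labels and probabilities are made up, and it does not call gbm.

```python
import numpy as np

# Hypothetical class labels, in the same order as the probability columns.
class_labels = np.array(['A', 'B', 'C'])

# Hypothetical predicted probabilities: one row per observation.
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Index of the most probable class in each row, mapped back to its label.
predicted = class_labels[probabilities.argmax(axis=1)]
print(predicted)
```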