categorical-data | 易学教程

Interpretation of ordered and non-ordered factors, vs. numerical predictors in model summary

Question: I have fitted a model of the form Y ~ A + A^2 + B + mixed.effect(C), where Y is continuous, A is continuous, and B actually refers to a day and currently looks like this: Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 < 11 < 12. I can easily change the data type, but I'm not sure whether it is more appropriate to treat B as numeric, as a factor, or as an ordered factor. And when B is treated as numeric or as an ordered factor, I'm not quite sure how to interpret the output. When treated as an ordered factor, summary(my.model)

How to handle categorical features with spark-ml?

Question: How do I handle categorical data with spark-ml rather than spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame. Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to

R: Random sampling an even number of observations from a range of categories

Question: I previously took a random sample of postcodes from my data frame and then realised that I wasn't sampling across all higher-level statistical units. I have around 1 million postcodes and 7,000 middle output statistical units. I want the sample to have roughly the same number of postcodes from each statistical unit. How do I randomly sample 35 postcodes from each higher-level statistical unit? I previously used the following code to randomly sample 250,000 postcodes: total.sample <- total

Pandas MultiIndex custom sort levels by categorical order, not alphabetically

Question: I'm new to pandas (0.16.1) and want a custom sort order in a MultiIndex, so I use Categoricals. Part of my MultiIndex:

Part   Defect  Own
Кузов  504     ИП
Кузов  504     Итого
Кузов  504     ПС
Кузов  505     ПС
Кузов  506     ПС
Кузов  507     ПС
Кузов  530     ИП
Кузов  530     Итого
Кузов  530     ПС

I create a pivot table with MultiIndex levels [Defect, Own]. Then I make "Own" Categorical (see the P.S. part of the question) to sort it as [ИП, ПС, Итого]. But when I prepend the levels with "Part", which is also Categorical based on the "Defect" level, and sort
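
The excerpt above stops before showing any code, so the following is only a minimal sketch of the general technique it describes: declaring an ordered Categorical so that sorting follows a custom order instead of alphabetical order. It assumes a recent pandas release rather than 0.16.1, and the `count` column and its values are hypothetical, added only to give the frame something to hold.

```python
import pandas as pd

# Custom order taken from the question: ИП first, then ПС, then Итого.
own_order = ["ИП", "ПС", "Итого"]

df = pd.DataFrame({
    "Part": ["Кузов"] * 7,
    "Defect": ["504", "504", "504", "505", "530", "530", "530"],
    "Own": ["ИП", "Итого", "ПС", "ПС", "ИП", "Итого", "ПС"],
    "count": [1, 3, 2, 4, 5, 9, 4],  # hypothetical values
})

# An ordered Categorical sorts by category position, not alphabetically.
df["Own"] = pd.Categorical(df["Own"], categories=own_order, ordered=True)

# Sort on the columns first, then build the MultiIndex from the sorted rows.
result = (
    df.sort_values(["Part", "Defect", "Own"])
      .set_index(["Part", "Defect", "Own"])
)

print(result)
```

Sorting the columns before calling `set_index` is a design choice made for this sketch: it keeps the ordering logic in one place and does not depend on how a given pandas version sorts an existing MultiIndex.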

Regression gives error on one of the input variables “contrasts can be applied only to factors with 2 or more levels” [duplicate]

Question: I am running a logit regression in R with a large number of input variables. newlogit <- glm(install. ~ SIZES + GROSSCONSUMPTION.... + NETTCONSUMPTION..... + NETTGENERATION....... + GROSSGENERATION.... + Variable. + Fixed + Cost.of.gross.cons + Cost.of.net.cons + Cons.savings + generation.gains + Total.savings + Cost.of.system + Payback +

How to generate random data set with predicted probability?

Question: I'm struggling to generate a random data set from the predicted probabilities of a multinomial logistic regression. Let's take an example. I'll use the nnet package for multinomial logistic regression and the wine data set from the rattle.data package.

library("nnet")
library("rattle.data")
data(wine)
multinom.fit <- multinom(Type ~ Alcohol + Color, data = wine)
summary(multinom.fit)

Call:
multinom(formula = Type ~ Alcohol + Color - 1, data = wine)

Coefficients:
     Alcohol      Color
2  0.6258035 -1.9480658
3 -0.3457799

Stata: saving regressions coefficients and standard errors in .dta file when there are factor variables

Question: I would like to run several regressions and store their results in a DTA file that I can later use for analysis. My constraints are: (1) I cannot install modules (I am writing code for other people and am not sure what modules they have installed); (2) some of the regressors are factor variables. The regressions differ only by the dependent variable, so I would like to store that in the final dataset to keep track of which regression the coefficients/variances correspond to. I am seriously losing sanity

Changing Continuous Ranges to Categorical in R

Question: I was trying to convert some integers to categorical ranges, but something happened that I did not understand. Although I fixed it and got what I wanted, I still don't understand why it happened. The variable holds integers from 0 to 12; the following code left 10, 11, and 12 out of the 5+ category.

py2$Daily.Whole.Grain[py2$Daily.Whole.Grain==0]<-"0"
py2$Daily.Whole.Grain[py2$Daily.Whole.Grain==1]<-"1"
py2$Daily.Whole.Grain[py2$Daily.Whole.Grain==2]<-"2"
py2$Daily.Whole.Grain[py2$Daily
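
The excerpt is cut off before the line that builds the 5+ category, so the cause cannot be confirmed from it. One plausible explanation is that assigning "0" into a numeric vector in R coerces the whole vector to character, after which a comparison such as >= 5 is made on strings, and "10" sorts before "5". The sketch below illustrates that pitfall in Python rather than R (string comparison is lexicographic in both); the function names are hypothetical.

```python
def bin_by_string(value: str) -> str:
    """Bin using string comparison, which compares character by character."""
    return "5+" if value >= "5" else value


def bin_by_number(value: int) -> str:
    """Bin using numeric comparison, which is what was intended."""
    return "5+" if value >= 5 else str(value)


values = list(range(13))  # integers 0 to 12, as in the question

string_binned = [bin_by_string(str(v)) for v in values]
number_binned = [bin_by_number(v) for v in values]

# "10", "11", "12" start with "1", which sorts before "5", so they are missed.
print(string_binned)
print(number_binned)
```

Under this assumption, binning while the values are still numeric (or converting back to numbers before comparing) avoids the problem.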

change name of specific levels in factor

Question: The data frame I am working on contains many factors. Take the categorical variables from mtcars (cyl, vs, am, gear, carb).

head(mtcars[c("cyl","vs","am","gear","carb")])

                  cyl vs am gear carb
Mazda RX4           6  0  1    4    4
Mazda RX4 Wag       6  0  1    4    4
Datsun 710          4  1  1    4    1
Hornet 4 Drive      6  1  0    3    1
Hornet Sportabout   8  0  0    3    2
Valiant             6  1  0    3    1

Currently I have two nested for loops to extract those levels that occur less than 10% of the time in a given factor and assign them a new level name. So I

Filtering and creating a column based on the date column

Question: I have sample data as below:

date        Deadline
2018-08-01
2018-08-11
2018-09-18
2018-12-08
2018-12-18

I want to fill in the Deadline column, based on the conditions described in the code, with "1 DL", "2 DL", "3 DL" and so on, i.e. create a new column based on the date column in Python. It gives an error: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index 0') I have tried the following: df['date'] = pd.to_datetime(df['date'], format = "%y-%m-
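
The conditions themselves are cut off in the excerpt, so the sketch below is an assumption-laden illustration rather than the asker's logic. The "truth value of a Series is ambiguous" error typically appears when a whole Series is used in a Python `if` or combined with `and`/`or`; building element-wise boolean masks with `&` and passing them to `numpy.select` is one way around it. The two cutoff dates are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": ["2018-08-01", "2018-08-11", "2018-09-18", "2018-12-08", "2018-12-18"],
})
# %Y is the four-digit year; the sample dates are written as YYYY-MM-DD.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# Hypothetical cutoffs; the real conditions are not visible in the excerpt.
first_cutoff = pd.Timestamp("2018-08-31")
second_cutoff = pd.Timestamp("2018-11-30")

# Each condition is a boolean Series; combine with & instead of `and`.
conditions = [
    df["date"] <= first_cutoff,
    (df["date"] > first_cutoff) & (df["date"] <= second_cutoff),
    df["date"] > second_cutoff,
]
labels = ["1 DL", "2 DL", "3 DL"]

# The default is a string so that it matches the dtype of the labels.
df["Deadline"] = np.select(conditions, labels, default="")

print(df)
```

Because every condition is evaluated for the whole column at once, no row-wise `apply` is needed, and no Series is ever asked for a single truth value.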