categorical-data

Interpretation of ordered and non-ordered factors, vs. numerical predictors in model summary

亡梦爱人 posted on 2019-12-17 13:38:08
Question: I have fitted a model of the form Y ~ A + A^2 + B + mixed.effect(C), where Y is continuous, A is continuous, and B refers to a DAY and currently looks like this: Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 < 11 < 12. I can easily change the data type, but I'm not sure whether it is more appropriate to treat B as numeric, as a factor, or as an ordered factor. And when it is treated as numeric or as an ordered factor, I'm not quite sure how to interpret the output. When treated as an ordered factor, summary(my.model)

How to handle categorical features with spark-ml?

天涯浪子 posted on 2019-12-17 02:34:30
Question: How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame. Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to

R: Random sampling an even number of observations from a range of categories

坚强是说给别人听的谎言 posted on 2019-12-13 16:28:47
Question: I previously took a random sample of postcodes from my dataframe and then realised that I wasn't sampling across all higher-level statistical units. I have around 1 million postcodes and 7000 middle output statistical units. I want the sample to have roughly the same number of postcodes from each statistical unit. How do I randomly sample 35 postcodes from each higher-level statistical unit? I previously used the following code to randomly sample 250,000 postcodes: total.sample <- total
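The per-group sampling asked about here can be sketched in plain Python. This is only an illustration of the idea, not the asker's R code: the `sample_per_group` helper, the toy postcodes, and the unit labels are all hypothetical, and groups with fewer than `n` members are assumed to contribute everything they have.

```python
import random
from collections import defaultdict


def sample_per_group(rows, key, n, seed=0):
    """Return up to n randomly chosen rows from each group defined by key(row)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    sampled = []
    for unit in sorted(groups):
        members = groups[unit]
        # Never ask for more rows than the group contains.
        k = min(n, len(members))
        sampled.extend(rng.sample(members, k))
    return sampled


# Hypothetical toy data: 500 postcodes spread evenly over 5 statistical units.
rows = [(f"PC{i:04d}", f"unit{i % 5}") for i in range(500)]
sample = sample_per_group(rows, key=lambda r: r[1], n=35)
```

With 7000 units and 35 postcodes per unit, the same approach would give at most 245,000 rows, which is close to the 250,000 mentioned in the question.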

Pandas MultiIndex custom sort levels by categorical order, not alphabetically

你。 posted on 2019-12-13 12:31:48
Question: I'm new to Pandas (0.16.1) and want a custom sort in a MultiIndex, so I use Categoricals. Part of my MultiIndex:
    Part   Defect  Own
    Кузов  504     ИП
    Кузов  504     Итого
    Кузов  504     ПС
    Кузов  505     ПС
    Кузов  506     ПС
    Кузов  507     ПС
    Кузов  530     ИП
    Кузов  530     Итого
    Кузов  530     ПС
I create a pivot table with MultiIndex levels [Defect, Own]. Then I make "Own" Categorical (see the P.S. part of the question) to sort it as [ИП, ПС, Итого]. But when I prepend levels with "Part", which is also Categorical based on the "Defect" level, and sort
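The ordering the asker is after can be illustrated without pandas. The sketch below is a library-independent analogue, not a MultiIndex solution: it sorts a few of the rows shown above by Part, then Defect, then by an explicit rank for Own. The names `OWN_ORDER` and `own_rank` are made up for this example.

```python
# Desired custom order for the "Own" level, as stated in the question.
OWN_ORDER = ["ИП", "ПС", "Итого"]
own_rank = {label: position for position, label in enumerate(OWN_ORDER)}

# A few (Part, Defect, Own) rows taken from the excerpt above.
rows = [
    ("Кузов", 504, "ИП"),
    ("Кузов", 504, "Итого"),
    ("Кузов", 504, "ПС"),
    ("Кузов", 505, "ПС"),
    ("Кузов", 530, "ИП"),
    ("Кузов", 530, "Итого"),
    ("Кузов", 530, "ПС"),
]

# Sort by Part and Defect as usual, but by the explicit rank for Own,
# so that "Итого" comes last instead of in alphabetical position.
sorted_rows = sorted(rows, key=lambda r: (r[0], r[1], own_rank[r[2]]))
```

An ordered categorical plays the same role as `own_rank` here: it replaces alphabetical comparison with comparison by position in a declared list of categories.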

Regression gives error on one of the input variables “contrasts can be applied only to factors with 2 or more levels” [duplicate]

风流意气都作罢 posted on 2019-12-13 10:37:28
Question: This question already has answers here: How to debug “contrasts can be applied only to factors with 2 or more levels” error? (2 answers). Closed last year. I am running a logit regression in R with a large number of input variables. newlogit <- glm(install. ~ SIZES + GROSSCONSUMPTION.... + NETTCONSUMPTION..... + NETTGENERATION....... + GROSSGENERATION.... + Variable. + Fixed + Cost.of.gross.cons + Cost.of.net.cons + Cons.savings + generation.gains + Total.savings + Cost.of.system + Payback +
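This error generally means that some factor in the formula has only one distinct level among the rows used in the fit. A minimal Python sketch of that check follows, assuming the data is held as a dict mapping column names to lists of values. The helper `single_level_columns`, the column names, and the values are hypothetical and are not taken from the asker's data.

```python
def single_level_columns(columns, missing=(None, "")):
    """Return names of columns with fewer than two distinct non-missing values."""
    flagged = []
    for name, values in columns.items():
        levels = {v for v in values if v not in missing}
        if len(levels) < 2:
            flagged.append(name)
    return flagged


# Hypothetical data: "tariff" has a single level and would be flagged.
data = {
    "size": ["small", "large", "small", "medium"],
    "tariff": ["fixed", "fixed", "fixed", "fixed"],
    "region": ["north", None, "south", "north"],
}
flagged = single_level_columns(data)
```

In R the relevant count is the one taken after rows with missing values have been dropped, so a factor that looks fine in the full data can still collapse to one level in the rows that reach the model.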

How to generate random data set with predicted probability?

ぐ巨炮叔叔 posted on 2019-12-13 05:41:14
Question: I'm struggling to generate a random data set from the predicted probabilities of a multinomial logistic regression. Let's take an example. I'll use the nnet package for multinomial logistic regression, and the wine data set from the rattle.data package.
    library("nnet")
    library("rattle.data")
    data(wine)
    multinom.fit<-multinom(Type~Alcohol+Color,data=wine)
    summary(multinom.fit)
    Call: multinom(formula = Type ~ Alcohol + Color - 1, data = wine)
    Coefficients:
      Alcohol    Color
    2 0.6258035  -1.9480658
    3 -0.3457799
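One common way to turn predicted class probabilities into a simulated data set is to draw one class per observation, using that observation's probability row as the weights. The Python sketch below shows the idea with the standard library. The probability rows are made-up numbers, not output of the fitted model, and `draw_classes` is a hypothetical helper.

```python
import random


def draw_classes(prob_rows, classes, seed=0):
    """Draw one class label per row, weighted by that row's probabilities."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.choices(classes, weights=row, k=1)[0] for row in prob_rows]


# Hypothetical predicted probabilities for wine types 1, 2 and 3.
classes = [1, 2, 3]
prob_rows = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [1.0, 0.0, 0.0],  # a degenerate row always yields class 1
]
simulated = draw_classes(prob_rows, classes)
```

Each call produces one simulated outcome per row, so repeating it with different seeds gives as many synthetic response vectors as needed.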

Stata: saving regressions coefficients and standard errors in .dta file when there are factor variables

霸气de小男生 posted on 2019-12-13 04:35:10
Question: I would like to run several regressions and store their results in a DTA file that I could later use for analysis. My constraints are: (1) I cannot install modules (I am writing code for other people and am not sure what modules they have installed). (2) Some of the regressors are factor variables. (3) The regressions differ only in the dependent variable, so I would like to store that in the final dataset to keep track of which regression the coefficients/variances correspond to. I am seriously losing sanity

Changing Continuous Ranges to Categorical in R

六眼飞鱼酱① posted on 2019-12-13 04:25:34
Question: I was trying to convert some continuous integers to categorical ranges, but something happened that I did not understand. Although I fixed it and got what I wanted, I still don't understand why it happened. The variable holds integers from 0 to 12, and the following code left 10, 11, and 12 out of the 5+ category.
    py2$Daily.Whole.Grain[py2$Daily.Whole.Grain==0]<-"0"
    py2$Daily.Whole.Grain[py2$Daily.Whole.Grain==1]<-"1"
    py2$Daily.Whole.Grain[py2$Daily.Whole.Grain==2]<-"2"
    py2$Daily.Whole.Grain[py2$Daily
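A plausible cause, assuming the truncated code goes on to test something like >= 5: assigning "0" to part of a numeric vector in R coerces the whole vector to character, so later comparisons are made on strings, and "10", "11" and "12" compare as smaller than "5". The Python sketch below reproduces the same string-comparison effect and shows binning on the numbers before converting to labels. The values and the `to_category` helper are illustrative assumptions.

```python
values = list(range(13))  # integers 0..12, as described in the question

# Once the values are text, comparison is character by character:
# "10" starts with "1", which sorts before "5".
as_text = [str(v) for v in values]
matched_as_text = [s for s in as_text if s >= "5"]
# matched_as_text is ['5', '6', '7', '8', '9']; '10', '11', '12' are missing.


def to_category(v):
    """Bin on the numeric value first, then convert to a label."""
    return "5+" if v >= 5 else str(v)


categories = [to_category(v) for v in values]
```

Doing every comparison on the original numeric column and writing the labels into a separate column avoids the problem.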

change name of specific levels in factor

ε祈祈猫儿з posted on 2019-12-13 01:23:03
Question: The data frame I am working on contains many factors. Take the categorical variables from mtcars (cyl, vs, am, gear, carb).
    head(mtcars[c("cyl","vs","am","gear","carb")])
                      cyl vs am gear carb
    Mazda RX4           6  0  1    4    4
    Mazda RX4 Wag       6  0  1    4    4
    Datsun 710          4  1  1    4    1
    Hornet 4 Drive      6  1  0    3    1
    Hornet Sportabout   8  0  0    3    2
    Valiant             6  1  0    3    1
Currently I have two nested for loops that extract the levels occurring in less than 10% of cases within a given factor and assign them a new level name. So I

Filtering and creating a column based on the date column

二次信任 posted on 2019-12-12 19:30:17
Question: I have sample data as below:
    date        Deadline
    2018-08-01
    2018-08-11
    2018-09-18
    2018-12-08
    2018-12-18
I want to fill in the Deadline column, using the conditions described in the code, with "1 DL", "2 DL", "3 DL" and so on, creating a new column based on the date column in Python. It gives an error: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index 0') I have tried the following: df['date'] = pd.to_datetime(df['date'], format = "%y-%m-