categorical-data

GBM multinomial distribution, how to use predict() to get predicted class?

独自空忆成欢 submitted on 2019-12-30 08:27:47
Question: I am using the multinomial distribution from the gbm package in R. When I call predict(), I get a series of values:

5.086328 -4.738346 -8.492738 -5.980720 -4.351102 -4.738044 -3.220387 -4.732654

but I want the probability of each class. How do I recover the probabilities? Thank you.

Answer 1: Take a look at ?predict.gbm; you'll see that the function has a "type" parameter. Try predict(<gbm object>, <new data>, type="response").

Answer 2: predict.gbm(..., type=
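To make the link between the raw values and the probabilities concrete, here is a minimal Python sketch. It assumes that the values returned by default are per-class scores on the link scale and that the response-scale probabilities are their softmax; that matches my understanding of gbm's multinomial distribution, but check ?predict.gbm for your version. The class labels and scores are made up for illustration, and the predicted class is simply the one with the highest probability.

```python
import math

def softmax(scores):
    """Convert one row of link-scale scores into class probabilities."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class labels and link-scale scores for a single observation.
classes = ["a", "b", "c"]
link_scores = [2.0, 0.5, -1.0]

probs = softmax(link_scores)

# The predicted class is the one with the largest probability (argmax).
predicted = classes[probs.index(max(probs))]

print(probs)
print(predicted)
```

The same argmax step applies to each row of the probability matrix that type="response" returns.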

How to transform a categorical variable in Spark into a set of columns coded as {0,1}?

不想你离开。 submitted on 2019-12-30 01:20:09
Question: I'm trying to fit a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (in Scala) on a dataset that contains categorical variables. I discovered that Spark cannot work with that kind of variable directly. In R there is a simple way to deal with this: I convert the variable to a factor, and R creates a set of columns coded as {0,1} indicator variables. How can I do this with Spark?

Answer 1: Using VectorIndexer, you may tell the indexer the number
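For readers unfamiliar with the {0,1} coding the question describes, the following is a small plain-Python sketch of the idea. It is not Spark code: the function name one_hot and the sample values are hypothetical. In Spark itself, the spark.ml feature package provides StringIndexer and OneHotEncoder for this kind of transformation.

```python
def one_hot(values):
    """Return the sorted category levels and one {0,1} indicator row per value."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]

levels, rows = one_hot(colors)

print(levels)
for value, row in zip(colors, rows):
    print(value, row)
```

Note that R's model matrices usually drop one reference level to avoid collinearity with the intercept; this sketch keeps all levels for clarity.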

R graphics: How to plot a sequence of characters (pure categorical time series)

人盡茶涼 submitted on 2019-12-25 18:46:48
Question: I have a matrix in which each element is a purely categorical value: "a", "b", "c", "d", ... Each column of the matrix is a chronological entry. I want to plot the matrix row by row, with the sequence of characters on the y-axis. Here is the original matrix: Here is what I want the plot to look like: The red line is the first row of the matrix and the blue line is the fifth. I have tried some existing packages, but most of them require me to convert the categorical variables to numerical variables.

R graphics: How to plot a sequence of characters (pure categorical time series)

感情迁移 submitted on 2019-12-25 18:46:09
Question: I have a matrix in which each element is a purely categorical value: "a", "b", "c", "d", ... Each column of the matrix is a chronological entry. I want to plot the matrix row by row, with the sequence of characters on the y-axis. Here is the original matrix: Here is what I want the plot to look like: The red line is the first row of the matrix and the blue line is the fifth. I have tried some existing packages, but most of them require me to convert the categorical variables to numerical variables.

Heatmap of categorical variable counts

谁说我不能喝 submitted on 2019-12-24 14:05:45
Question: I have a data frame of items, each with multiple classifier columns that are categorical variables.

ID  test1  test2  test3
1   A      B      A
2   B      A      C
3   C      C      C
4   A      A      B
5   B      B      B
6   B      A      C

I want to generate a heatmap for each combination of test columns (test1 v test2, test1 v test3, etc.) using ggplot2. The heatmap would have all factor levels of one test's column (in this case A, B, C) on the x-axis and all factor levels of the other test on the y-axis, and the boxes in the heatmap should be colored based on the count
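The counts behind such heatmaps are just a cross-tabulation for every pair of test columns. The sketch below, in plain Python rather than R, computes those counts for the six rows shown in the question; the variable names are my own, and the plotting step (coloring tiles by count) is left out.

```python
from collections import Counter
from itertools import combinations

# The six rows from the question.
rows = [
    {"ID": 1, "test1": "A", "test2": "B", "test3": "A"},
    {"ID": 2, "test1": "B", "test2": "A", "test3": "C"},
    {"ID": 3, "test1": "C", "test2": "C", "test3": "C"},
    {"ID": 4, "test1": "A", "test2": "A", "test3": "B"},
    {"ID": 5, "test1": "B", "test2": "B", "test3": "B"},
    {"ID": 6, "test1": "B", "test2": "A", "test3": "C"},
]
tests = ["test1", "test2", "test3"]

# For each pair of test columns, count how often each (level, level) pair occurs.
pair_counts = {
    (x, y): Counter((row[x], row[y]) for row in rows)
    for x, y in combinations(tests, 2)
}

for (x, y), counts in pair_counts.items():
    print(x, "v", y, dict(counts))
```

Each Counter holds the cell values of one heatmap; level combinations that never occur are simply absent and would be drawn as zero.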

Generate two categorical variables with a chosen degree of association in R

可紊 submitted on 2019-12-24 11:43:56
Question: I'd like to use R to generate two categorical variables (such as eye color and hair color, for instance) where I can specify the degree to which these two variables are associated. It doesn't really matter to me which levels of eye color are associated with which levels of hair color, but being able to specify an overall association, such as by specifying the odds ratio, is a requirement. Also, I know there are ways to do this for two normally distributed continuous variables using,
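One way to think about the request, sketched here in Python and only for the simplest case of two binary variables: fix P(X=1) and P(Y=1 | X=0), then choose P(Y=1 | X=1) so that the odds of Y given X=1 equal the target odds ratio times the odds given X=0. The function names and parameter values are hypothetical, and variables with more than two levels would need a different construction.

```python
import random

def conditional_prob(q0, odds_ratio):
    """P(Y=1 | X=1) such that the Y-by-X odds ratio equals odds_ratio."""
    odds1 = odds_ratio * q0 / (1 - q0)  # odds of Y=1 when X=1
    return odds1 / (1 + odds1)

def simulate(n, p_x, q0, odds_ratio, seed=0):
    """Draw n (x, y) pairs of binary variables with the requested odds ratio."""
    rng = random.Random(seed)
    q1 = conditional_prob(q0, odds_ratio)
    pairs = []
    for _ in range(n):
        x = 1 if rng.random() < p_x else 0
        y = 1 if rng.random() < (q1 if x else q0) else 0
        pairs.append((x, y))
    return pairs

def sample_odds_ratio(pairs):
    """Odds ratio estimated from the 2x2 table of observed counts."""
    n = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
    for pair in pairs:
        n[pair] += 1
    return (n[1, 1] * n[0, 0]) / (n[1, 0] * n[0, 1])

pairs = simulate(n=100_000, p_x=0.5, q0=0.3, odds_ratio=3.0)
print(sample_odds_ratio(pairs))  # should be close to the target of 3.0
```

The sample odds ratio fluctuates around the target, and the fluctuation shrinks as n grows.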

Pandas Categorical data type not behaving as expected

蹲街弑〆低调 submitted on 2019-12-24 10:47:02
Question: I have the Pandas (version 0.15.2) dataframe below. I want to make the code column an ordered variable of type Categorical after creating the df, as below.

import numpy as np
import pandas as pd

df = pd.DataFrame({'id' : range(1,9),
                   'code' : ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two'],
                   'amount' : np.random.randn(8)},
                  columns= ['id','code','amount'])
df.code = df.code.astype('category')

>> 0 one
>> 1 one
>> 2 two
>> 3 three
>> 4 two
>> 5 three
>> 6 one
>> 7 two
>> Name: code, dtype: category
>>
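The rest of the question is cut off, so the following is only a sketch of one way to get an ordered categorical column in current pandas, not necessarily what the asker needed. It assumes a pandas version that provides pd.CategoricalDtype (not available in 0.15.2) and that the intended order is one < two < three. One plausible source of surprise is that a plain astype('category') sorts string categories alphabetically (one, three, two) and leaves them unordered.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 9),
    "code": ["one", "one", "two", "three", "two", "three", "one", "two"],
})

# Assumed intended order: one < two < three.
order = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
df["code"] = df["code"].astype(order)

# Sorting and min/max now follow the category order, not alphabetical order.
df_sorted = df.sort_values("code")

print(df["code"].cat.categories.tolist())
print(df["code"].min(), df["code"].max())
print(df_sorted["code"].tolist())
```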

Conditional calculation in R based on Row values and categories

孤者浪人 submitted on 2019-12-24 08:19:36
Question: I have this dataframe:

df <- data.frame(a=c("a1","a2","a3","a4","b1","b2","b3","b4","a1","a2","a3","a4","b1","b2","b3","b4"),
                 b=c("x1","x2","x3","total","x1","x2","x3","total","x1","x2","x3","total","x1","x2","x3","total"),
                 reg=c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 c=c(1:16))

which looks like:

    a   b      reg  c
1   a1  x1     A    1
2   a2  x2     A    2
3   a3  x3     A    3
4   a4  total  A    4
5   b1  x1     A    5
6   b2  x2     A    6
7   b3  x3     A    7
8   b4  total  A    8
9   a1  x1     B    9
10  a2  x2     B    10
11  a3  x3     B    11
12  a4  total  B    12
13

SQL subquery to get the total

梦想与她 submitted on 2019-12-24 06:58:42
Question: Using a SQL subquery, how do I get the total items and total revenue for each manager, including his team? Suppose I have this table items_revenue with columns: All the managers (is_manager=1) and their respective members are in the above table. Member1 is under Manager1, Member2 is under Manager2, and so on, but the real data are in random order. I want my query to output the following: This is related to "SQL query to get the subtotal of some rows", but I don't want to use the CASE expression. Thanks

line graph with 2 categorical variables and 1 continuous in R

牧云@^-^@ submitted on 2019-12-24 04:18:40
Question: I'm quite new to R and statistics in general. I am trying to plot two categorical variables (part of speech "pos", condition "trcond") and a numerical one (score "totacc") as a line graph in ggplot2.

> df1<-df[, c("trcond", "subtitle", "pos", "totacc")]
> head(df1)
    trcond  subtitle      pos  totacc
7   L       New Scene_16  lex  0.250
29  N       New Scene_16  lex  0.500
8   L       New Scene_25  lex  0.875
30  N       New Scene_25  lex  0.666
9   L       New Scene_29  lex  1.000
31  N       New Scene_29  lex  0.833

I have used this ggplot2 command:

>ggplot