chi-squared

SparkException: Chi-square test expect factors

房东的猫 submitted on 2020-07-21 07:04:30
Question: I have a dataset containing 42 features and 1 label. I want to apply the chi-square selector from the Spark ML library for feature selection before running a decision tree for anomaly detection, but I get this error while applying the chi-square selector: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 45, localhost, executor driver): org.apache.spark.SparkException: Chi-square
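
The title of this question says the chi-square test expects factors, that is, categorical features. One possible workaround, assuming the offending columns are continuous, is to bin them before feature selection. The following is only a plain-Python sketch of quantile binning, not Spark code; `quantile_edges`, `discretize` and the sample column are hypothetical names and data chosen for illustration. In Spark ML, the `QuantileDiscretizer` transformer plays a comparable role.

```python
from bisect import bisect_right


def quantile_edges(values, n_bins):
    """Return n_bins - 1 cut points taken at evenly spaced ranks of the sorted data."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]


def discretize(values, n_bins=4):
    """Map each continuous value to a bin index in 0 .. n_bins - 1."""
    edges = quantile_edges(values, n_bins)
    # bisect_right counts how many cut points are <= v, which is the bin index.
    return [bisect_right(edges, v) for v in values]


# Made-up continuous feature column, for illustration only.
feature = [0.1, 0.4, 0.35, 2.7, 3.3, 8.9, 9.1, 15.0]
print(discretize(feature, n_bins=4))
```

The number of bins is a judgment call: too few bins hide structure, while too many leave sparse cells that make the chi-square approximation unreliable.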

Plot the equivalent of correlation matrix for factors (categorical data)? And mixed types?

痴心易碎 submitted on 2020-07-05 06:55:11
Question: There are actually 2 questions, one more advanced than the other. Q1: I am looking for a method that is similar to corrplot() but can deal with factors. I originally tried to use chisq.test() and then calculate the p-value and Cramer's V as a correlation measure, but there are too many columns to work through by hand. Could anyone tell me if there is a quick way to create a "corrplot" in which each cell contains the value of Cramer's V, while the colour is determined by the p-value? Or any other similar kind of plot.
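
The pairwise computation described in Q1 can be sketched as follows. This is a minimal plain-Python illustration, not the R solution the asker wants: `cramers_v` and `cramers_v_matrix` are hypothetical helper names, the statistic is the uncorrected Pearson chi-square (no continuity correction), and the p-value colouring is left out.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramer's V for two equally long sequences of category labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_tot = Counter(xs)
    col_tot = Counter(ys)
    chi2 = 0.0
    for r in row_tot:
        for c in col_tot:
            expected = row_tot[r] * col_tot[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    k = min(len(row_tot), len(col_tot)) - 1
    # A variable with a single level carries no association information.
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


def cramers_v_matrix(columns):
    """Nested dict of Cramer's V for every pair of named categorical columns."""
    names = list(columns)
    return {a: {b: cramers_v(columns[a], columns[b]) for b in names} for a in names}


# Made-up categorical data, for illustration only.
data = {
    "colour": ["red", "red", "blue", "blue"],
    "shape": ["round", "round", "square", "square"],
    "size": ["S", "L", "S", "L"],
}
matrix = cramers_v_matrix(data)
```

The resulting nested dictionary can be handed to any heatmap routine; the diagonal is 1 for every variable with more than one level.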

Plot the equivalent of correlation matrix for factors (categorical data)? And mixed types?

一笑奈何 submitted on 2020-07-05 06:55:08
Question: There are actually 2 questions, one more advanced than the other. Q1: I am looking for a method that is similar to corrplot() but can deal with factors. I originally tried to use chisq.test() and then calculate the p-value and Cramer's V as a correlation measure, but there are too many columns to work through by hand. Could anyone tell me if there is a quick way to create a "corrplot" in which each cell contains the value of Cramer's V, while the colour is determined by the p-value? Or any other similar kind of plot.

Feature selection using scikit-learn

狂风中的少年 submitted on 2020-01-30 15:48:46
Question: I'm new to machine learning. I'm preparing my data for classification using the scikit-learn SVM. In order to select the best features I have used the following method: SelectKBest(chi2, k=10).fit_transform(A1, A2) Since my dataset contains negative values, I get the following error: ValueError Traceback (most recent call last) /media/5804B87404B856AA/TFM_UC3M/test2_v.py in <module>() ----> 1 2 3 4 5 /usr/local/lib/python2.6/dist-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit
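
scikit-learn's `chi2` score function is defined for non-negative features such as counts or frequencies, which is why negative values are rejected. One possible workaround, assuming a rescaled feature is still meaningful for the data at hand, is to map every column to [0, 1] first. The sketch below does this in plain Python; `min_max_scale` is a hypothetical helper and the matrix is made up. In scikit-learn itself, `MinMaxScaler` performs the same rescaling, and a score function such as `f_classif` avoids the non-negativity requirement altogether.

```python
def min_max_scale(rows):
    """Rescale every column of a list-of-rows matrix to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        # A constant column has span 0; map it to 0.0 instead of dividing by zero.
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


# Made-up feature matrix containing negative values, for illustration only.
A1 = [[-2.0, 10.0], [0.0, 20.0], [2.0, 40.0]]
scaled = min_max_scale(A1)
print(scaled)
```

The scaling parameters should be learned on the training split only and then reused on the test split, otherwise information leaks between the two.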

how to run chisq.test in loops using apply

好久不见. submitted on 2020-01-16 19:14:19
Question: I am new to R. For my project I need to run a chi-square test on hundreds of thousands of entries. I taught myself for a few days and wrote some code to run chisq.test in a loop. Code: the.data = read.table ("test_chisq_allelefrq.txt", header=T, sep="\t",row.names=1) p=c() ID=c() for (i in 1:nrow(the.data)) { data.row = the.data [i,] data.matrix = matrix ( c(data.row$cohort_1_AA, data.row$cohort_1_AB, data.row$cohort_1_BB, data.row$cohort_2_AA, data.row$cohort_2_AB, data.row
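
The per-row test in the loop above can be sketched in plain Python as follows. This is an illustration under assumptions, not a translation of the asker's full script (which is cut off): it assumes each row holds six genotype counts in the order cohort 1 (AA, AB, BB) then cohort 2 (AA, AB, BB), where the final cohort 2 BB column is inferred from the naming pattern; `chi2_test_2x3` and the sample rows are hypothetical. A 2x3 table has 2 degrees of freedom, for which the chi-square p-value has the closed form exp(-statistic / 2).

```python
import math


def chi2_test_2x3(row):
    """Pearson chi-square test on a 2x3 table given as six counts.

    Returns (statistic, p_value). Raises ZeroDivisionError if a whole
    row or column of the table is zero, because an expected count is then 0.
    """
    table = [row[:3], row[3:]]
    n = sum(row)
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    stat = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_tot[i] * col_tot[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 2 degrees of freedom.
    return stat, math.exp(-stat / 2)


# Made-up genotype counts keyed by a hypothetical marker ID.
rows = {
    "snp_1": [10, 20, 30, 30, 20, 10],
    "snp_2": [15, 30, 15, 15, 30, 15],
}
results = {snp: chi2_test_2x3(counts) for snp, counts in rows.items()}
```

Building the results in one comprehension avoids growing two vectors element by element, which is the slow part of the loop-and-append pattern.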

Error using dynamic variable specification in R survey function svychisq()

混江龙づ霸主 submitted on 2020-01-06 14:44:10
Question: I am using functions from the R survey library and, following this example on Stack Overflow, I use bquote() and as.name() to dynamically construct the formula that specifies the variables. This works fine for svytable(), but not for svychisq(). For example: library(survey) data(api) dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc) colvar <- 'sch.wide' rowvar <- 'awards' svytable(bquote(~.(as.name(rowvar)) + .(as.name(colvar)) ), dstrat) sch.wide awards No Yes No

Call R from JAVA to get Chi-squared statistic and p-value

旧城冷巷雨未停 submitted on 2019-12-29 08:06:08
Question: I have two 4x4 matrices in Java, where one matrix holds observed counts and the other expected counts. I need an automated way to calculate the p-value from the chi-square statistic between these two matrices; however, Java has no such function as far as I am aware. I can calculate the chi-square statistic and its p-value by reading the two matrices into R from .csv files and then using the chisq.test function as follows: obs<-read.csv("obs.csv") exp<-read.csv("exp.csv") chisq.test(obs,exp) where
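
The statistic and p-value described above can also be computed without R. The sketch below is in plain Python rather than Java and makes two assumptions: the expected counts are all positive, and the degrees of freedom are supplied by the caller, since the excerpt does not say how the expected matrix was obtained. `chi2_sf` and `chi2_from_expected` are hypothetical names. The tail probability uses the recurrence Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Γ(a + 1) for the regularized upper incomplete gamma function, started from Q(1/2, y) = erfc(√y) or Q(1, y) = e^(-y).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y)
    while a < df / 2.0:
        # Step from Q(a, y) to Q(a + 1, y); computed in log space for stability.
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


def chi2_from_expected(observed, expected, df):
    """Pearson statistic sum((O - E)^2 / E) over all cells, plus its p-value."""
    stat = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    return stat, chi2_sf(stat, df)


# Made-up 2x2 example, for illustration only; df=3 treats the four cells
# as one goodness-of-fit comparison against fixed expected counts.
stat, p_value = chi2_from_expected([[12, 8], [9, 11]], [[10, 10], [10, 10]], df=3)
```

The same two functions port directly to Java, because they rely only on exp, log, erfc and log-gamma; of these, erfc and log-gamma are not in `java.lang.Math` and would need a numerical library or a hand-written approximation.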

Calculate Fisher's exact test p-value in dataframe rows

自古美人都是妖i submitted on 2019-12-25 15:35:20
Question: I have a list of 1700 samples in a data frame, where every row holds the number of colored items that each assistant counted in a random number of specimens from different boxes. There are two possible colors and two people counting the items, so each row maps naturally onto a 2x2 contingency table.

df
Box-ID  1_Red  1_Blue  2_Red  2_Blue
1       1075   918     29     26
2       903    1076    135    144

I would like to know how I can treat every row as a contingency table (either vector or matrix) in order to
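
Treating each row as its own 2x2 table can be sketched as follows. This is a plain-Python illustration, not the R answer the asker is after, and it rests on assumptions: each row is laid out as [[1_Red, 1_Blue], [2_Red, 2_Blue]], and the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is one common convention. `fisher_exact_two_sided` is a hypothetical name, and `math.comb` needs Python 3.8 or later.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c

    def weight(k):
        # Unnormalised hypergeometric probability of seeing k in the top-left cell.
        return comb(r1, k) * comb(r2, c1 - k)

    observed = weight(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Exact integer arithmetic, so no floating-point tolerance is needed here.
    tail = sum(w for w in map(weight, range(lo, hi + 1)) if w <= observed)
    return tail / comb(r1 + r2, c1)


# The two rows shown in the question: (Box-ID, 1_Red, 1_Blue, 2_Red, 2_Blue).
rows = [(1, 1075, 918, 29, 26), (2, 903, 1076, 135, 144)]
p_values = {box: fisher_exact_two_sided(*counts) for box, *counts in rows}
```

With 1700 rows and counts of this size the exact computation stays fast, because the sum only runs over the range of feasible top-left cell values for each table.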

SSAS (Sexual Segregation and Aggregation Statistic) in R - calling C

此生再无相见时 submitted on 2019-12-24 19:29:01
Question: I am running the following code, found in the appendix of a paper at https://wiley.figshare.com/articles/Supplement_1_R_code_used_to_format_the_data_and_compute_the_SSAS_/3528698/1, to calculate the Sexual Segregation and Aggregation Statistic in R, but I keep getting the following error. Presumably there is an issue with calling a function from C, but I cannot resolve it. # Main function, computes both the SSAS (Sexual Segregation and # Aggregation Statistic) and the 95% limits of SSAS # under