chi-squared

how to understand the chi square contingency table

Deadly, submitted on 2019-12-24 09:25:33
Question: I have a few categorical features: ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']

    import pandas as pd
    from scipy.stats import chi2_contingency
    chi2, p, dof, expected = chi2_contingency(pd.crosstab(df.Gender, df.Married).values)
    print(f'Chi-square Statistic : {chi2} ,p-value: {p}')

Output:

    Chi-square Statistic : 79.63562874824729 ,p-value: 4.502328957824834e-19

How can I tell from these statistics whether the features are independent of each other? I am trying to build a
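One common way to read the output above, sketched under the assumption of a conventional 0.05 significance level: the null hypothesis of the test is that the two variables are independent, so a p-value below the threshold counts as evidence against independence. The threshold and the helper name `rejects_independence` are illustrative and do not come from the question.

```python
def rejects_independence(p_value, alpha=0.05):
    """Return True when the test rejects the null hypothesis of independence.

    alpha is an assumed significance level; 0.05 is a convention, not a rule.
    """
    return p_value < alpha


# Values reported in the question for Gender vs. Married.
chi2_stat = 79.63562874824729
p_value = 4.502328957824834e-19

if rejects_independence(p_value):
    verdict = "evidence of association (independence rejected)"
else:
    verdict = "no evidence against independence"

print(f"chi2={chi2_stat:.2f}, p={p_value:.3g}: {verdict}")
```

A small p-value only says that an association is detectable in the sample; it does not measure how strong that association is.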

Using R, apply multiple chi-square contingency table tests to a grouped data frame and add a new column containing the p values of the tests

此生再无相见时, submitted on 2019-12-23 18:47:46
Question: I have a data frame similar to the example below (a small extract of my actual data frame).

    frequencies <- data.frame(sex=c("female", "female", "male", "male", "female", "female", "male", "male", "female", "female", "male", "male", "female", "female", "male", "male"),
                              ecotype=c("Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave"),
                              contig_ID=c("Contig100169_2367", "Contig100169_2367", "Contig100169_2367",

scikit learn: desired amount of Best Features (k) not selected

落爺英雄遲暮, submitted on 2019-12-23 07:58:06
Question: I am trying to select the best features using chi-square (scikit-learn 0.10). From a total of 80 training documents I first extract 227 features, and from these 227 features I want to select the top 10.

    my_vectorizer = CountVectorizer(analyzer=MyAnalyzer())
    X_train = my_vectorizer.fit_transform(train_data)
    X_test = my_vectorizer.transform(test_data)
    Y_train = np.array(train_labels)
    Y_test = np.array(test_labels)
    X_train = np.clip(X_train.toarray(), 0, 1)
    X_test = np.clip(X_test.toarray(),

Chi-square testing for constraining a parameter

↘锁芯ラ, submitted on 2019-12-22 12:19:49
Question: I have an important question about using the chi^2 test to constrain a parameter in cosmology. I appreciate your help. Please do not downvote this question (it is important to me). Assume we have a data file (data.txt) containing 600 data points in 3 columns: the first column is the redshift (z), the second is the observed dL (m_obs), and the third is the error (err). As we know, the chi^2 function is

    chi^2=(m_obs-m_theo)**2/err**2 #chi^2=sigma((m_obs-m_theo)**2/err**2)
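The summed form in the comment above, chi^2 = sigma((m_obs-m_theo)**2/err**2), can be evaluated over a grid of candidate parameter values to find the best fit. The following is a minimal sketch under stated assumptions: `toy_model`, the four data points, and the grid range are all made up for illustration, because the question's actual model and data file are not shown.

```python
def chi_squared(param, z, m_obs, err, model):
    """Sum of squared, error-weighted residuals for one parameter value."""
    return sum(((m - model(zi, param)) / e) ** 2
               for zi, m, e in zip(z, m_obs, err))


def best_fit_on_grid(grid, z, m_obs, err, model):
    """Return (parameter, chi^2) for the grid point with the smallest chi^2."""
    best = min(grid, key=lambda p: chi_squared(p, z, m_obs, err, model))
    return best, chi_squared(best, z, m_obs, err, model)


# Hypothetical stand-in for the theoretical prediction; NOT a cosmological model.
def toy_model(z, a):
    return a * z


# Made-up data generated from toy_model with a = 2.0, only to exercise the code.
z = [0.1, 0.2, 0.3, 0.4]
m_obs = [toy_model(zi, 2.0) for zi in z]
err = [0.1, 0.1, 0.1, 0.1]

grid = [1.0 + 0.01 * i for i in range(201)]  # candidate values from 1.0 to 3.0
best_param, best_chi2 = best_fit_on_grid(grid, z, m_obs, err, toy_model)
print(best_param, best_chi2)
```

For a single fitted parameter with Gaussian errors, the range of values where chi^2 stays within 1 of its minimum is commonly quoted as the 1-sigma constraint.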

Pearson's Chi Square Test Python

时间秒杀一切, submitted on 2019-12-22 00:28:23
Question: I have two arrays on which I would like to run a Pearson's chi-square test (goodness of fit). I want to test whether there is a significant difference between the expected and observed results.

    observed = [11294, 11830, 10820, 12875]
    expected = [10749, 10940, 10271, 11937]

I want to compare 11294 with 10749, 11830 with 10940, 10820 with 10271, etc. Here's what I have:

    >>> from scipy.stats import chisquare
    >>> chisquare(f_obs=[11294, 11830, 10820, 12875],f_exp=[10749, 10940, 10271, 11937])
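A point worth checking with these arrays: the observed counts sum to 46819 and the expected counts to 43897, while Pearson's goodness-of-fit test assumes the two totals agree. The sketch below computes the statistic by hand, first as given and then after rescaling the expected counts to the observed total. The rescaling is an assumption about what the asker intends, not something stated in the question, and `chi2_sf_3dof` is a hypothetical helper that uses the closed form of the chi-square tail probability for 3 degrees of freedom.

```python
import math

observed = [11294, 11830, 10820, 12875]
expected = [10749, 10940, 10271, 11937]


def pearson_statistic(obs, exp):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


def chi2_sf_3dof(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


# Statistic with the arrays exactly as given (totals do not match).
raw_stat = pearson_statistic(observed, expected)

# Assumption: treat `expected` as proportions and rescale to the observed total.
scale = sum(observed) / sum(expected)
rescaled_expected = [e * scale for e in expected]
rescaled_stat = pearson_statistic(observed, rescaled_expected)

dof = len(observed) - 1  # 4 categories, no parameters estimated from the data
p_value = chi2_sf_3dof(rescaled_stat)
print(raw_stat, rescaled_stat, dof, p_value)
```

Most of the raw statistic comes from the difference in totals rather than from a difference in the shape of the two distributions, which is why the rescaled value is much smaller.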

Automate Chi-square across categories and columns

坚强是说给别人听的谎言, submitted on 2019-12-21 20:01:27
Question: I have a survey dataframe containing several questions (columns) coded as 1=agree/0=disagree. Respondents (rows) are categorized according to metrics "age" ("young","middle","old"), "region" ("East","Mid","West"), etc. There are around 30 categories in total (3 ages, 3 regions, 2 genders, 11 occupations, etc.). Within each metric, categories are non-overlapping and of different sizes. This simulates a cut-down version of the dataset:

    n<-400
    set.seed(1)
    data<-data.frame(age=sample(c('young',

Feature selection for multilabel classification (scikit-learn)

喜欢而已, submitted on 2019-12-21 06:04:13
Question: I'm trying to do feature selection with the chi-square method in scikit-learn (sklearn.feature_selection.SelectKBest). When I try to apply this to a multilabel problem, I get this warning:

    UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably duplicate features, or you used a classification score for a regression task.
    warn("Duplicate scores. Result may depend on feature ordering."

Why is it appearing, and how do I properly apply feature selection in this case

Chi-Squared Probability Function in C++

半世苍凉, submitted on 2019-12-21 05:18:12
Question: The following code of mine computes the confidence interval using the chi-square 'quantile' and probability functions from Boost. I am trying to implement these functions myself to avoid a dependency on Boost. Is there any resource where I can find such an implementation?

    #include <vector>
    #include <boost/math/distributions/chi_squared.hpp>
    #include <boost/cstdint.hpp>
    using namespace std;
    using boost::math::chi_squared;
    using boost::math::quantile;
    vector <double> ConfidenceInterval(double x) {
        vector <double> ConfInts;
        //
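As a rough illustration of what a Boost-free version could look like, here is a sketch of the chi-square tail probability and quantile using only elementary functions. It is written in Python to stay consistent with the other examples in this listing, and it relies only on `erfc`, `exp`, `sqrt`, and the gamma function, which also exist in C++ `<cmath>` (`std::erfc`, `std::exp`, `std::sqrt`, `std::tgamma`). The names `chi2_sf` and `chi2_quantile` are hypothetical, and the sketch assumes integer degrees of freedom and moderate arguments. It is not a replacement for Boost's general implementation.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(X > x) for a chi-square variable, integer dof >= 1."""
    if dof < 1 or dof != int(dof):
        raise ValueError("this sketch only handles positive integer dof")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum over j < dof/2 of (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, dof // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd dof: erfc(sqrt(x/2)) + exp(-x/2) * sum of (x/2)^(j+1/2) / Gamma(j+3/2)
    total = 0.0
    term = math.sqrt(half) / math.gamma(1.5)
    for j in range((dof - 1) // 2):
        total += term
        term *= half / (j + 1.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def chi2_quantile(p, dof, tol=1e-10):
    """Value x with P(X <= x) = p, found by bisection on the CDF."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    while 1.0 - chi2_sf(hi, dof) < p:  # grow the bracket until it holds the quantile
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if 1.0 - chi2_sf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


print(chi2_sf(5.991464547107979, 2), chi2_quantile(0.95, 2))
```

Bisection is slow but hard to get wrong. Non-integer degrees of freedom would need a regularized incomplete gamma routine instead of the finite sums used here.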

Python scipy chisquare returns different values than R chisquare

我们两清, submitted on 2019-12-21 05:04:29
Question: I am trying to use scipy.stats.chisquare. I have built a toy example:

    In [1]: import scipy.stats as sps
    In [2]: import numpy as np
    In [3]: sps.chisquare(np.array([38,27,23,17,11,4]), np.array([98, 100, 80, 85,60,23]))
    Out[11]: (240.74951271813072, 5.302429887719704e-50)

The same example in R returns:

    > chisq.test(matrix(c(38,27,23,17,11,4,98,100,80,85,60,23), ncol=2))
    Pearson's Chi-squared test
    data: matrix(c(38, 27, 23, 17, 11, 4, 98, 100, 80, 85, 60, 23), ncol = 2)
    X-squared = 7.0762, df =
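The two calls above appear to run different tests. `scipy.stats.chisquare` treats its second argument as expected frequencies (a goodness-of-fit test), whereas R's `chisq.test` on a two-column matrix tests independence in a 6x2 contingency table; the SciPy counterpart of the R call is `scipy.stats.chi2_contingency`. The plain-Python sketch below computes both statistics by hand to show where each number comes from. The function names are illustrative.

```python
col_a = [38, 27, 23, 17, 11, 4]
col_b = [98, 100, 80, 85, 60, 23]


def goodness_of_fit_statistic(obs, exp):
    """Pearson statistic with `exp` taken as the expected frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


def independence_statistic(table):
    """Pearson statistic and dof for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / N.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


gof = goodness_of_fit_statistic(col_a, col_b)

# Same layout as R's matrix(..., ncol=2): one row per category, two columns.
table = [list(pair) for pair in zip(col_a, col_b)]
indep, dof = independence_statistic(table)

print(f"goodness of fit: {gof:.4f}; independence: {indep:.4f} on {dof} dof")
```

The first statistic should match the 240.7495 reported by SciPy in the question and the second the 7.0762 reported by R, so the two outputs differ because the hypotheses differ, not because either library is wrong.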

Chi-square p value matrix in r

无人久伴, submitted on 2019-12-19 04:22:33
Question: Is there any way to find the chi-square p-value matrix in R (a matrix with the p-values between the attributes)? As an example, consider the iris data set. I am looking for a matrix as follows:

|                | Sepal length | Sepal width | Petal length | Petal width | Species |
|----------------|--------------|-------------|--------------|-------------|---------|
| Sepal length   |              |             |              |             |         |
| Sepal width    |              |             |              |             |         |
| Petal length   |              |             |              |             |         |
| Petal width    |              |             |              |             |         |
| Species        |              |             |              |             |         |

The