missing-data

How to handle missing NaNs for machine learning in Python

Submitted by 我的梦境 on 2019-11-30 04:15:27

How should missing (NaN) values be handled in a dataset before applying a machine learning algorithm? I have noticed that simply dropping rows with NaN values is not a smart approach. I usually interpolate (compute the mean) with pandas and fill in the missing data, which mostly works and improves classification accuracy, but it may not be the best thing to do. Here is a very important question: what is the best way to handle missing values in a dataset? For example, in this dataset only about 30% of it is complete original data:

```
Int64Index: 7049 entries, 0 to 7048
Data columns (total 31 columns):
left_eye_center_x    7039 non-null float64
```
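A minimal sketch of the mean-imputation approach described above (the column names and values are hypothetical; scikit-learn's `SimpleImputer` with `strategy="mean"` would perform the same per-column fill):

```python
import numpy as np
import pandas as pd

# Hypothetical feature columns with missing values
df = pd.DataFrame({
    "left_eye_center_x": [66.0, np.nan, 65.1, 64.8],
    "left_eye_center_y": [39.0, 38.2, np.nan, 37.5],
})

# Mean imputation: fill each column's NaNs with that column's mean
filled = df.fillna(df.mean())
```

Mean imputation preserves each column's mean but shrinks its variance, so model-based imputers (e.g. k-NN or iterative imputation) are often preferable when the share of missing data is large.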

R package caret confusionMatrix with missing categories

Submitted by 依然范特西╮ on 2019-11-30 04:10:54

Question: I am using the confusionMatrix function from the R package caret to calculate some statistics for my data. I have been passing my predictions and my actual values to the table function to build the table used by confusionMatrix, like so:

```
table(predicted, actual)
```

However, there are multiple possible outcomes (e.g. A, B, C, D), and my predictions do not always represent all the possibilities (e.g. only A, B, D). The resulting output of the table function does not…
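In R the usual fix is to force both vectors onto the same level set before tabulating, e.g. `table(factor(predicted, levels = levels(actual)), actual)`. The same idea can be sketched in pandas (an analogous illustration, not caret itself; the data here is made up):

```python
import pandas as pd

actual    = pd.Series(["A", "B", "C", "D", "A", "B"])
predicted = pd.Series(["A", "B", "D", "D", "A", "A"])  # never predicts "C"

labels = ["A", "B", "C", "D"]
# Build the table, then force both axes onto the full label set,
# filling categories that never occurred with zero counts
tab = (pd.crosstab(predicted, actual)
         .reindex(index=labels, columns=labels, fill_value=0))
```

The `reindex` step guarantees a square table even when a class is absent from the predictions, which is exactly what confusionMatrix needs.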

Dealing with missing values for correlations calculation

Submitted by 眉间皱痕 on 2019-11-30 01:29:27

I have a huge matrix with a lot of missing values, and I want to get the correlations between variables. I have already selected only the variables having more than 20% of missing values.

1. Is the solution cor(na.omit(matrix)) better than the one below?

```
cor(matrix, use = "pairwise.complete.obs")
```

2. Which method makes the most sense?

I would vote for the second option. It sounds like you have a fair amount of missing data, so you should be looking for a sensible multiple-imputation strategy to fill in the gaps. See Harrell's text "Regression Modeling Strategies" for a wealth of guidance on how to do…
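The difference between the two options can be sketched in pandas, where `DataFrame.corr` already uses pairwise-complete observations by default (toy data, not the asker's matrix):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df.iloc[::7, 0] = np.nan            # inject missing values into one column

# Listwise deletion: drop every row with any NaN first
# (the analog of cor(na.omit(matrix)) in R)
listwise = df.dropna().corr()

# Pairwise deletion: each pair of columns uses all rows where both
# are present (the analog of use = "pairwise.complete.obs");
# this is pandas' default behavior
pairwise = df.corr()
```

Pairwise deletion keeps more data per correlation, at the cost that different cells of the matrix are computed on different subsamples.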

What is the difference between <NA> and NA?

Submitted by 无人久伴 on 2019-11-29 23:34:42

I have a factor named SMOKE with levels "Y" and "N". Missing values were replaced with NA (from the initial level "NULL"). However, when I view the factor I get something like this:

```
> head(SMOKE)
[1] N    N    <NA> Y    Y    N
Levels: Y N
```

Why is R displaying NA as <NA>, and is there a difference? When you are dealing with factors, an NA wrapped in angle brackets (<NA>) indicates that it is in fact NA. When it is NA without brackets, it is not NA, but rather a proper factor level whose label is "NA".

```
# Note a 'real' NA and a string with the word "NA"
x <- factor(c("hello", NA, "world", "NA"))
x
```
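pandas categoricals draw the same distinction: np.nan is a true missing value and never becomes a category, while the literal string "NA" is an ordinary level (an analogous sketch in Python, not R):

```python
import numpy as np
import pandas as pd

# A real missing value (np.nan) vs. the literal string "NA"
x = pd.Categorical(["hello", np.nan, "world", "NA"])
```

Here `x.categories` contains the string "NA" but not the missing value; only the np.nan element answers True to `pd.isna`.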

How do I handle multiple kinds of missingness in R?

Submitted by 风流意气都作罢 on 2019-11-29 21:48:08

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:

```
 0-99  Data
   -1  Question not asked
   -5  Do not know
   -7  Refused to respond
   -9  Module not asked
```

Stata has a beautiful facility for handling these multiple kinds of missingness: it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands that look at missingness report answers for all the missing entries however specified, but you can still sort out the various kinds of missingness later on. This is…
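pandas has no native equivalent of Stata's extended missing values (.a through .z), but a common workaround is to split the sentinel codes into a NaN-valued data column plus a parallel "reason" column that records the kind of missingness (a sketch based on the codebook above; the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses using the codebook above
raw = pd.Series([42, -1, 17, -7, -9, 88])

MISSING_CODES = {-1: "not asked", -5: "do not know",
                 -7: "refused", -9: "module not asked"}

# Numeric column: sentinel codes become NaN, real data passes through
values = raw.where(~raw.isin(list(MISSING_CODES)), np.nan)

# Reason column: NaN where the value is real data, a label otherwise
reason = raw.map(MISSING_CODES)
```

Analyses then treat all kinds of missingness uniformly through `values`, while `reason` preserves the distinction for later inspection, mirroring Stata's behavior.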

How do I get a summary count of missing/NaN data by column in 'pandas'?

Submitted by 孤人 on 2019-11-29 21:28:57

In R I can quickly see a count of missing data using the summary command, but the equivalent pandas DataFrame method, describe, does not report these values. I gather I can do something like

```
len(mydata.index) - mydata.count()
```

to compute the number of missing values for each column, but I wonder whether there's a better idiom (or whether my approach is even right). Both describe and info report the count of non-missing values:

```
In [1]: df = DataFrame(np.random.randn(10, 2))

In [2]: df.iloc[3:6, 0] = np.nan

In [3]: df
Out[3]:
          0         1
0 -0.560342  1.862640
1 -1.237742  0.596384
2  0.603539 -1.561594
3       NaN  3.018954
4 …
```
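The idiomatic answer is `df.isna().sum()` (`isnull` is an alias), which counts missing values per column directly; a small sketch reproducing the setup above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2))
df.iloc[3:6, 0] = np.nan   # three NaNs in the first column

# Count of missing values per column
missing = df.isna().sum()
```

This avoids the `len(df.index) - df.count()` detour and reads as a direct statement of intent.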

Elegant way to report missing values in a data.frame

Submitted by 爱⌒轻易说出口 on 2019-11-29 18:43:40

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:

```
for (Var in names(airquality)) {
  missing <- sum(is.na(airquality[, Var]))
  if (missing > 0) {
    print(c(Var, missing))
  }
}
```

Edit: I'm dealing with data.frames that have dozens to hundreds of variables, so it's key that we only report variables with missing values.

Just use sapply:

```
> sapply(airquality, function(x) sum(is.na(x)))
  Ozone Solar.R    Wind    Temp   Month     Day
     37       7       0       0       0       0
```

You could also use apply or…
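The pandas equivalent of the sapply one-liner, filtered down to only the columns that actually contain missing values (toy data modeled loosely on airquality):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Ozone":   [41.0, np.nan, 12.0, np.nan],
    "Solar.R": [190.0, 118.0, np.nan, 313.0],
    "Wind":    [7.4, 8.0, 12.6, 11.5],
})

counts = df.isna().sum()
report = counts[counts > 0]   # keep only columns with missing values
```

With hundreds of variables, the boolean filter is what keeps the report readable, matching the asker's "only report variables with missing values" requirement.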

Winsorizing data by column in pandas with NaN

Submitted by 可紊 on 2019-11-29 15:41:35

I'd like to winsorize several columns of data in a pandas DataFrame. Each column has some NaNs, which affect the winsorization, so they need to be removed. The only way I know how to do this is to remove them for all of the data, rather than remove them only column by column. MWE:

```
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Create DataFrame
N, M, P = 10**5, 4, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
df.columns = ['one', 'two', 'three', …
```
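One way to winsorize column by column while leaving each column's NaNs in place is to clip every column at its own quantiles, since `Series.quantile` ignores NaN and `Series.clip` passes NaN through (a sketch; the 5% limits are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((1000, 2)), columns=["one", "two"])
df.iloc[::50, 0] = np.nan   # NaNs in one column only

def winsorize_col(s, p=0.05):
    """Clip a Series at its p and 1-p quantiles; NaNs pass through."""
    lo, hi = s.quantile([p, 1 - p])
    return s.clip(lo, hi)

wins = df.apply(winsorize_col)
```

Unlike dropping NaNs globally, each column here is winsorized against its own non-missing values, and the NaNs survive in place for downstream handling.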

Find missing month after grouping with dplyr

Submitted by 丶灬走出姿态 on 2019-11-29 15:22:22

I have a data frame with two columns that I am grouping by with dplyr, a column of months (as numerics, e.g. 1 through 12), and several columns with statistical data following that (values unimportant). An example:

```
ID_1 ID_2 month st1 st2
   1    1     1 0.5 0.2
   1    1     2 0.7 0.9
   1    1     3 1.1 1.7
   1    1     4 2.6 0.8
   1    1     5 1.8 1.3
   1    1     6 2.1 2.2
   1    1     7 0.5 0.2
   1    1     8 0.7 0.9
   1    1     9 1.1 1.7
   1    1    10 2.6 0.8
   1    1    11 1.8 1.3
   1    1    12 2.1 2.2
   1    2     1 0.5 0.2
   1    2     2 0.7 0.9
   1    2     3 1.1 1.7
   1    2     4 2.6 0.8
   1    2     5 1.8 1.3
   1    2     6 2.1 2.2
   1    2     7 0.5 0.2
   1    2     9 1.1 1.7
   1    2    10 2.6 0.8
   1    2    11 1.8 1.3
   1    2    12 2.1 2.2
```

For the second grouping (ID_1 =…
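The missing month per group can be found by comparing each group's months against the full set 1–12; a pandas sketch of the second group from the data above (in R, tidyr's complete would play the analogous role):

```python
import pandas as pd

# The second grouping from the example: month 8 is absent
df = pd.DataFrame({
    "ID_1":  [1] * 11,
    "ID_2":  [2] * 11,
    "month": [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12],
})

full = set(range(1, 13))
# For each (ID_1, ID_2) group, list the months not present
missing_months = (df.groupby(["ID_1", "ID_2"])["month"]
                    .apply(lambda m: sorted(full - set(m))))
```

Set difference per group avoids materializing the complete grid when all you need is the gaps.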

Replace all NA with FALSE in selected columns in R

Submitted by 佐手、 on 2019-11-29 11:19:44

Question: I have a question similar to this one, but my dataset is a bit bigger: 50 columns, with 1 column as a UID and the other columns carrying either TRUE or NA. I want to change all the NA values to FALSE, but I don't want to use an explicit loop. Can plyr do the trick? Thanks.

UPDATE #1: Thanks for the quick reply, but what if my dataset is like the one below?

```
df <- data.frame(
  id = c(rep(1:19), NA),
  x1 = sample(c(NA, TRUE), 20, replace = TRUE),
  x2 = sample(c(NA, TRUE), 20, replace = TRUE)
)
```

I only want x1 and x2 to be…
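In pandas the analogous selected-column fill is a one-liner with fillna; a sketch mirroring the data.frame above (column names kept, values illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": list(range(1, 20)) + [np.nan],
    "x1": [True, np.nan] * 10,
    "x2": [np.nan, True] * 10,
})

cols = ["x1", "x2"]                  # leave "id" untouched
df[cols] = df[cols].fillna(False)
```

Restricting the assignment to `cols` is what keeps the NA in the id column intact while every NA in x1 and x2 becomes False.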