missing-data

How to handle missing NaNs for machine learning in Python

Submitted by 我的梦境 on 2019-11-30 04:15:27

How should missing (NaN) values be handled in a dataset before applying a machine learning algorithm? I have noticed that simply dropping rows with NaN values is not a smart approach. I usually interpolate (compute the mean) with pandas and fill in the missing data, which mostly works and improves classification accuracy, but it may not be the best thing to do. Here is a very important question: what is the best way to handle missing values in a dataset? For example, in this dataset only about 30% of it is complete original data:

```
Int64Index: 7049 entries, 0 to 7048
Data columns (total 31 columns):
left_eye_center_x    7039 non-null float64
```
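A minimal sketch of the mean-imputation approach described above (the column names and values are hypothetical; scikit-learn's `SimpleImputer` with `strategy="mean"` would perform the same per-column fill):

```python
import numpy as np
import pandas as pd

# Hypothetical feature columns with missing values
df = pd.DataFrame({
    "left_eye_center_x": [66.0, np.nan, 65.1, 64.8],
    "left_eye_center_y": [39.0, 38.2, np.nan, 37.5],
})

# Mean imputation: fill each column's NaNs with that column's mean
filled = df.fillna(df.mean())
```

Mean imputation preserves each column's mean but shrinks its variance, so model-based imputers (e.g. k-NN or iterative imputation) are often preferable when the share of missing data is large.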

R package caret confusionMatrix with missing categories

Submitted by 依然范特西╮ on 2019-11-30 04:10:54

Question: I am using the confusionMatrix function from the R package caret to calculate some statistics for my data. I have been passing my predictions and my actual values to the table function to build the table used by confusionMatrix, like so:

```
table(predicted, actual)
```

However, there are multiple possible outcomes (e.g. A, B, C, D), and my predictions do not always represent all the possibilities (e.g. only A, B, D). The resulting output of the table function does not…
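In R the usual fix is to force both vectors onto the same level set before tabulating, e.g. `table(factor(predicted, levels = levels(actual)), actual)`. The same idea can be sketched in pandas (an analogous illustration, not caret itself; the data here is made up):

```python
import pandas as pd

actual    = pd.Series(["A", "B", "C", "D", "A", "B"])
predicted = pd.Series(["A", "B", "D", "D", "A", "A"])  # never predicts "C"

labels = ["A", "B", "C", "D"]
# Build the table, then force both axes onto the full label set,
# filling categories that never occurred with zero counts
tab = (pd.crosstab(predicted, actual)
         .reindex(index=labels, columns=labels, fill_value=0))
```

The `reindex` step guarantees a square table even when a class is absent from the predictions, which is exactly what confusionMatrix needs.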

Dealing with missing values for correlations calculation

Submitted by 眉间皱痕 on 2019-11-30 01:29:27

I have a huge matrix with a lot of missing values, and I want to get the correlations between variables. I have already selected only the variables having more than 20% of missing values.

1. Is the solution cor(na.omit(matrix)) better than the one below?

```
cor(matrix, use = "pairwise.complete.obs")
```

2. Which method makes the most sense?

I would vote for the second option. It sounds like you have a fair amount of missing data, so you should be looking for a sensible multiple-imputation strategy to fill in the gaps. See Harrell's text "Regression Modeling Strategies" for a wealth of guidance on how to do…
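The difference between the two options can be sketched in pandas, where `DataFrame.corr` already uses pairwise-complete observations by default (toy data, not the asker's matrix):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df.iloc[::7, 0] = np.nan            # inject missing values into one column

# Listwise deletion: drop every row with any NaN first
# (the analog of cor(na.omit(matrix)) in R)
listwise = df.dropna().corr()

# Pairwise deletion: each pair of columns uses all rows where both
# are present (the analog of use = "pairwise.complete.obs");
# this is pandas' default behavior
pairwise = df.corr()
```

Pairwise deletion keeps more data per correlation, at the cost that different cells of the matrix are computed on different subsamples.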

What is the difference between <NA> and NA?

Submitted by 无人久伴 on 2019-11-29 23:34:42

I have a factor named SMOKE with levels "Y" and "N". Missing values were replaced with NA (from the initial level "NULL"). However, when I view the factor I get something like this:

```
> head(SMOKE)
[1] N    N    <NA> Y    Y    N
Levels: Y N
```

Why is R displaying NA as <NA>, and is there a difference? When you are dealing with factors, an NA wrapped in angle brackets (<NA>) indicates that it is in fact NA. When it is NA without brackets, it is not NA, but rather a proper factor level whose label is "NA".

```
# Note a 'real' NA and a string with the word "NA"
x <- factor(c("hello", NA, "world", "NA"))
x
```
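pandas categoricals draw the same distinction: np.nan is a true missing value and never becomes a category, while the literal string "NA" is an ordinary level (an analogous sketch in Python, not R):

```python
import numpy as np
import pandas as pd

# A real missing value (np.nan) vs. the literal string "NA"
x = pd.Categorical(["hello", np.nan, "world", "NA"])
```

Here `x.categories` contains the string "NA" but not the missing value; only the np.nan element answers True to `pd.isna`.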

How do I handle multiple kinds of missingness in R?

Submitted by 风流意气都作罢 on 2019-11-29 21:48:08

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:

```
 0-99  Data
   -1  Question not asked
   -5  Do not know
   -7  Refused to respond
   -9  Module not asked
```

Stata has a beautiful facility for handling these multiple kinds of missingness: it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands that look at missingness report answers for all the missing entries however specified, but you can still sort out the various kinds of missingness later on. This is…
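pandas has no native equivalent of Stata's extended missing values (.a through .z), but a common workaround is to split the sentinel codes into a NaN-valued data column plus a parallel "reason" column that records the kind of missingness (a sketch based on the codebook above; the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses using the codebook above
raw = pd.Series([42, -1, 17, -7, -9, 88])

MISSING_CODES = {-1: "not asked", -5: "do not know",
                 -7: "refused", -9: "module not asked"}

# Numeric column: sentinel codes become NaN, real data passes through
values = raw.where(~raw.isin(list(MISSING_CODES)), np.nan)

# Reason column: NaN where the value is real data, a label otherwise
reason = raw.map(MISSING_CODES)
```

Analyses then treat all kinds of missingness uniformly through `values`, while `reason` preserves the distinction for later inspection, mirroring Stata's behavior.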

How do I get a summary count of missing/NaN data by column in 'pandas'?

Submitted by 孤人 on 2019-11-29 21:28:57

In R I can quickly see a count of missing data using the summary command, but the equivalent pandas DataFrame method, describe, does not report these values. I gather I can do something like

```
len(mydata.index) - mydata.count()
```

to compute the number of missing values for each column, but I wonder whether there's a better idiom (or whether my approach is even right). Both describe and info report the count of non-missing values:

```
In [1]: df = DataFrame(np.random.randn(10, 2))

In [2]: df.iloc[3:6, 0] = np.nan

In [3]: df
Out[3]:
          0         1
0 -0.560342  1.862640
1 -1.237742  0.596384
2  0.603539 -1.561594
3       NaN  3.018954
4 …
```
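The idiomatic answer is `df.isna().sum()` (`isnull` is an alias), which counts missing values per column directly; a small sketch reproducing the setup above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2))
df.iloc[3:6, 0] = np.nan   # three NaNs in the first column

# Count of missing values per column
missing = df.isna().sum()
```

This avoids the `len(df.index) - df.count()` detour and reads as a direct statement of intent.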

Elegant way to report missing values in a data.frame

Submitted by 爱⌒轻易说出口 on 2019-11-29 18:43:40

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:

```
for (Var in names(airquality)) {
  missing <- sum(is.na(airquality[, Var]))
  if (missing > 0) {
    print(c(Var, missing))
  }
}
```

Edit: I'm dealing with data.frames that have dozens to hundreds of variables, so it's key that we only report variables with missing values.

Just use sapply:

```
> sapply(airquality, function(x) sum(is.na(x)))
  Ozone Solar.R    Wind    Temp   Month     Day
     37       7       0       0       0       0
```

You could also use apply or…
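The pandas equivalent of the sapply one-liner, filtered down to only the columns that actually contain missing values (toy data modeled loosely on airquality):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Ozone":   [41.0, np.nan, 12.0, np.nan],
    "Solar.R": [190.0, 118.0, np.nan, 313.0],
    "Wind":    [7.4, 8.0, 12.6, 11.5],
})

counts = df.isna().sum()
report = counts[counts > 0]   # keep only columns with missing values
```

With hundreds of variables, the boolean filter is what keeps the report readable, matching the asker's "only report variables with missing values" requirement.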

Winsorizing data by column in pandas with NaN

Submitted by 可紊 on 2019-11-29 15:41:35

I'd like to winsorize several columns of data in a pandas DataFrame. Each column has some NaNs, which affect the winsorization, so they need to be removed. The only way I know how to do this is to remove them for all of the data, rather than remove them only column by column. MWE:

```
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Create DataFrame
N, M, P = 10**5, 4, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
df.columns = ['one', 'two', 'three', …
```
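One way to winsorize column by column while leaving each column's NaNs in place is to clip every column at its own quantiles, since `Series.quantile` ignores NaN and `Series.clip` passes NaN through (a sketch; the 5% limits are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((1000, 2)), columns=["one", "two"])
df.iloc[::50, 0] = np.nan   # NaNs in one column only

def winsorize_col(s, p=0.05):
    """Clip a Series at its p and 1-p quantiles; NaNs pass through."""
    lo, hi = s.quantile([p, 1 - p])
    return s.clip(lo, hi)

wins = df.apply(winsorize_col)
```

Unlike dropping NaNs globally, each column here is winsorized against its own non-missing values, and the NaNs survive in place for downstream handling.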

Find missing month after grouping with dplyr

Submitted by 丶灬走出姿态 on 2019-11-29 15:22:22

I have a data frame with two columns that I am grouping by with dplyr, a column of months (as numerics, e.g. 1 through 12), and several columns with statistical data following that (values unimportant). An example:

```
ID_1 ID_2 month st1 st2
   1    1     1 0.5 0.2
   1    1     2 0.7 0.9
   1    1     3 1.1 1.7
   1    1     4 2.6 0.8
   1    1     5 1.8 1.3
   1    1     6 2.1 2.2
   1    1     7 0.5 0.2
   1    1     8 0.7 0.9
   1    1     9 1.1 1.7
   1    1    10 2.6 0.8
   1    1    11 1.8 1.3
   1    1    12 2.1 2.2
   1    2     1 0.5 0.2
   1    2     2 0.7 0.9
   1    2     3 1.1 1.7
   1    2     4 2.6 0.8
   1    2     5 1.8 1.3
   1    2     6 2.1 2.2
   1    2     7 0.5 0.2
   1    2     9 1.1 1.7
   1    2    10 2.6 0.8
   1    2    11 1.8 1.3
   1    2    12 2.1 2.2
```

For the second grouping (ID_1 =…
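The missing month per group can be found by comparing each group's months against the full set 1–12; a pandas sketch of the second group from the data above (in R, tidyr's complete would play the analogous role):

```python
import pandas as pd

# The second grouping from the example: month 8 is absent
df = pd.DataFrame({
    "ID_1":  [1] * 11,
    "ID_2":  [2] * 11,
    "month": [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12],
})

full = set(range(1, 13))
# For each (ID_1, ID_2) group, list the months not present
missing_months = (df.groupby(["ID_1", "ID_2"])["month"]
                    .apply(lambda m: sorted(full - set(m))))
```

Set difference per group avoids materializing the complete grid when all you need is the gaps.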

Replace all NA with FALSE in selected columns in R

Submitted by 佐手、 on 2019-11-29 11:19:44

Question: I have a question similar to this one, but my dataset is a bit bigger: 50 columns, with 1 column as a UID and the other columns carrying either TRUE or NA. I want to change all the NA values to FALSE, but I don't want to use an explicit loop. Can plyr do the trick? Thanks.

UPDATE #1: Thanks for the quick reply, but what if my dataset is like the one below?

```
df <- data.frame(
  id = c(rep(1:19), NA),
  x1 = sample(c(NA, TRUE), 20, replace = TRUE),
  x2 = sample(c(NA, TRUE), 20, replace = TRUE)
)
```

I only want x1 and x2 to be…
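In pandas the analogous selected-column fill is a one-liner with fillna; a sketch mirroring the data.frame above (column names kept, values illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": list(range(1, 20)) + [np.nan],
    "x1": [True, np.nan] * 10,
    "x2": [np.nan, True] * 10,
})

cols = ["x1", "x2"]                  # leave "id" untouched
df[cols] = df[cols].fillna(False)
```

Restricting the assignment to `cols` is what keeps the NA in the id column intact while every NA in x1 and x2 becomes False.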