missing-data | 易学教程

Function to count NA values at each level of a factor

I have this data frame:

    set.seed(50)
    data <- data.frame(age=c(rep("juv", 10), rep("ad", 10)),
                       sex=c(rep("m", 10), rep("f", 10)),
                       size=c(rep("large", 10), rep("small", 10)),
                       length=rnorm(20),
                       width=rnorm(20),
                       height=rnorm(20))
    data$length[sample(1:20, size=8, replace=F)] <- NA
    data$width[sample(1:20, size=8, replace=F)] <- NA
    data$height[sample(1:20, size=8, replace=F)] <- NA

      age sex  size      length       width      height
    1 juv   m large          NA -0.34992735  0.10955641
    2 juv   m large -0.84160374          NA -0.41341885
    3 juv   m large  0.03299794 -1.58987765          NA
    4 juv   m large          NA          NA          NA
    5 juv   m large -1.72760411          NA  0.09534935
    6 juv

R — Carry last observation forward n times

I am attempting to carry non-missing observations forward and populate the next two missing observations (although I imagine a solution to this problem would be broadly applicable to carrying observations forward through n rows). In the example data frame below I would like to carry forward (propagate) the flag_a and flag_b values for each id for two rows. Here is an example of my data with the desired output included:

    id <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
    flag_a <- as.numeric(c(NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA))
    flag_b <- as
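The question asks for an R solution. As a rough analogue only, the same "carry forward at most n rows within each id" idea can be sketched in pandas, where the grouped `ffill` method accepts a `limit` argument. The example reuses the `id` and `flag_a` values from the question; the column name `flag_a_filled` is made up for illustration, and `flag_b` is left out because its values are cut off in the excerpt.

```python
import numpy as np
import pandas as pd

nan = np.nan

# id and flag_a copied from the question; flag_b is omitted because the excerpt truncates it
df = pd.DataFrame({
    "id": [1] * 10 + [2] * 8,
    "flag_a": [nan, nan, 1, nan, nan, nan, nan, nan, nan, nan,
               nan, 1, nan, nan, nan, nan, nan, nan],
})

# Forward-fill within each id, but propagate each observation through at most 2 rows.
# "flag_a_filled" is a hypothetical column name used only in this sketch.
df["flag_a_filled"] = df.groupby("id")["flag_a"].ffill(limit=2)

print(df)
```

Changing `limit=2` to another integer carries each observation forward through that many rows instead.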

str.format() raises KeyError

Question: The following code raises a KeyError exception:

    addr_list_formatted = []
    addr_list_idx = 0
    for addr in addr_list:  # addr_list is a list
        addr_list_idx = addr_list_idx + 1
        addr_list_formatted.append("""
        "{0}"
        {
            "gamedir" "str"
            "address" "{1}"
        }
        """.format(addr_list_idx, addr))

Why? I am using Python 3.1.

Answer 1: The problem is those { and } characters you have there that don't specify a key for formatting. You need to double them up, so change your code to:

    addr_list_formatted.append("""
    "{0}"
    {{
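A minimal sketch of the brace-doubling fix described in the answer: in a `str.format()` template, `{{` and `}}` are emitted as literal `{` and `}`, while `{0}` and `{1}` remain replacement fields. The helper name `format_entry` and the sample addresses are assumptions made for illustration; they do not come from the original post.

```python
def format_entry(index, address):
    """Render one block; literal braces are doubled so str.format() leaves them alone."""
    template = (
        '"{0}"\n'
        '{{\n'
        '    "gamedir" "str"\n'
        '    "address" "{1}"\n'
        '}}\n'
    )
    return template.format(index, address)


# Hypothetical sample addresses, used only to exercise the helper
addr_list = ["10.0.0.1:27015", "10.0.0.2:27015"]

# enumerate(..., start=1) replaces the manual addr_list_idx counter
addr_list_formatted = [
    format_entry(idx, addr) for idx, addr in enumerate(addr_list, start=1)
]

print(addr_list_formatted[0])
```

With single braces, `format()` would try to parse the text between `{` and `}` as a field name, which is what produces the KeyError in the question.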

How to lowercase a pandas dataframe string column if it has missing values?

Question: The following code does not work.

    import pandas as pd
    import numpy as np
    df=pd.DataFrame(['ONE','Two', np.nan],columns=['x'])
    xLower = df["x"].map(lambda x: x.lower())

How should I tweak it to get xLower = ['one','two',np.nan]? Efficiency is important since the real data frame is huge.

Answer 1: Use pandas vectorized string methods; as in the documentation: these methods exclude missing/NA values automatically. .str.lower() is the very first example there:

    >>> df['x'].str.lower()
    0    one
    1    two
    2    NaN
    Name: x, dtype: object

Another possible solution, in case the column has not only strings but numbers too,
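A short sketch of why the original `map` call fails and how the vectorized accessor avoids it: `np.nan` is a float, so `x.lower()` raises AttributeError on the missing entry, whereas `.str.lower()` passes missing values through. The second variant, which guards with `isinstance`, is an illustrative assumption for columns that also hold non-string values; it is not taken from the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(["ONE", "Two", np.nan], columns=["x"])

# Vectorized string method: missing values are skipped and stay NaN
x_lower = df["x"].str.lower()

# Illustrative alternative (an assumption, not the answer's code):
# lowercase only real strings and leave every other value untouched
mixed = pd.Series(["ONE", 2, np.nan], dtype=object)
mixed_lower = mixed.map(lambda v: v.lower() if isinstance(v, str) else v)

print(x_lower.tolist())
print(mixed_lower.tolist())
```

For a large frame the `.str.lower()` form is usually the one to try first, since it avoids a Python-level lambda per element.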

What is the difference between <NA> and NA?

Question: I have a factor named SMOKE with levels "Y" and "N". Missing values were replaced with NA (from the initial level "NULL"). However, when I view the factor I get something like this:

    head(SMOKE)
    N N <NA> Y Y N
    Levels: Y N

Why is R displaying NA as <NA>? And is there a difference?

Answer 1: When you are dealing with factors, an NA wrapped in angle brackets (<NA>) indicates that it is in fact NA. When it is NA without brackets, then it is not NA, but rather a proper factor whose

Fill in missing pandas data with previous non-missing value, grouped by key

I am dealing with pandas DataFrames like this:

       id    x
    0   1   10
    1   1   20
    2   2  100
    3   2  200
    4   1  NaN
    5   2  NaN
    6   1  300
    7   1  NaN

I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row with the same 'id' value:

       id    x
    0   1   10
    1   1   20
    2   2  100
    3   2  200
    4   1   20
    5   2  200
    6   1  300
    7   1  300

Is there some slick way to do this without manually looping over rows?

You could perform a groupby/forward-fill operation on each group:

    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1], 'x':[10,20,100,200,np.nan,np.nan,300,np.nan]})
    df['x'] = df.groupby(['id'])['x'].ffill()
    print
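As a quick check of the groupby/forward-fill approach, the sketch below rebuilds the question's frame and compares the filled column with the desired output listed in the question. The names `filled` and `desired` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2, 1, 2, 1, 1],
    "x": [10, 20, 100, 200, np.nan, np.nan, 300, np.nan],
})

# Forward-fill x separately within each id; the original row order is preserved
filled = df.groupby("id")["x"].ffill()

# Desired values copied from the question's second table
desired = [10, 20, 100, 200, 20, 200, 300, 300]

print(filled.tolist() == desired)
```

Because the fill runs per group, row 4 (id 1) takes its value from row 1 rather than from the id 2 row directly above it.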

How do I handle multiple kinds of missingness in R?

Question: Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:

    0-99  Data
    -1    Question not asked
    -5    Do not know
    -7    Refused to respond
    -9    Module not asked

Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing

How to fill NAs with LOCF by factors in data frame, split by country

I have the following data frame (simplified), in which the country variable is a factor and the value variable has missing values:

    country value
    AUT     NA
    AUT     5
    AUT     NA
    AUT     NA
    GER     NA
    GER     NA
    GER     7
    GER     NA
    GER     NA

The following generates the above data frame:

    data <- data.frame(country=c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"),
                       value=c(NA, 5, NA, NA, NA, NA, 7, NA, NA))

Now, I would like to replace the NA values in each country subset using the method last observation carried forward (LOCF). I know the command na.locf in the zoo package. data <- na.locf(data) would give me the

Handling missing/incomplete data in R--is there function to mask but not remove NAs?

As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well. For instance, many R functions have an na.rm flag that, when set to TRUE, removes the NAs before computing the result:

    > v = mean(c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67), na.rm=T)
    > v
    [1] 32.71429

But if you want to deal with NAs before the function call, you need to do something like this:

to remove each 'NA' from a vector:

    vx = vx[!is.na(vx)]

to replace each 'NA' in a vector with a '0':

    ifelse(is.na(vx), 0, vx)

to remove each entire row that contains 'NA' from a data frame:

    dfx = dfx[complete