missing-data

Function to count NA values at each level of a factor

心不动则不痛 submitted on 2019-11-29 08:35:13
I have this dataframe:

    set.seed(50)
    data <- data.frame(age=c(rep("juv", 10), rep("ad", 10)),
                       sex=c(rep("m", 10), rep("f", 10)),
                       size=c(rep("large", 10), rep("small", 10)),
                       length=rnorm(20),
                       width=rnorm(20),
                       height=rnorm(20))
    data$length[sample(1:20, size=8, replace=F)] <- NA
    data$width[sample(1:20, size=8, replace=F)] <- NA
    data$height[sample(1:20, size=8, replace=F)] <- NA

        age sex  size      length       width      height
    1   juv   m large          NA -0.34992735  0.10955641
    2   juv   m large -0.84160374          NA -0.41341885
    3   juv   m large  0.03299794 -1.58987765          NA
    4   juv   m large          NA          NA          NA
    5   juv   m large -1.72760411          NA  0.09534935
    6   juv
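For readers working in pandas rather than R, the same idea (count missing values per level of a grouping column) can be sketched as follows. This is only an analogue under stated assumptions, not the answer to the R question: the small frame `df` and its values are hypothetical and merely mirror the layout above.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the question's layout (values are illustrative).
df = pd.DataFrame({
    "age": ["juv"] * 3 + ["ad"] * 3,
    "length": [np.nan, -0.84, 0.03, np.nan, np.nan, 1.2],
    "width": [-0.35, np.nan, -1.59, 0.4, np.nan, np.nan],
})

# isna() yields a boolean frame; summing it within each group counts the NAs
# per column at each level of "age".
na_counts = df[["length", "width"]].isna().groupby(df["age"]).sum()
print(na_counts)
```

The grouping key is passed as a Series so that the boolean frame does not need to carry the `age` column itself.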

R — Carry last observation forward n times

若如初见. submitted on 2019-11-29 08:16:13
I am attempting to carry non-missing observations forward and populate the next two missing observations (although I imagine a solution to this problem would be broadly applicable to carrying observations forward through n rows...). In the example data frame below I would like to carry forward (propagate) the flag_a and flag_b values for each id for two rows. Here is an example of my data with the desired output included:

    id <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
    flag_a <- as.numeric(c(NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA))
    flag_b <- as
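A pandas analogue of "carry forward at most n rows within each id" is a grouped forward fill with a limit. The sketch below is an illustration only, not the R solution the question asks for; the shortened `id` and `flag_a` vectors are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened version of the question's data.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "flag_a": [np.nan, 1, np.nan, np.nan, np.nan, 1, np.nan, np.nan, np.nan],
})

# Forward-fill within each id, but propagate each observation at most 2 rows.
df["flag_a_filled"] = df.groupby("id")["flag_a"].ffill(limit=2)
print(df)
```

Because the fill is done per group, a value from id 1 never leaks into the first rows of id 2.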

str.format() raises KeyError

半城伤御伤魂 submitted on 2019-11-29 04:20:29
Question:

The following code raises a KeyError exception:

    addr_list_formatted = []
    addr_list_idx = 0
    for addr in addr_list:  # addr_list is a list
        addr_list_idx = addr_list_idx + 1
        addr_list_formatted.append("""
    "{0}"
    {
        "gamedir" "str"
        "address" "{1}"
    }
    """.format(addr_list_idx, addr))

Why? I am using Python 3.1.

Answer 1:

The problem is those { and } characters you have there that don't specify a key for formatting. You need to double them up, so change your code to:

    addr_list_formatted.append("""
    "{0}"
    {{
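The brace-doubling rule described in the answer can be seen in a small self-contained sketch. The addresses in `addr_list` and the exact whitespace of `template` are assumptions made for illustration; only the doubled `{{` and `}}` matter.

```python
# Hypothetical input; the original question does not show addr_list's contents.
addr_list = ["192.0.2.1:27015", "192.0.2.2:27015"]

# Literal braces are written as {{ and }}; {0} and {1} remain replacement fields.
template = '"{0}"\n{{\n    "gamedir" "str"\n    "address" "{1}"\n}}\n'

# enumerate(..., start=1) replaces the manual index counter.
addr_list_formatted = [
    template.format(idx, addr) for idx, addr in enumerate(addr_list, start=1)
]
print(addr_list_formatted[0])
```

After formatting, each doubled brace collapses to a single literal brace in the output.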

How to lowercase a pandas dataframe string column if it has missing values?

风流意气都作罢 submitted on 2019-11-29 02:01:48
Question:

The following code does not work.

    import pandas as pd
    import numpy as np
    df=pd.DataFrame(['ONE','Two', np.nan],columns=['x'])
    xLower = df["x"].map(lambda x: x.lower())

How should I tweak it to get xLower = ['one','two',np.nan]? Efficiency is important since the real data frame is huge.

Answer 1:

Use pandas vectorized string methods; as the documentation says, these methods exclude missing/NA values automatically. .str.lower() is the very first example there:

    >>> df['x'].str.lower()
    0    one
    1    two
    2    NaN
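Put together as a runnable script, the approach from the answer looks like the sketch below. The name `x_lower` is chosen here for illustration; the data frame is the one from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(["ONE", "Two", np.nan], columns=["x"])

# The .str accessor applies lower() element-wise and leaves NaN untouched,
# whereas map(lambda x: x.lower()) fails on the float NaN.
x_lower = df["x"].str.lower()
print(x_lower)
```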

How to lowercase a pandas dataframe string column if it has missing values?

試著忘記壹切 submitted on 2019-11-28 22:18:02
The following code does not work.

    import pandas as pd
    import numpy as np
    df=pd.DataFrame(['ONE','Two', np.nan],columns=['x'])
    xLower = df["x"].map(lambda x: x.lower())

How should I tweak it to get xLower = ['one','two',np.nan]? Efficiency is important since the real data frame is huge.

Use pandas vectorized string methods; as the documentation says, these methods exclude missing/NA values automatically. .str.lower() is the very first example there:

    >>> df['x'].str.lower()
    0    one
    1    two
    2    NaN
    Name: x, dtype: object

Another possible solution, in case the column has not only strings but numbers too,

What is the difference between <NA> and NA?

廉价感情. submitted on 2019-11-28 21:28:01
Question:

I have a factor named SMOKE with levels "Y" and "N". Missing values were replaced with NA (from the initial level "NULL"). However, when I view the factor I get something like this:

    head(SMOKE)
    N N <NA> Y Y N
    Levels: Y N

Why is R displaying NA as <NA>? And is there a difference?

Answer 1:

When you are dealing with factors, an NA wrapped in angle brackets (<NA>) indicates that it is in fact NA. When it is NA without brackets, then it is not NA, but rather a proper factor whose

Fill in missing pandas data with previous non-missing value, grouped by key

谁都会走 submitted on 2019-11-28 20:40:00
I am dealing with pandas DataFrames like this:

       id    x
    0   1   10
    1   1   20
    2   2  100
    3   2  200
    4   1  NaN
    5   2  NaN
    6   1  300
    7   1  NaN

I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row with the same 'id' value:

       id    x
    0   1   10
    1   1   20
    2   2  100
    3   2  200
    4   1   20
    5   2  200
    6   1  300
    7   1  300

Is there some slick way to do this without manually looping over rows?

You could perform a groupby/forward-fill operation on each group:

    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1], 'x':[10,20,100,200,np.nan,np.nan,300,np.nan]})
    df['x'] = df.groupby(['id'])['x'].ffill()
    print

How do I handle multiple kinds of missingness in R?

随声附和 submitted on 2019-11-28 19:09:06
Question:

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:

    0-99  Data
    -1    Question not asked
    -5    Do not know
    -7    Refused to respond
    -9    Module not asked

Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing
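One common workaround when a tool has only a single missing value is to keep the reason for missingness in a parallel column. The pandas sketch below illustrates that idea using the codebook above; it is an assumption-laden illustration, not the R answer, and the names `MISSING_CODES`, `raw`, `reason`, and `values` as well as the sample responses are hypothetical.

```python
import pandas as pd

# Codes taken from the codebook in the question.
MISSING_CODES = {
    -1: "question not asked",
    -5: "do not know",
    -7: "refused to respond",
    -9: "module not asked",
}

# Hypothetical raw survey responses.
raw = pd.Series([12, -5, 40, -7, -1, 99])

# reason holds the kind of missingness, and is NaN where the value is real data.
reason = raw.map(MISSING_CODES)

# values keeps real data and turns every coded entry into a single NaN.
values = raw.where(~raw.isin(list(MISSING_CODES)))

print(values.mean())          # computed over the real data only
print(reason.value_counts())  # tally of each kind of missingness
```

Analyses then run on `values`, while `reason` preserves the distinction the codebook makes.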

How to fill NAs with LOCF by factors in data frame, split by country

我怕爱的太早我们不能终老 submitted on 2019-11-28 18:48:53
I have the following data frame (simplified), in which the country variable is a factor and the value variable has missing values:

    country value
    AUT     NA
    AUT     5
    AUT     NA
    AUT     NA
    GER     NA
    GER     NA
    GER     7
    GER     NA
    GER     NA

The following generates the above data frame:

    data <- data.frame(country=c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"),
                       value=c(NA, 5, NA, NA, NA, NA, 7, NA, NA))

Now, I would like to replace the NA values in each country subset using the method last observation carried forward (LOCF). I know the command na.locf in the zoo package. data <- na.locf(data) would give me the

Handling missing/incomplete data in R--is there function to mask but not remove NAs?

断了今生、忘了曾经 submitted on 2019-11-28 18:40:57
As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well. For instance, many R functions have an na.rm flag that, when set to TRUE, removes the NAs:

    > v = mean(c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67), na.rm=T)
    > v
    [1] 32.71429

Here the mean is taken over the seven non-NA values (5, 6, 12, 87, 9, 43, 67). But if you want to deal with NAs before the function call, you need to do something like this:

to remove each NA from a vector:

    vx = vx[!is.na(vx)]

to replace each NA in a vector with 0:

    ifelse(is.na(vx), 0, vx)

to remove each entire row that contains NA from a data frame:

    dfx = dfx[complete
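For comparison, the distinction between masking and removing can be illustrated with NumPy masked arrays. This is a Python analogue offered as a sketch, not an answer about R: the array `v` reuses the numbers from the example above, with NaN standing in for NA.

```python
import numpy as np

# Same numbers as the R example; NaN plays the role of NA.
v = np.array([5, np.nan, 6, 12, np.nan, 87, 9, np.nan, 43, 67])

# masked_invalid hides NaN entries without dropping them from the array.
masked = np.ma.masked_invalid(v)

print(len(masked))     # the array keeps its original length
print(masked.count())  # number of unmasked (valid) entries
print(masked.mean())   # the mean skips masked entries
```

The positions of the missing entries are preserved, so results can still be aligned with the original data.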