missing-data | 易学教程

Why does max() sometimes return nan and sometimes ignores it?

Question: This question is motivated by an answer I gave a while ago. Let's say I have a dataframe like this:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 10], 'c': [np.nan, 5, 34]})

         a     b     c
    0  1.0   3.0   NaN
    1  2.0   NaN   5.0
    2  NaN  10.0  34.0

and I want to replace each NaN with the maximum of its row. I can do

    df.apply(lambda row: row.fillna(row.max()), axis=1)

which gives me the desired output

          a     b     c
    0   1.0   3.0   3.0
    1   2.0   5.0   5.0
    2  34.0  10.0  34.0

When I, however, use
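The excerpt breaks off before the call that misbehaves, so the following is only a minimal sketch of one way order-dependent results can arise, assuming Python's built-in `max` ends up being applied to values that contain NaN. Every comparison with NaN is false, so the result depends on where the NaN sits in the sequence. The variable names are illustrative and not from the question.

```python
import math

nan = float("nan")

# Built-in max keeps the first element as its running maximum and replaces it
# only when a later element compares greater. Comparisons with NaN are always
# False, so a leading NaN is never replaced and a trailing NaN never wins.
nan_last = max([1.0, 3.0, nan])
nan_first = max([nan, 1.0, 3.0])

print(nan_last)   # 3.0
print(nan_first)  # nan
```

By contrast, `row.max()` in the question is the pandas `Series.max` method, which skips NaN by default (`skipna=True`).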

Function to count NA values at each level of a factor

Question: I have this dataframe:

    set.seed(50)
    data <- data.frame(age=c(rep("juv", 10), rep("ad", 10)),
                       sex=c(rep("m", 10), rep("f", 10)),
                       size=c(rep("large", 10), rep("small", 10)),
                       length=rnorm(20), width=rnorm(20), height=rnorm(20))
    data$length[sample(1:20, size=8, replace=F)] <- NA
    data$width[sample(1:20, size=8, replace=F)] <- NA
    data$height[sample(1:20, size=8, replace=F)] <- NA

       age sex  size      length       width      height
    1  juv   m large          NA -0.34992735  0.10955641
    2  juv   m large -0.84160374          NA -0.41341885
    3  juv

ggplot2: show missing value colour in legend

Just wondering what is required so that the colour for missing values is shown in the legend? Looking at the example from the UseR! ggplot2 book, p. 94:

    p <- qplot(sleep_total, sleep_cycle, data=msleep, colour=vore)
    p + scale_colour_hue(na.value = "Black")
    p + scale_colour_hue("What does \nit eat?", na.value="Black",
                         breaks=c("herbi", "carni", "omni", "insecti", NA),
                         labels=c("plants", "meat", "both", "insects", "don't know"))

the data point for vore=NA is shown in the plot, but NA is not listed in the legend. Thanks

A workaround for the problem would be to replace NA values in your data with some other

Replace mean or mode for missing values in R

Question: I have a large database made up of mixed data types (numeric, character, factor, ordinal factor) with missing values, and I am trying to create a for loop that substitutes the missing values with the mean of the respective column if it is numeric, or with the mode if it is character/factor. This is what I have so far:

    # fake array:
    age <- c(5, 8, 10, 12, NA)
    a <- factor(c("aa", "bb", NA, "cc", "cc"))
    b <- c("banana", "apple", "pear", "grape", NA)
    df_test <- data.frame(age=age, a=a, b=b)
    df_test$b <- as
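The rule described above (mean for numeric columns, mode otherwise) can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R used in the question, with `None` standing in for NA; `impute_column` is a hypothetical helper, and only the `age` and `a` columns from the excerpt are reproduced because `b` has no repeated value to serve as a meaningful mode.

```python
from statistics import mean, mode

def impute_column(values):
    """Fill None entries with the mean (numeric column) or the mode (anything else)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate a fill value from
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)
    else:
        fill = mode(observed)
    return [fill if v is None else v for v in values]

# Columns mirror the "fake array" in the question, with None in place of NA.
columns = {
    "age": [5, 8, 10, 12, None],
    "a": ["aa", "bb", None, "cc", "cc"],
}
filled = {name: impute_column(col) for name, col in columns.items()}

print(filled["age"])  # [5, 8, 10, 12, 8.75]
print(filled["a"])    # ['aa', 'bb', 'cc', 'cc', 'cc']
```

Choosing the fill value per column, based on the type of the observed entries, avoids a separate branch for each data type inside the loop.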

how to insert missing observations on a data frame

I have data consisting of observations over time. Unfortunately, some large gaps of time points are missing for a treatment. The gaps are not coded as NA, which becomes apparent when I plot the data. My data frame looks like this; the number of samples per time point is irregular. (Edit: sorry for not making the example reproducible.)

    structure(list(A = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,

R — Carry last observation forward n times

Question: I am attempting to carry non-missing observations forward and populate the next two missing observations (although I imagine a solution to this problem would be broadly applicable to carrying observations forward through n rows...). In the example data frame below, I would like to carry forward (propagate) the flag_a and flag_b values for each id for two rows. Here is an example of my data with the desired output included:

    id <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
    flag_a <-
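The idea of carrying an observation forward through at most n rows can be sketched as follows. This is a minimal illustration in plain Python rather than R, using `None` for missing values; `carry_forward` and the `flags` list are hypothetical, since the question's data is cut off in the excerpt.

```python
def carry_forward(values, limit):
    """Fill each run of None with the preceding value, at most `limit` times."""
    filled = []
    last = None
    remaining = 0  # how many more gaps the last observation may still fill
    for v in values:
        if v is not None:
            last, remaining = v, limit
            filled.append(v)
        elif remaining > 0:
            remaining -= 1
            filled.append(last)
        else:
            filled.append(None)
    return filled

flags = [None, 1, None, None, None, 0, None]
result = carry_forward(flags, limit=2)
print(result)  # [None, 1, 1, 1, None, 0, 0]
```

To respect the per-id requirement, the function would be applied to each id's values separately, so that nothing is carried across an id boundary.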

Insert missing time rows into a dataframe

Let's say I have a dataframe:

    df <- data.frame(group = c('A','A','A','B','B','B'),
                     time = c(1,2,4,1,2,3),
                     data = c(5,6,7,8,9,10))

What I want to do is insert rows into the data frame wherever a time point is missing from the sequence. In the above example, I'm missing data for time = 3 in group A and for time = 4 in group B. I would essentially want to put 0s in the data column for those rows. How would I go about adding these additional rows? The goal would be:

    df <- data.frame(group = c('A','A','A','A','B','B','B','B'),
                     time = c(1,2,3,4,1,2,3,4),
                     data = c(5,6,0,7,8,9,10,0))

My real data is a couple

Missing values in scikits machine learning

Is it possible to have missing values in scikit-learn? How should they be represented? I couldn't find any documentation about that.

Missing values are simply not supported in scikit-learn. There has been discussion on the mailing list about this before, but no attempt to actually write code to handle them. Whatever you do, don't use NaN to encode missing values, since many of the algorithms refuse to handle samples containing NaNs.

The above answer is outdated; the latest release of scikit-learn has a class Imputer that does simple, per-feature missing value imputation. You can feed it

Pandas: groupby forward fill with datetime index

Question: I have a dataset that has two columns: company and value. It has a datetime index, which contains duplicates (on the same day, different companies have different values). The values have missing data, so I want to forward fill the missing data with the previous datapoint from the same company. However, I can't seem to find a good way to do this without running into odd groupby errors, suggesting that I'm doing something wrong. Toy data:

    a = pd.DataFrame({'a': [1, 2, None], 'b': [12,None,14]}
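One way to express a per-company forward fill is pandas' `groupby(...).ffill()`. The sketch below is an assumption-laden illustration rather than the answer from the original thread: the frame, its dates, and the `value_filled` column are hypothetical, built only to match the description (a `company` column, a `value` column, and a datetime index with duplicate dates).

```python
import pandas as pd

# Hypothetical data: duplicate dates in the index, one row per company per day.
index = pd.to_datetime(
    ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"]
)
df = pd.DataFrame(
    {
        "company": ["x", "y", "x", "y", "x", "y"],
        "value": [1.0, 10.0, None, 20.0, None, None],
    },
    index=index,
)

# Forward fill within each company. The result keeps the original row order,
# so it is assigned positionally to sidestep alignment on the duplicated index.
df["value_filled"] = df.groupby("company")["value"].ffill().to_numpy()

print(df["value_filled"].tolist())  # [1.0, 10.0, 1.0, 20.0, 1.0, 20.0]
```

Grouping on the `company` column instead of the index means the duplicate dates never take part in the fill logic.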

Leaving values blank if not passed in str.format

I've run into a fairly simple issue that I can't come up with an elegant solution for. I'm creating a string using str.format in a function that is passed a dict of substitutions to use for the format. I want to create the string and format it with the values if they're passed, and leave them blank otherwise. Example:

    kwargs = {"name": "mark"}
    "My name is {name} and I'm really {adjective}.".format(**kwargs)

should return "My name is mark and I'm really ." instead of throwing a KeyError (which is what would happen if we don't do anything). Embarrassingly, I can't even come up with an inelegant
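One possible approach, sketched here under the assumption that blank output for every missing key is acceptable, is to pass a `collections.defaultdict` to `str.format_map`. The `template` and `result` names are illustrative; the string and `kwargs` come from the question.

```python
from collections import defaultdict

template = "My name is {name} and I'm really {adjective}."
kwargs = {"name": "mark"}

# defaultdict(str) returns "" for any key that was not supplied.
# format_map uses the mapping directly instead of unpacking it into keyword
# arguments, so the default factory is consulted for missing fields.
result = template.format_map(defaultdict(str, kwargs))

print(result)  # My name is mark and I'm really .
```

Note that `template.format(**defaultdict(str, kwargs))` would still raise a KeyError, because `**` unpacking copies only the keys that are present into a plain dict.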