missing-data

reshape from base vs dcast from reshape2 with missing values

放肆的年华 submitted on 2019-12-31 02:45:32
Question: With this data frame,

df <- expand.grid(id="01", parameter=c("blood", "saliva"), visit=c("V1", "V2", "V3"))
df$value <- c(1:6)
df$sex <- rep("f", 6)

> df
  id parameter visit value sex
1 01     blood    V1     1   f
2 01    saliva    V1     2   f
3 01     blood    V2     3   f
4 01    saliva    V2     4   f
5 01     blood    V3     5   f
6 01    saliva    V3     6   f

When I reshape it into the "wide" format, I get identical results with both the base reshape function and the dcast function from reshape2:

reshape(df, timevar="visit", idvar=c("id", "parameter", "sex"
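The question is about R's reshape/dcast, but the same long-to-wide reshape can be sketched in pandas with pivot_table. The data frame below mirrors the df from the question; this is an illustrative equivalent, not the asker's code:

```python
import pandas as pd

# Long-format data mirroring the df in the question
df = pd.DataFrame({
    "id": ["01"] * 6,
    "parameter": ["blood", "saliva"] * 3,
    "visit": ["V1", "V1", "V2", "V2", "V3", "V3"],
    "value": [1, 2, 3, 4, 5, 6],
    "sex": ["f"] * 6,
})

# Spread "visit" into columns, keeping id/parameter/sex as the row key
wide = df.pivot_table(index=["id", "parameter", "sex"],
                      columns="visit", values="value").reset_index()
print(wide)
```

Key combinations absent from the long data simply come out as NaN in the wide table, which is where reshape and dcast can disagree on defaults.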

Pandas read_csv fills empty values with string 'nan', instead of parsing date

老子叫甜甜 submitted on 2019-12-30 06:54:47
Question: I assign np.nan to the missing values in a column of a DataFrame. The DataFrame is then written to a csv file using to_csv. The resulting csv file correctly has nothing between the commas for the missing values if I open the file with a text editor. But when I read that csv file back into a DataFrame using read_csv, the missing values become the string 'nan' instead of NaN. As a result, isnull() does not work. For example:

In [13]: df
Out[13]:
   index  value date
0    975  25.35  nan
1    976  26.28  nan
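A minimal sketch of the situation, assuming the 'nan' entries have become literal strings in an object column: they can be converted back to real missing values with replace, and a CSV round-trip can be checked in memory (read_csv treats both empty fields and the token 'nan' as NaN by default). The column names here just mirror the excerpt:

```python
import io
import numpy as np
import pandas as pd

# A column holding the literal string 'nan' rather than a real missing value
df = pd.DataFrame({"value": [25.35, 26.28], "date": ["nan", "2013-01-02"]})

# Option 1: replace the string token with a real NaN
fixed = df.replace("nan", np.nan)

# Option 2: round-trip through CSV; read_csv's default NA tokens include '' and 'nan'
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
reread = pd.read_csv(buf, parse_dates=["date"])
print(reread["date"].isnull().sum())
```

After either step, isnull() sees the missing dates again (as NaN/NaT).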

Dealing with missing values for correlations calculation

走远了吗. submitted on 2019-12-29 19:26:04
Question: I have a huge matrix with a lot of missing values, and I want the correlations between its variables.

1. Is cor(na.omit(matrix)) better than the alternative below?

cor(matrix, use = "pairwise.complete.obs")

I have already selected only the variables having more than 20% of missing values.

2. Which method makes the most sense?

Answer 1: I would vote for the second option. It sounds like you have a fair amount of missing data, so you would be looking for a sensible multiple imputation strategy to fill in
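The same listwise-vs-pairwise distinction exists in pandas, so the two R calls can be sketched side by side in Python (synthetic data, for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df.loc[::7, "a"] = np.nan  # sprinkle missing values
df.loc[::5, "b"] = np.nan

# Listwise deletion: drop any row containing a NaN first,
# analogous to cor(na.omit(matrix)) in R
corr_listwise = df.dropna().corr()

# Pairwise deletion: pandas .corr() already uses pairwise complete
# observations, analogous to cor(matrix, use = "pairwise.complete.obs")
corr_pairwise = df.corr()
print(corr_pairwise)
```

Listwise deletion discards every row with any NaN, so with widespread missingness it can throw away most of the data; pairwise deletion keeps more information per pair, at the cost of each correlation being computed on a different subset.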

Delete rows with blank values in one particular column

寵の児 submitted on 2019-12-29 02:33:10
Question: I am working on a large dataset in which some rows have NAs and others have blanks:

df <- data.frame(ID = c(1:7),
                 home_pc = c("", "CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB", "MP9 7GH", "KN4 5GH"),
                 start_pc = c(NA, "Home", "FC5 7YH", "Home", "CB3 5TH", "BV6 5PB", NA),
                 end_pc = c(NA, "CB5 4FG", "Home", "", "Home", "", NA))

How do I remove the NAs and blanks in one go (in the start_pc and end_pc columns)? In the past I have used:

df <- df[-which(is.na(df$start_pc)), ]

... to remove the NAs - is there a similar
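A pandas sketch of the "NAs and blanks in one go" idea, using the same data as the question: normalize empty strings to NaN first, then drop rows with missing values in just the two columns of interest. (The question is about R; this is an equivalent, not the asker's code.)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": range(1, 8),
    "home_pc": ["", "CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB", "MP9 7GH", "KN4 5GH"],
    "start_pc": [np.nan, "Home", "FC5 7YH", "Home", "CB3 5TH", "BV6 5PB", np.nan],
    "end_pc": [np.nan, "CB5 4FG", "Home", "", "Home", "", np.nan],
})

# Treat empty strings as missing, then drop rows with NaN
# in start_pc or end_pc only (other columns are untouched)
cleaned = df.replace("", np.nan).dropna(subset=["start_pc", "end_pc"])
print(cleaned)
```

The subset argument is what keeps the deletion scoped to the two columns; without it, dropna would also remove rows whose only blank is in home_pc.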

Sub-function in grouping function using dplyr

梦想与她 submitted on 2019-12-25 08:48:18
Question: I'm using the dplyr package to count missing values in subgroups for each of my variables. I used a mini-function to count the missing values:

NAobs <- function(x) length(x[is.na(x)])  # function to count missing data for variables

Because I have quite a few variables and wanted to add a bit more information (sample size per group, and percentage of missing data per group), I wrote the following code and inserted one variable (task_1) to check it:

library(dplyr)
group_by(DataRT, class) %>%
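The per-group summary the question is after (sample size, NA count, percent missing) maps onto a pandas groupby-aggregate. The DataRT/class/task_1 names are taken from the excerpt; the data itself is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the DataRT data frame in the question
DataRT = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "task_1": [1.0, np.nan, 2.0, np.nan, np.nan],
})

# Per group: sample size, NA count, and percent missing for task_1
summary = DataRT.groupby("class")["task_1"].agg(
    n="size",
    n_missing=lambda x: x.isna().sum(),
    pct_missing=lambda x: 100 * x.isna().mean(),
).reset_index()
print(summary)
```

This is the same shape as a dplyr group_by() %>% summarise() pipeline: one row per group, one column per summary statistic.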

Time series Data Missing Time values and Data values

你离开我真会死。 submitted on 2019-12-25 06:30:41
Question: I have the following time-series dataset sample:

ymd      rf
19820103 3
19820104 9
19820118 4
19820119 2
19820122 0
19820218 5

The dataset is supposed to be organized as a daily time series. More specifically, ymd is supposed to range continuously from 19820101 through 19820230. However, as you can see from the sample above, the dataset is not continuous and does not contain days such as "19820101" and "19820102". For these dates where the dataset is unavailable, I'd like to
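Making a gappy daily series continuous is a reindex in pandas: parse the ymd integers as dates, then reindex onto an unbroken calendar so the absent days appear as NaN. A sketch using the sample data (the end date here is taken as 1982-02-28, since February 1982 has 28 days):

```python
import pandas as pd

df = pd.DataFrame({
    "ymd": [19820103, 19820104, 19820118, 19820119, 19820122, 19820218],
    "rf": [3, 9, 4, 2, 0, 5],
})

# Parse ymd, then reindex onto an unbroken daily calendar;
# dates absent from the data become NaN
df["date"] = pd.to_datetime(df["ymd"], format="%Y%m%d")
full = pd.date_range("1982-01-01", "1982-02-28", freq="D")
daily = df.set_index("date")["rf"].reindex(full)
print(daily.isna().sum())
```

From here the NaN rows can be filled however the analysis requires (zeros, interpolation, forward fill, etc.).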

Little's MCAR Test in R BaylorEdPsych package does not work

社会主义新天地 submitted on 2019-12-24 17:44:06
Question: Okay, so here's the deal. I have to use the BaylorEdPsych package in R to test whether my dataset is MCAR or not. I ran the LittleMCAR function on the sample dataset (EndersTable1_1) and it worked flawlessly. When I try to run my own dataset through the function I get this error:

Error in eigen(sampmat, symmetric = TRUE) : infinite or missing values in 'x'

I don't understand why this would throw an error when my dataset conforms to the structure of the sample data.

NaN in data frame: when first observation of time series is NaN, frontfill with first available, otherwise carry over last / previous observation

孤街醉人 submitted on 2019-12-24 17:33:56
Question: I am performing an ADF test from statsmodels. The value series can have missing observations. In fact, I drop the analysis if the fraction of NaNs is larger than c. However, if the series makes it through, I get the problem that adfuller cannot deal with missing data. Since this is training data with a minimum frame size, I would like to do:

1) if x(t=0) = NaN, then find the next non-NaN value (t > 0)
2) otherwise, if x(t) = NaN, then x(t) = x(t-1)

So I am compromising here my
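The two rules above (back-fill a leading NaN from the first available value, otherwise carry the previous observation forward) correspond to a forward fill followed by a backward fill in pandas. A minimal sketch on a toy series:

```python
import numpy as np
import pandas as pd

# Leading NaNs, plus an interior gap
x = pd.Series([np.nan, np.nan, 1.0, np.nan, 2.0, np.nan])

# ffill carries the last observation forward (rule 2);
# the subsequent bfill fills only the leading NaNs, which
# ffill could not touch, from the first available value (rule 1)
filled = x.ffill().bfill()
print(filled.tolist())
```

The order matters: doing ffill first guarantees that bfill only ever affects the leading NaNs, so interior gaps are always filled from the past, never from the future.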
