missing-data

reshape from base vs dcast from reshape2 with missing values

放肆的年华 submitted on 2019-12-31 02:45:32
Question: With this data frame,

df <- expand.grid(id="01", parameter=c("blood", "saliva"), visit=c("V1", "V2", "V3"))
df$value <- c(1:6)
df$sex <- rep("f", 6)

> df
  id parameter visit value sex
1 01     blood    V1     1   f
2 01    saliva    V1     2   f
3 01     blood    V2     3   f
4 01    saliva    V2     4   f
5 01     blood    V3     5   f
6 01    saliva    V3     6   f

When I reshape it into the "wide" format, I get identical results with both the base reshape function and the dcast function from reshape2:

reshape(df, timevar="visit", idvar=c("id", "parameter", "sex"
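The question is about R's reshape/dcast, but the same long-to-wide reshape can be sketched in pandas with pivot_table. The data frame below mirrors the df from the question; this is an illustrative equivalent, not the asker's code:

```python
import pandas as pd

# Long-format data mirroring the df in the question
df = pd.DataFrame({
    "id": ["01"] * 6,
    "parameter": ["blood", "saliva"] * 3,
    "visit": ["V1", "V1", "V2", "V2", "V3", "V3"],
    "value": [1, 2, 3, 4, 5, 6],
    "sex": ["f"] * 6,
})

# Spread "visit" into columns, keeping id/parameter/sex as the row key
wide = df.pivot_table(index=["id", "parameter", "sex"],
                      columns="visit", values="value").reset_index()
print(wide)
```

Key combinations absent from the long data simply come out as NaN in the wide table, which is where reshape and dcast can disagree on defaults.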

Pandas read_csv fills empty values with string 'nan', instead of parsing date

老子叫甜甜 submitted on 2019-12-30 06:54:47
Question: I assign np.nan to the missing values in a column of a DataFrame. The DataFrame is then written to a csv file using to_csv. The resulting csv file correctly has nothing between the commas for the missing values if I open the file with a text editor. But when I read that csv file back into a DataFrame using read_csv, the missing values become the string 'nan' instead of NaN. As a result, isnull() does not work. For example:

In [13]: df
Out[13]:
   index  value date
0    975  25.35  nan
1    976  26.28  nan
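A minimal sketch of the situation, assuming the 'nan' entries have become literal strings in an object column: they can be converted back to real missing values with replace, and a CSV round-trip can be checked in memory (read_csv treats both empty fields and the token 'nan' as NaN by default). The column names here just mirror the excerpt:

```python
import io
import numpy as np
import pandas as pd

# A column holding the literal string 'nan' rather than a real missing value
df = pd.DataFrame({"value": [25.35, 26.28], "date": ["nan", "2013-01-02"]})

# Option 1: replace the string token with a real NaN
fixed = df.replace("nan", np.nan)

# Option 2: round-trip through CSV; read_csv's default NA tokens include '' and 'nan'
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
reread = pd.read_csv(buf, parse_dates=["date"])
print(reread["date"].isnull().sum())
```

After either step, isnull() sees the missing dates again (as NaN/NaT).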

Dealing with missing values for correlations calculation

走远了吗. submitted on 2019-12-29 19:26:04
Question: I have a huge matrix with a lot of missing values, and I want the correlations between its variables.

1. Is cor(na.omit(matrix)) better than the alternative below?

cor(matrix, use = "pairwise.complete.obs")

I have already selected only the variables having more than 20% of missing values.

2. Which method makes the most sense?

Answer 1: I would vote for the second option. It sounds like you have a fair amount of missing data, so you would be looking for a sensible multiple imputation strategy to fill in
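The same listwise-vs-pairwise distinction exists in pandas, so the two R calls can be sketched side by side in Python (synthetic data, for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df.loc[::7, "a"] = np.nan  # sprinkle missing values
df.loc[::5, "b"] = np.nan

# Listwise deletion: drop any row containing a NaN first,
# analogous to cor(na.omit(matrix)) in R
corr_listwise = df.dropna().corr()

# Pairwise deletion: pandas .corr() already uses pairwise complete
# observations, analogous to cor(matrix, use = "pairwise.complete.obs")
corr_pairwise = df.corr()
print(corr_pairwise)
```

Listwise deletion discards every row with any NaN, so with widespread missingness it can throw away most of the data; pairwise deletion keeps more information per pair, at the cost of each correlation being computed on a different subset.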

Delete rows with blank values in one particular column

寵の児 submitted on 2019-12-29 02:33:10
Question: I am working on a large dataset in which some rows have NAs and others have blanks:

df <- data.frame(ID = c(1:7),
                 home_pc = c("", "CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB", "MP9 7GH", "KN4 5GH"),
                 start_pc = c(NA, "Home", "FC5 7YH", "Home", "CB3 5TH", "BV6 5PB", NA),
                 end_pc = c(NA, "CB5 4FG", "Home", "", "Home", "", NA))

How do I remove the NAs and blanks in one go (in the start_pc and end_pc columns)? In the past I have used:

df <- df[-which(is.na(df$start_pc)), ]

... to remove the NAs - is there a similar
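A pandas sketch of the "NAs and blanks in one go" idea, using the same data as the question: normalize empty strings to NaN first, then drop rows with missing values in just the two columns of interest. (The question is about R; this is an equivalent, not the asker's code.)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": range(1, 8),
    "home_pc": ["", "CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB", "MP9 7GH", "KN4 5GH"],
    "start_pc": [np.nan, "Home", "FC5 7YH", "Home", "CB3 5TH", "BV6 5PB", np.nan],
    "end_pc": [np.nan, "CB5 4FG", "Home", "", "Home", "", np.nan],
})

# Treat empty strings as missing, then drop rows with NaN
# in start_pc or end_pc only (other columns are untouched)
cleaned = df.replace("", np.nan).dropna(subset=["start_pc", "end_pc"])
print(cleaned)
```

The subset argument is what keeps the deletion scoped to the two columns; without it, dropna would also remove rows whose only blank is in home_pc.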

Sub-function in grouping function using dplyr

梦想与她 submitted on 2019-12-25 08:48:18
Question: I'm using the dplyr package to count missing values in subgroups for each of my variables. I used a mini-function to count the missing values:

NAobs <- function(x) length(x[is.na(x)])  # function to count missing data for variables

Because I have quite a few variables and wanted to add a bit more information (sample size per group, and percentage of missing data per group), I wrote the following code and inserted one variable (task_1) to check it:

library(dplyr)
group_by(DataRT, class) %>%
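The per-group summary the question is after (sample size, NA count, percent missing) maps onto a pandas groupby-aggregate. The DataRT/class/task_1 names are taken from the excerpt; the data itself is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the DataRT data frame in the question
DataRT = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "task_1": [1.0, np.nan, 2.0, np.nan, np.nan],
})

# Per group: sample size, NA count, and percent missing for task_1
summary = DataRT.groupby("class")["task_1"].agg(
    n="size",
    n_missing=lambda x: x.isna().sum(),
    pct_missing=lambda x: 100 * x.isna().mean(),
).reset_index()
print(summary)
```

This is the same shape as a dplyr group_by() %>% summarise() pipeline: one row per group, one column per summary statistic.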

Time series Data Missing Time values and Data values

你离开我真会死。 submitted on 2019-12-25 06:30:41
Question: I have the following time-series dataset sample:

ymd      rf
19820103 3
19820104 9
19820118 4
19820119 2
19820122 0
19820218 5

The dataset is supposed to be organized as a daily time series. More specifically, ymd is supposed to range continuously from 19820101 through 19820230. However, as you can see from the sample above, the dataset is not continuous and does not contain days such as "19820101" and "19820102". For these dates where the dataset is unavailable, I'd like to
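Making a gappy daily series continuous is a reindex in pandas: parse the ymd integers as dates, then reindex onto an unbroken calendar so the absent days appear as NaN. A sketch using the sample data (the end date here is taken as 1982-02-28, since February 1982 has 28 days):

```python
import pandas as pd

df = pd.DataFrame({
    "ymd": [19820103, 19820104, 19820118, 19820119, 19820122, 19820218],
    "rf": [3, 9, 4, 2, 0, 5],
})

# Parse ymd, then reindex onto an unbroken daily calendar;
# dates absent from the data become NaN
df["date"] = pd.to_datetime(df["ymd"], format="%Y%m%d")
full = pd.date_range("1982-01-01", "1982-02-28", freq="D")
daily = df.set_index("date")["rf"].reindex(full)
print(daily.isna().sum())
```

From here the NaN rows can be filled however the analysis requires (zeros, interpolation, forward fill, etc.).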

Little's MCAR Test in R BaylorEdPsych package does not work

社会主义新天地 submitted on 2019-12-24 17:44:06
Question: Okay, so here's the deal. I have to use the BaylorEdPsych package in R to test whether my dataset is MCAR or not. I ran the LittleMCAR function on the sample dataset (EndersTable1_1) and it worked flawlessly. When I try to run my own dataset through the function I get this error:

Error in eigen(sampmat, symmetric = TRUE) : infinite or missing values in 'x'

I don't understand why this would throw an error when my dataset conforms to the structure of the sample data.

NaN in data frame: when first observation of time series is NaN, frontfill with first available, otherwise carry over last / previous observation

孤街醉人 submitted on 2019-12-24 17:33:56
Question: I am performing an ADF test from statsmodels. The value series can have missing observations. In fact, I drop the analysis if the fraction of NaNs is larger than c. However, if the series makes it through, I get the problem that adfuller cannot deal with missing data. Since this is training data with a minimum frame size, I would like to do:

1) if x(t=0) = NaN, then find the next non-NaN value (t > 0)
2) otherwise, if x(t) = NaN, then x(t) = x(t-1)

So I am compromising here my
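The two rules above (back-fill a leading NaN from the first available value, otherwise carry the previous observation forward) correspond to a forward fill followed by a backward fill in pandas. A minimal sketch on a toy series:

```python
import numpy as np
import pandas as pd

# Leading NaNs, plus an interior gap
x = pd.Series([np.nan, np.nan, 1.0, np.nan, 2.0, np.nan])

# ffill carries the last observation forward (rule 2);
# the subsequent bfill fills only the leading NaNs, which
# ffill could not touch, from the first available value (rule 1)
filled = x.ffill().bfill()
print(filled.tolist())
```

The order matters: doing ffill first guarantees that bfill only ever affects the leading NaNs, so interior gaps are always filled from the past, never from the future.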
