Find unique rows in a data frame when some rows contain NA?

Question

I have a data frame like the one below and would like to find its unique rows. The data contains NA values, though. If all the non-NA values in a row with NA are the same as the values in another row (like rows 1, 2, 5), I want to ignore that row; if they are not the same (like rows 2, 4), I want to keep it as a unique row. For example, in rows 1, 2 and 6 all values except the NAs are the same, so because the NAs could stand for '1 and 3' I would like to remove the rows with NA and keep only row 2. Likewise, in row 6 the values 2 and 3 (excluding the NAs) are the same as in rows c2 and c5, and the NAs in c6 could take the same values as in c2 and c5, so this row is not unique.

Also, @Sotos's solution helped me a lot, but in the last part, after removing the NAs and building a pattern for each row, it produces the same pattern (23) for c8 and c6 and removes both of them. But c8 is actually unique.
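
To make the c6/c8 problem concrete, here is a minimal base R sketch (the vector names `c6`, `c8`, `key6`, `key8` and `same_key` are only illustrative). Once the NAs are dropped and the remaining values are pasted together, the column each value came from is lost, so the two rows collapse to the same string.

```r
# Illustrative copies of rows c6 and c8 from the data below
c6 <- c(a1 = 2, a2 = NA, a3 = 3, a4 = NA)
c8 <- c(a1 = 2, a2 = 3, a3 = NA, a4 = NA)

# Pasting the non-NA values discards which column each value belonged to
key6 <- paste(na.omit(c6), collapse = "")
key8 <- paste(na.omit(c8), collapse = "")

# Both keys come out identical although the rows differ in a2 and a3
same_key <- identical(key6, key8)
```

Any comparison that works on these pasted keys cannot tell the two rows apart, which suggests comparing values column by column instead.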

data:

      a1  a2   a3   a4
c1    2    1    3   NA
c2    2    1    3    3
c3    2    1    4    3
c4    2    2    3   NA
c5    2    1    3    3
c6    2    NA   3   NA
c7    2    NA   0   NA
c8    2    3   NA   NA

I would like to have this output:

output:

     a1    a2  a3   a4
c2    2    1    3    3
c3    2    1    4    3
c4    2    2    3   NA
c7    2    NA   0   NA
c8    2    3   NA   NA

Answer 1:

library(stringr) 
df <- unique(df)
#paste rows omitting NAs
df$new <- apply(df, 1, function(i) paste(na.omit(i), collapse = ''))
#use str_detect to determine whether each pattern is found more than once
df$new2 <- rowSums(sapply(df$new, function(i) str_detect(i, df$new)))
new_df <- subset(df, df$new2 == 1)
new_df <- new_df[, !names(new_df) %in% c('new', 'new2')]
new_df
#   V2 V3 V4 V5
#2  2  1  3  3
#3  2  1  4  3
#4  2  2  3 NA

Testing the code with the additional row as per your comment:

new_df
#   a1 a2 a3 a4
#c2  2  1  3  3
#c3  2  1  4  3
#c4  2  2  3 NA
#c7  2 NA  0 NA

DATA

dput(df)
structure(list(a1 = c(2L, 2L, 2L, 2L, 2L, 2L, 2L), a2 = c(1L, 
1L, 1L, 2L, 1L, NA, NA), a3 = c(3L, 3L, 4L, 3L, 3L, 3L, 0L), 
    a4 = c(NA, 3L, 3L, NA, 3L, NA, NA)), .Names = c("a1", "a2", 
"a3", "a4"), class = "data.frame", row.names = c("c1", "c2", 
"c3", "c4", "c5", "c6", "c7"))

Answer 2:

My solution would be to:

1) Take all unique rows that do not contain an NA.

2) Among the rows that do contain NAs, check whether their remaining values are identical to those of another row.

Reproduce data

df<-data.frame(V1 = rep(2,times = 6),
    V2 = c(1,1,1,2,1,NA),
    V3=c(3,3,4,3,3,3),
    V4=c(NA,3,3,NA,3,NA))

Create two de-duplicated data frames (one with NAs, the other without)

df1<-unique(df[apply(df,MARGIN=1,FUN =function(z) sum(is.na(z)))==0,])
df2<-unique(df[apply(df,MARGIN=1,FUN =function(z) sum(is.na(z)))>0,])
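
As a possible alternative for the same split, base R's `complete.cases()` flags the rows without any NA directly. A minimal sketch on the same data (the name `full` is only illustrative):

```r
# Same example data as above
df <- data.frame(V1 = rep(2, times = 6),
                 V2 = c(1, 1, 1, 2, 1, NA),
                 V3 = c(3, 3, 4, 3, 3, 3),
                 V4 = c(NA, 3, 3, NA, 3, NA))

# TRUE for rows that contain no NA at all
full <- complete.cases(df)

df1 <- unique(df[full, ])    # de-duplicated rows without NAs
df2 <- unique(df[!full, ])   # de-duplicated rows with at least one NA
```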

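Add the rows with NAs that meet your condition (no other row agrees with their non-NA values)

for (i in seq_len(nrow(df2))) {
  vec <- df2[i, ]
  w <- as.vector(is.na(vec))  # TRUE for the columns that are NA in this row
  # Compare on the non-NA columns only; merge() joins on the shared column names
  matches <- merge(vec[, !w, drop = FALSE], df1[, !w, drop = FALSE])
  if (nrow(matches) == 0) {   # keep the row only if no kept row agrees with it
    df1 <- rbind(df1, vec)
  }
}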
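
For the full eight-row data from the question, a position-aware comparison can also be sketched in base R without pasting values into strings. This is only a sketch under the assumption that a row should be dropped when every one of its non-NA values equals the value in the same column of some other row; `is_covered`, `u`, `keep` and `result` are hypothetical names, not taken from either answer.

```r
# Full example data from the question (rows c1..c8)
df <- data.frame(
  a1 = c(2, 2, 2, 2, 2, 2, 2, 2),
  a2 = c(1, 1, 1, 2, 1, NA, NA, 3),
  a3 = c(3, 3, 4, 3, 3, 3, 0, NA),
  a4 = c(NA, 3, 3, NA, 3, NA, NA, NA),
  row.names = paste0("c", 1:8)
)

# Hypothetical helper: TRUE if every non-NA value of x equals the value
# of y in the same column (an NA in y at that column never counts as equal)
is_covered <- function(x, y) {
  known <- !is.na(x)
  all(!is.na(y[known]) & x[known] == y[known])
}

# Drop exact duplicates first (c5 duplicates c2), then work on a matrix
u <- as.matrix(unique(df))

# Keep a row only if no other remaining row covers it
keep <- vapply(seq_len(nrow(u)), function(i) {
  others <- setdiff(seq_len(nrow(u)), i)
  !any(vapply(others, function(j) is_covered(u[i, ], u[j, ]), logical(1)))
}, logical(1))

result <- df[rownames(u)[keep], ]
```

On this data the sketch keeps c2, c3, c4, c7 and c8, which is the output requested in the question. It compares every pair of rows, so it is meant for small data frames rather than large ones.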

Source: https://stackoverflow.com/questions/36327084/find-uniqueness-in-data-frame-withe-rows-na

Tags

dataframe

unique

uniqueidentifier

substitution