Compare characters and return mismatches in R

Posted by 十年热恋 on 2021-01-28 12:11:50

Question


I want to compare characters iteratively and return mismatches between 2 columns of a data frame.

A value such as x2x or y67y should not be returned, because the letter is unchanged (x stays x and y stays y).

Input:

x  y  x_val            y_val
A  B  x2x, y67h, d7j   x2y, y67y, d7r
B  C  x2y, y67y, d7r   x2y, y67y, d7r
C  A  x2y, y67y, d7r   x2x, y67h, d7j
C  D  x2y, y67y, d7r   y67b, g72v, b8c
D  E  y67b, g72v, b8c  y67b, g72j

I want to add a column val containing the values of y_val that do not match x_val.

Output:

x  y  x_val            y_val            val
A  B  x2x, y67h, d7j   x2y, y67y, d7r   x2y, d7r
B  C  x2y, y67y, d7r   x2y, y67y, d7r   NA
C  A  x2y, y67y, d7r   x2x, y67h, d7j   y67h, d7j
C  D  x2y, y67y, d7r   y67b, g72v, b8c  y67b, g72v, b8c
D  E  y67b, g72v, b8c  y67b, g72j       g72j

I tried xy_val <- y_val[!(y_val %in% x_val)]
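A likely reason this attempt does not give the per-value result: each cell holds one comma-separated string, so %in% compares whole cells rather than the individual values. The sketch below uses only the first row of the data and assumes the separator is always ", "; the names x_parts, y_parts, whole_cell and only_in_y are illustrative.

```r
# One row of the question's data: each cell is a single string, not a vector of values.
x_val <- "x2x, y67h, d7j"
y_val <- "x2y, y67y, d7r"

# Comparing the whole strings only reports that the two cells differ.
whole_cell <- y_val[!(y_val %in% x_val)]

# Splitting on ", " first gives one element per value, so %in% works element-wise.
x_parts <- strsplit(x_val, ", ", fixed = TRUE)[[1]]
y_parts <- strsplit(y_val, ", ", fixed = TRUE)[[1]]
only_in_y <- y_parts[!(y_parts %in% x_parts)]

print(whole_cell)  # the entire y_val string
print(only_in_y)   # "x2y" "y67y" "d7r"
```

Note that y67y is still present after this step; the "same letter" rule needs a separate filter.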

Could you please suggest how to output the mismatches?

My data:

structure(list(x = c("A", "B", "C", "C", "D"),
               y = c("B", "C", "A", "D", "E"),
               x_val = c("x2x, y67h, d7j", "x2y, y67y, d7r", "x2y, y67y, d7r",
                         "x2y, y67y, d7r", "y67b, g72v, b8c"),
               y_val = c("x2y, y67y, d7r", "x2y, y67y, d7r", "x2x, y67h, d7j",
                         "y67b, g72v, b8c", "y67b, g72j")),
          class = "data.frame", row.names = c(NA, -5L))

I appreciate your help!

Thanks


Answer 1:


With dplyr and purrr (f is the data frame from the question):

library(dplyr)
library(purrr)

f %>% mutate(diff_x = map2_chr(strsplit(x_val, split = ", "), 
                               strsplit(y_val, split = ", "), 
                               ~paste(grep('([a-z])(?>\\d+)(?!\\1)', setdiff(.x, .y), 
                                           value = TRUE, perl = TRUE), 
                                           collapse = ", ")) %>%
               replace(. == "", NA), 
             diff_y = map2_chr(strsplit(x_val, split = ", "), 
                               strsplit(y_val, split = ", "), 
                               ~paste(grep('([a-z])(?>\\d+)(?!\\1)', setdiff(.y, .x), 
                                           value = TRUE, perl = TRUE),
                                           collapse = ", ")) %>%
               replace(. == "", NA))

Notes:

  1. grep(..., value = TRUE) takes the output of setdiff and keeps only the elements that match the pattern, which drops any element of the form "same letter, digits, same letter" (such as x2x or y67y).

  2. ([a-z]) matches a single lowercase letter and captures it as group 1.

  3. (?>\\d+) is an atomic group that matches one or more digits and does not backtrack.

  4. (?!\\1) is a negative lookahead that fails the match if the next character is the letter captured by ([a-z]).
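To see the pattern from the notes in isolation, here is a small sketch that applies it to a few values taken from the question. The variable names (vals, atomic_hits, plain_hit) are illustrative only. The second call drops the atomic group to show why it matters: without it, \\d+ can backtrack from "67" to "6", so the lookahead then inspects "7" instead of the trailing letter.

```r
vals <- c("x2x", "y67y", "x2y", "y67h", "d7r")

# With the atomic group: only values whose letter changes are matched.
atomic_hits <- grepl("([a-z])(?>\\d+)(?!\\1)", vals, perl = TRUE)
names(atomic_hits) <- vals
print(atomic_hits)  # x2x and y67y are FALSE, the other three are TRUE

# Without the atomic group, "y67y" matches after backtracking, which is not wanted.
plain_hit <- grepl("([a-z])\\d+(?!\\1)", "y67y", perl = TRUE)
print(plain_hit)    # TRUE
```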

Output:

  x y           x_val           y_val    diff_x          diff_y
1 A B  x2x, y67h, d7j  x2y, y67y, d7r y67h, d7j        x2y, d7r
2 B C  x2y, y67y, d7r  x2y, y67y, d7r      <NA>            <NA>
3 C A  x2y, y67y, d7r  x2x, y67h, d7j  x2y, d7r       y67h, d7j
4 C D  x2y, y67y, d7r y67b, g72v, b8c  x2y, d7r y67b, g72v, b8c
5 D E y67b, g72v, b8c      y67b, g72j g72v, b8c            g72j
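If only the single val column from the question is needed and extra packages are not available, the same idea can be sketched in base R. This is an assumption-laden variant rather than the answer's own code: the helper name mismatches is hypothetical, the data frame is rebuilt here with just the two value columns, and values are assumed to be separated by ", ".

```r
f <- data.frame(
  x_val = c("x2x, y67h, d7j", "x2y, y67y, d7r", "x2y, y67y, d7r",
            "x2y, y67y, d7r", "y67b, g72v, b8c"),
  y_val = c("x2y, y67y, d7r", "x2y, y67y, d7r", "x2x, y67h, d7j",
            "y67b, g72v, b8c", "y67b, g72j"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: values of `to` that are absent from `from`
# and whose letter before the digits differs from the letter after them.
mismatches <- function(from, to) {
  new_vals <- setdiff(strsplit(to, ", ", fixed = TRUE)[[1]],
                      strsplit(from, ", ", fixed = TRUE)[[1]])
  changed <- grep("([a-z])(?>\\d+)(?!\\1)", new_vals, value = TRUE, perl = TRUE)
  if (length(changed) == 0) NA_character_ else paste(changed, collapse = ", ")
}

# mapply walks the two columns in parallel, one row at a time.
f$val <- unname(mapply(mismatches, f$x_val, f$y_val))
print(f$val)
```

For this data the result agrees with the val column requested in the question and with diff_y above.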



Answer 2:


Does this deliver the desired results?

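check_this = function(temp_data)
{
  # Show the row being processed (debugging aid)
  print(temp_data)

  # Replace the ", " separators with single spaces
  string_1 = gsub(", ", " ", temp_data["x_val"])
  string_2 = gsub(", ", " ", temp_data["y_val"])

  # Build alternation patterns such as "x2y|y67y|d7r"
  string_sub_1 = gsub(" ", "|", string_1)
  string_sub_2 = gsub(" ", "|", string_2)

  # Delete every value that also occurs in the other column, then trim leftover spaces
  unmatched_s1 = trimws(gsub(string_sub_2, "", string_1))
  unmatched_s2 = trimws(gsub(string_sub_1, "", string_2))

  # Return both as a list; if you only need unmatchedy_in_x, use return(unmatched_s2) instead
  return(list(unmatchedx_in_y = unmatched_s1, unmatchedy_in_x = unmatched_s2))

}

# One list per row of f
res = apply(f, 1, check_this)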
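The function relies on turning one cell into a regex alternation and deleting its matches from the other cell. A one-row walk-through, using the last row of the question's data, may make that easier to follow; the names pattern_1, pattern_2, only_in_x and only_in_y are illustrative. This sketch only removes shared values and does not apply the "same letter" rule from the question.

```r
# Last row of the data, with ", " already replaced by single spaces.
string_1 <- "y67b g72v b8c"   # x_val
string_2 <- "y67b g72j"       # y_val

# Each cell becomes an alternation pattern, e.g. "y67b|g72v|b8c".
pattern_1 <- gsub(" ", "|", string_1)
pattern_2 <- gsub(" ", "|", string_2)

# Deleting the other cell's values leaves only the unmatched ones;
# trimws removes the space left behind by a deleted value.
only_in_y <- trimws(gsub(pattern_1, "", string_2))
only_in_x <- trimws(gsub(pattern_2, "", string_1))

print(only_in_y)  # "g72j"
print(only_in_x)  # "g72v b8c"
```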


Source: https://stackoverflow.com/questions/55284306/compare-characters-and-return-mismatches-in-r
