R Function to identify non-matching rows

问题

I am trying to compare 2 data.frames, "V1" represents my CRM, "V2" represents Leads that I would like to send out.

'V1 has roughly 8k elements' 'V2 has roughly 25k elements'

I need to compare every row in V2 to every row in V1, discard every instance where a V2 element exists in V1.

I would then like to return only the elements that do not appear either exactly or loosely in V1 into the Leads column.

The goal is to send out a lead(V2) that does not exist in CRM(V1).

I've made some good progress with the stringdist package and divided 'soundex' by 'osa' to better my chances although this method still returns elements in V1.:(

This is the expected result I'm looking for in the Leads column, based on this example:

Leads: J.Jones Restoration A.W. Builders C&C Contractors

Any help would be greatly appreciated and I apologize if this is unclear in any way.

library(reprex)
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
library(tidystringdist)

df <- tibble::tribble(
  ~V1,  ~V2,
  "5th Generation Builder", "5th Generation Builder, LLC",
  "5th Generation Builders Inc.",   "5th Generation Builders",
  "89 Contractors LLC", "89 Contractors LLC",
  "906 Studio Architects LLC",  "906 Studio Architects",
  "A & A Glass Co.",    "Paragon Const.",
  "A & E Farm", "A & E Farm",
  "A & H GLASS",    "C & C Contractors",
  "A & J Homeworks,Painting, and Restoration",  "A.W. Builders",
  "Paragon Const.", "J. Jones Restoration",
  "A & L Construction", "A & L Const.")

tidy_e <- tidy_stringdist(df) %>% 
  filter(soundex>=1) %>% 
  select(-V1, V2) %>% 
  arrange(V2,osa) %>% 
  mutate(V2, sim = soundex/ osa) %>% 
  distinct(V2, osa, soundex, sim) %>% 
  rename('Leads'= 'V2')

^{Created on 2020-04-13 by the reprex package (v0.3.0)}

回答1:

You can use the fuzzyjoin package, designed for joining tables based on inexact matching such as string distance. (Disclaimer is that I'm the maintainer).

If you had your data in two separate tables V1 and V2:

V1 <- tibble(name = c("5th Generation Builder", "5th Generation Builders Inc.", "89 Contractors LLC", 
                      "906 Studio Architects LLC", "A & A Glass Co.", "A & E Farm", 
                      "A & H GLASS", "A & J Homeworks,Painting, and Restoration", "Paragon Const.", 
                      "A & L Construction"))

V2 <- tibble(name = c("5th Generation Builder, LLC", "5th Generation Builders", "89 Contractors LLC", 
                      "906 Studio Architects", "Paragon Const.", "A & E Farm", "C & C Contractors", 
                      "A.W. Builders", "J. Jones Restoration", "A & L Const."))

Then you could use stringdist_anti_join() to find the ones in V2 that don't have a soundex match in V1:

V2 %>%
  stringdist_anti_join(V1, by = "name", method = "soundex")

Results:

# A tibble: 3 x 1
  name                
  <chr>               
1 C & C Contractors   
2 A.W. Builders       
3 J. Jones Restoration

See this vignette for more on the stringdist_ joins.

Note that if you'd wanted to see which each matched to, you could use stringdist_left_join():

V2 %>%
  stringdist_left_join(V1, by = "name", method = "soundex")

# A tibble: 12 x 2
   name.x                      name.y                      
   <chr>                       <chr>                       
 1 5th Generation Builder, LLC 5th Generation Builder      
 2 5th Generation Builder, LLC 5th Generation Builders Inc.
 3 5th Generation Builders     5th Generation Builder      
 4 5th Generation Builders     5th Generation Builders Inc.
 5 89 Contractors LLC          89 Contractors LLC          
 6 906 Studio Architects       906 Studio Architects LLC   
 7 Paragon Const.              Paragon Const.              
 8 A & E Farm                  A & E Farm                  
 9 C & C Contractors           NA                          
10 A.W. Builders               NA                          
11 J. Jones Restoration        NA                          
12 A & L Const.                A & L Construction

来源：https://stackoverflow.com/questions/61196785/r-function-to-identify-non-matching-rows

标签

tidyverse

stringdist