Fuzzy and exact match of two databases

春和景丽 2021-01-24 13:18

I have two databases. The first has about 70k rows and 3 columns; the second has about 790k rows and 2 columns. Both share a common variable, grantee_name.

2 answers
  • 2021-01-24 13:47

    I haven't used foreach before, but maybe the variable x already holds the individual rows of zz1?

    Have you tried:

    stringdist_inner_join(x, zz2, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance")

    ?

  • 2021-01-24 14:00

    If you split (with base::split or dplyr::group_split) your uniquegrantees data frame into a list of data frames, then you can call purrr::map on the list. (map is pretty much lapply)

    purrr::map(list_of_dfs, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))

    Your result will be a list of data frames each fuzzyjoined with filings. You can then call bind_rows (or you could do map_dfr) to get all the results in the same data frame again.

    See R - Splitting a large dataframe into several smaller dateframes, performing fuzzyjoin on each one and outputting to a single dataframe
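
The split, map and bind steps described in the second answer can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy rows, the grant_id and filing_id columns, and chunk_size are all made up for the example, and the real tables would be the 70k-row and 790k-row data frames from the question.

```r
library(dplyr)
library(purrr)
library(fuzzyjoin)

# Hypothetical stand-ins for the two tables; only grantee_name comes from the question.
uniquegrantees <- tibble(
  grantee_name = c("Red Cross Society", "Habitat for Humanity",
                   "Green Peace Fund", "Oxfam America"),
  grant_id = 1:4
)
filings <- tibble(
  grantee_name = c("Red Cross Society", "Habitat for Humanty", "Unrelated Org"),
  filing_id = c(101, 102, 103)
)

# Assumed chunk size; in practice, pick one that fits in memory.
chunk_size <- 2

# Split the smaller table into a list of data frames of at most chunk_size rows.
list_of_dfs <- split(
  uniquegrantees,
  ceiling(seq_len(nrow(uniquegrantees)) / chunk_size)
)

# Fuzzy-join each chunk against filings, then stack the per-chunk results.
matches <- map(
  list_of_dfs,
  ~ stringdist_inner_join(.x, filings, by = "grantee_name", method = "jw",
                          p = 0.1, max_dist = 0.1, distance_col = "distance")
) %>%
  bind_rows()

print(matches)
```

Rows with distance equal to 0 are exact matches, so one pass can cover both the exact and the fuzzy case, and filtering on the distance column separates them afterwards. Chunking is meant to limit how many candidate pairs are compared at once; whether it is enough for the full 790k-row table is something to measure rather than assume.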
