How to perform approximate (fuzzy) name matching in R

前端 未结 2 1301
爱一瞬间的悲伤
爱一瞬间的悲伤 2021-02-06 02:09

I have a large data set, dedicated to biological journals, which was being composed for a long time by different people. So, the data are not in a single format. For example, in

2条回答
  •  我在风中等你
    2021-02-06 02:26

    There are packages that can help you with this, and some are listed in the comments. But, if you don't want to use these, I though I'd try to write something in R that might help you. The code will match "John Smith" with "J Smith", "John Smith", "Smith John", "John S". Meanwhile, it won't match something like "John Sally".

    # generate some random names
    names = c(
      "John Smith", 
      "Wigberht Ernust",
      "Samir Henning",
      "Everette Arron",
      "Erik Conor",
      "Smith J",
      "Smith John",
      "John S",
      "John Sally"
    );
    
    # split those names and get all ways to write that name
    split_names = lapply(
      X = names,
      FUN = function(x){
        print(x);
        # split by a space
        c_split = unlist(x = strsplit(x = x, split = " "));
        # get both combinations of c_split to compensate for order
        c_splits = list(c_split, rev(x = c_split));
        # return c_splits
        c_splits;
      }
    )
    
    # suppose we're looking for John Smith
    search_for = "John Smith";
    
    # split it by " " and then find all ways to write that name
    search_for_split = unlist(x = strsplit(x = x, split = " "));
    search_for_split = list(search_for_split, rev(x = search_for_split));
    
    # initialise a vector containing if search_for was matched in names
    match_statuses = c();
    
    # for each name that's been split
    for(i in 1:length(x = names)){
    
      # the match status for the current name
      match_status = FALSE;
    
      # the current split name
      c_split_name = split_names[[i]];
    
      # for each element in search_for_split
      for(j in 1:length(x = search_for_split)){
    
        # the current combination of name
        c_search_for_split_names = search_for_split[[j]];
    
        # for each element in c_split_name
        for(k in 1:length(x = c_split_name)){
    
          # the current combination of current split name
          c_c_split_name = c_split_name[[k]];
    
          # if there's a match, or the length of grep (a pattern finding function is
          # greater than zero)
          if(
            # is c_search_for_split_names first element in c_c_split_name first
            # element
            length(
              x = grep(
                pattern = c_search_for_split_names[1],
                x = c_c_split_name[1]
              )
            ) > 0 &&
            # is c_search_for_split_names second element in c_c_split_name second 
            # element
            length(
              x = grep(
                pattern = c_search_for_split_names[2],
                x = c_c_split_name[2]
              )
            ) > 0 ||
            # or, is c_c_split_name first element in c_search_for_split_names first 
            # element
            length(
              x = grep(
                pattern = c_c_split_name[1],
                x = c_search_for_split_names[1]
              )
            ) > 0 &&
            # is c_c_split_name second element in c_search_for_split_names second 
            # element
            length(
              x = grep(
                pattern = c_c_split_name[2],
                x = c_search_for_split_names[2]
              )
            ) > 0
          ){
            # if this is the case, update match status to TRUE
            match_status = TRUE;
          } else {
            # otherwise, don't update match status
          }
        }
      }
    
      # append match_status to the match_statuses list
      match_statuses = c(match_statuses, match_status);
    }
    
    search_for;
    
    [1] "John Smith"
    
    cbind(names, match_statuses);
    
         names             match_statuses
    [1,] "John Smith"      "TRUE"        
    [2,] "Wigberht Ernust" "FALSE"       
    [3,] "Samir Henning"   "FALSE"       
    [4,] "Everette Arron"  "FALSE"       
    [5,] "Erik Conor"      "FALSE"       
    [6,] "Smith J"         "TRUE"        
    [7,] "Smith John"      "TRUE"        
    [8,] "John S"          "TRUE"
    [9,] "John Sally"      "FALSE"   
    

    Hopefully this code can serve as a starting point, and you may wish to adjust it to work with names of arbitrary length.

    Some notes:

    • for loops in R can be slow. If you're working with lots of names, look into Rcpp.

    • You may wish to wrap this in a function. Then, you can apply this for different names by adjusting search_for.

    • There are time complexity issues with this example, and depending on the size of your data, you may want/need to rework it.

提交回复
热议问题