Is there a way to measure string similarity in Google BigQuery?

礼貌的吻别 2020-12-03 15:35

I'm wondering if anyone knows of a way to measure string similarity in BigQuery.

It seems like it would be a neat function to have.

My case is that I need to compare

7 Answers
  • 2020-12-03 16:12

    While looking into Felipe's answer, I worked on my own query and ended up with two versions: one I call string approximation and the other string resemblance.

    The first looks at the shortest distance between matching letters of the source string and the test string, and returns a score between 0 and 1, where 1 is a complete match. It always scores against the longer of the two strings. It turns out to return results similar to the Levenshtein distance.

    #standardSql
    CREATE OR REPLACE FUNCTION `myproject.func.stringApproximation`(sourceString STRING, testString STRING) AS (
    (select avg(best_result) from (
                                  select if(length(testString)<length(sourceString), sourceoffset, testoffset) as ref, 
                                  case 
                                    when min(result) is null then 0
                                    else 1 / (min(result) + 1) 
                                  end as best_result,
                                  from (
                                           select *,
                                                  if(source = test, abs(sourceoffset - (testoffset)),
                                                  greatest(length(testString),length(sourceString))) as result
                                           from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
                                                    cross join
                                                (select *
                                                 from unnest(split(lower(testString),'')) as test with offset as testoffset)
                                           ) as results
                                  group  by ref
                                     )
            )
    );
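
    For reference, the same scoring can be sketched in plain JavaScript, which makes the logic easier to trace step by step. This is an illustrative port of the SQL above as I read it, not a drop-in replacement; returning null for two empty strings is an assumption of the sketch.

```javascript
// Illustrative port of the stringApproximation SQL above.
// Assumption: two empty strings return null (the SQL does not define this case here).
function stringApproximation(sourceString, testString) {
  const source = Array.from(sourceString.toLowerCase());
  const test = Array.from(testString.toLowerCase());
  const longest = Math.max(source.length, test.length);
  if (longest === 0) return null;

  // The score is computed per position of the longer string
  // (the test string when both have the same length, as in the SQL).
  const refIsSource = test.length < source.length;
  const [refChars, otherChars] = refIsSource ? [source, test] : [test, source];

  let total = 0;
  refChars.forEach((ch, i) => {
    // A character with no match anywhere is penalised with the longest length.
    let best = longest;
    otherChars.forEach((other, j) => {
      if (other === ch) best = Math.min(best, Math.abs(i - j));
    });
    total += 1 / (best + 1); // distance 0 scores 1, larger distances score less
  });
  return total / refChars.length;
}

console.log(stringApproximation('benjamin artis', 'benjamin')); // roughly 0.6126
```

    Each position contributes 1 / (distance + 1), so an exact positional match counts as 1 and an unmatched character counts as 1 / (longest length + 1).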
    

    The second is a variation of the first: it looks at sequences of matching distances, so a character that matches at the same distance as the character preceding or following it counts as one point. This works quite well, better than string approximation, but not quite as well as I would like (see the example output below).

    #standardSQL
    CREATE OR REPLACE FUNCTION `myproject.func.stringResemblance`(sourceString STRING, testString STRING) AS (
    (
    select avg(sequence)
    from (
          select ref,
                 if(array_length(array(select * from comparison.collection intersect distinct
                                       (select * from comparison.before))) > 0
                        or array_length(array(select * from comparison.collection intersect distinct
                                              (select * from comparison.after))) > 0
                     , 1, 0) as sequence
    
          from (
                   select ref,
                          collection,
                          lag(collection) over (order by ref)  as before,
                          lead(collection) over (order by ref) as after
                   from (
                         select if(length(testString) < length(sourceString), sourceoffset, testoffset) as ref,
                                array_agg(result ignore nulls)                                          as collection
                         from (
                                  select *,
                                         if(source = test, abs(sourceoffset - (testoffset)), null) as result
                                  from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
                                           cross join
                                       (select *
                                        from unnest(split(lower(testString),'')) as test with offset as testoffset)
                                  ) as results
                         group by ref
                            )
                   ) as comparison
          )
    
    )
    );
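
    The resemblance logic is harder to follow in SQL because of the LAG/LEAD and INTERSECT steps, so here is a minimal JavaScript sketch of the same idea. It is an illustrative port based on my reading of the query above; returning null for two empty strings is an assumption.

```javascript
// Illustrative port of the stringResemblance SQL above.
// Assumption: two empty strings return null.
function stringResemblance(sourceString, testString) {
  const source = Array.from(sourceString.toLowerCase());
  const test = Array.from(testString.toLowerCase());

  // Positions of the longer string are the reference
  // (the test string when both have the same length, as in the SQL).
  const refIsSource = test.length < source.length;
  const [refChars, otherChars] = refIsSource ? [source, test] : [test, source];
  if (refChars.length === 0) return null;

  // For each reference position, collect the distances to every equal
  // character in the other string (the "collection" column in the SQL).
  const collections = refChars.map((ch, i) => {
    const distances = new Set();
    otherChars.forEach((other, j) => {
      if (other === ch) distances.add(Math.abs(i - j));
    });
    return distances;
  });

  // True when two neighbouring positions share at least one distance.
  const shares = (a, b) => b !== undefined && [...a].some((d) => b.has(d));

  let hits = 0;
  collections.forEach((current, i) => {
    if (shares(current, collections[i - 1]) || shares(current, collections[i + 1])) {
      hits += 1; // this character continues a run of equally shifted matches
    }
  });
  return hits / refChars.length;
}

console.log(stringResemblance('benjamin artis', 'benjamin')); // roughly 0.5714
```

    A position only scores when a neighbour matches at the same offset shift, which is why isolated single-character matches do not count.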
    

    Here is a sample query:

    #standardSQL
    with test_subjects as (
      select 'benji' as name union all
      select 'benjamin' union all
      select 'benjamin alan artis' union all
      select 'ben artis' union all
      select 'artis benjamin' 
    )
    
    select name,
           `myproject.func.stringApproximation`('benjamin artis', name) as approximation,
           `myproject.func.stringResemblance`('benjamin artis', name) as resemblance
    from test_subjects
    
    order by resemblance desc
    

    This returns:

    +---------------------+---------------------+---------------------+
    | name                | approximation       | resemblance         |
    +---------------------+---------------------+---------------------+
    | artis benjamin      | 0.2653061224489796  | 0.8947368421052629  |
    +---------------------+---------------------+---------------------+
    | benjamin alan artis | 0.6078947368421053  | 0.8947368421052629  |
    +---------------------+---------------------+---------------------+
    | ben artis           | 0.4142857142857142  | 0.7142857142857143  |
    +---------------------+---------------------+---------------------+
    | benjamin            | 0.6125850340136053  | 0.5714285714285714  |
    +---------------------+---------------------+---------------------+
    | benji               | 0.36269841269841263 | 0.28571428571428575 |
    +---------------------+---------------------+---------------------+
    

    Edited: updated the resemblance algorithm to improve results.

  • 2020-12-03 16:14

    Try Flookup for Google Sheets... it's definitely faster than Levenshtein distance and it calculates percentage similarities right out of the box. One Flookup function you might find useful is this:

    FUZZYMATCH (string1, string2)

    Parameter Details

    1. string1: compares to string2.
    2. string2: compares to string1.

    The percentage similarity is then calculated based on these comparisons. Both parameters can be ranges.

    I'm currently trying to optimise it for large data sets, so your feedback would be very welcome.

    Edit: I'm the creator of Flookup.

  • 2020-12-03 16:17

    Levenshtein via JS would be the way to go. You can use the algorithm to get the absolute string distance, or convert it to a percentage similarity by calculating (strlen - distance) / strlen, where strlen is the length of the longer string.

    The easiest way to implement this would be to define a Levenshtein UDF that takes two inputs, a and b, and calculates the distance between them. The function could return a, b, and the distance.

    To invoke it, you'd then pass in the two URLs as columns aliased to 'a' and 'b':

    SELECT a, b, distance
    FROM
      Levenshtein(
         SELECT
           some_url AS a, other_url AS b
         FROM
           your_table
      )
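
    The Levenshtein function itself is not spelled out above, so here is a minimal JavaScript sketch of the standard dynamic-programming version, plus the percentage conversion. The function names are illustrative, and treating two empty strings as fully similar is an assumption. The function body could be used inside a JavaScript UDF (a `LANGUAGE js` function).

```javascript
// Standard two-row dynamic-programming Levenshtein distance.
function levenshtein(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  // previous[j] = distance between the first i-1 characters of s and the first j of t.
  let previous = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const current = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,        // deletion
        current[j - 1] + 1,     // insertion
        previous[j - 1] + cost  // substitution (or match)
      );
    }
    previous = current;
  }
  return previous[t.length];
}

// Similarity in [0, 1], relative to the longer string.
// Assumption: two empty strings are treated as identical (1).
function similarity(a, b) {
  const longest = Math.max(Array.from(a).length, Array.from(b).length);
  return longest === 0 ? 1 : (longest - levenshtein(a, b)) / longest;
}

console.log(levenshtein('kitten', 'sitting')); // 3
console.log(similarity('kitten', 'sitting'));  // 4/7, roughly 0.571
```

    Dividing by the longer length keeps the result between 0 and 1, since the distance can never exceed that length.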
    
  • 2020-12-03 16:19

    If you're familiar with Python's fuzzywuzzy, you can use the same functions in BigQuery through its JavaScript port, fuzzball, loaded as an external library from GCS.

    Steps:

    1. Download the JavaScript version of fuzzywuzzy (fuzzball).
    2. Take the compiled file of the library, dist/fuzzball.umd.min.js, and rename it to something clearer (like fuzzball.js).
    3. Upload it to a Google Cloud Storage bucket.
    4. Create a temp function to use the library in your query (set the path in OPTIONS to the relevant path):
    CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
    RETURNS FLOAT64
    LANGUAGE js AS """
      return fuzzball.token_set_ratio(a, b);
    """
    OPTIONS (
      library="gs://my-bucket/fuzzball.js");
    
    with data as (select "my_test_string" as a, "my_other_string" as b)
    
    SELECT  a, b, token_set_ratio(a, b) from data
    
  • 2020-12-03 16:20

    Below is a somewhat simpler version of the Hamming distance that uses WITH OFFSET instead of ROW_NUMBER() OVER():

    #standardSQL
    WITH Input AS (
      SELECT 'abcdef' AS strings UNION ALL
      SELECT 'defdef' UNION ALL
      SELECT '1bcdef' UNION ALL
      SELECT '1bcde4' UNION ALL
      SELECT '123de4' UNION ALL
      SELECT 'abc123'
    )
    SELECT 'abcdef' AS target, strings, 
      (SELECT COUNT(1) 
        FROM UNNEST(SPLIT('abcdef', '')) a WITH OFFSET x
        JOIN UNNEST(SPLIT(strings, '')) b WITH OFFSET y
        ON x = y AND a != b) hamming_distance
    FROM Input
    
  • 2020-12-03 16:22

    I couldn't find a direct answer to this, so I propose this solution in standard SQL:

    #standardSQL
    CREATE TEMP FUNCTION HammingDistance(a STRING, b STRING) AS (
      (
      SELECT
        SUM(counter) AS diff
      FROM (
        SELECT
          CASE
            WHEN X.value != Y.value THEN 1
            ELSE 0
          END AS counter
        FROM (
          SELECT
            value,
            ROW_NUMBER() OVER() AS row
          FROM
            UNNEST(SPLIT(a, "")) AS value ) X
        JOIN (
          SELECT
            value,
            ROW_NUMBER() OVER() AS row
          FROM
            UNNEST(SPLIT(b, "")) AS value ) Y
        ON
          X.row = Y.row )
       )
    );
    
    WITH Input AS (
      SELECT 'abcdef' AS strings UNION ALL
      SELECT 'defdef' UNION ALL
      SELECT '1bcdef' UNION ALL
      SELECT '1bcde4' UNION ALL
      SELECT '123de4' UNION ALL
      SELECT 'abc123'
    )
    
    SELECT strings, 'abcdef' as target, HammingDistance('abcdef', strings) as hamming_distance
    FROM Input;
    

    Compared to other solutions (like this one), it takes two strings (of the same length, following the definition of Hamming distance) and outputs the expected distance.
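
    For comparison, the same position-by-position count can be sketched in JavaScript. This is a minimal illustration rather than a replacement for the SQL; rejecting strings of different lengths is a choice of this sketch, whereas the SQL join above simply ignores positions that exist in only one string.

```javascript
// Hamming distance: number of positions at which two equal-length strings differ.
function hammingDistance(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  // Sketch-specific choice: enforce the equal-length requirement explicitly.
  if (s.length !== t.length) {
    throw new RangeError('Hamming distance requires strings of equal length');
  }
  let distance = 0;
  for (let i = 0; i < s.length; i++) {
    if (s[i] !== t[i]) distance += 1;
  }
  return distance;
}

// Same inputs as the SQL example, compared against 'abcdef'.
for (const candidate of ['abcdef', 'defdef', '1bcdef', '1bcde4', '123de4', 'abc123']) {
  console.log(candidate, hammingDistance('abcdef', candidate));
}
// Prints distances 0, 3, 1, 2, 4, 3 in that order.
```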

