SQL - similar data in column

野的像风 2021-01-20 15:52

Is there any way to find similar values in a column? Example:

I want the query to return the table data without row 4 (green tree), because there is no similar data to g

3 Answers
  • 2021-01-20 16:06

    You could use SOUNDEX to do this.

    Sample data;

    CREATE TABLE #SampleData (Column1 int, Column2 varchar(10))
    INSERT INTO #SampleData (Column1, Column2)
    VALUES
    (1,'blue car')
    ,(2,'red doll')
    ,(3,'blue cars')
    ,(4,'green tree')
    ,(5,'red dolly')
    

    The following query uses SOUNDEX to group similar entries in Column2. A second subquery counts how many rows fall into each SOUNDEX group, and the WHERE clause keeps only the groups with more than one row;

    SELECT
    a.GroupingField
    ,a.Title
    ,b.SimilarFields
    FROM (
            SELECT
            SOUNDEX(Column2) GroupingField
            ,MAX(Column2) Title --Just return a unique title for this soundex group
            FROM #SampleData
            GROUP BY SOUNDEX(Column2)
          ) a
    LEFT JOIN   (
                    SELECT
                    SOUNDEX(Column2) GroupingField
                    ,COUNT(Column2) SimilarFields --How many fields are in the soundex group?
                    FROM #SampleData
                    GROUP BY SOUNDEX(Column2)
                ) b
    ON a.GroupingField = b.GroupingField
    WHERE b.SimilarFields > 1
    

    The results look like this (I've left the soundex field in to show you what it looks like);

    GroupingField   Title       SimilarFields
    B400            blue cars   2
    R300            red dolly   2
    

    Some further reading on soundex https://msdn.microsoft.com/en-gb/library/ms187384.aspx
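
    One caveat worth checking on your own data: judging by the B400 and R300 codes above, SOUNDEX seems to encode only the first word of each value here, so phrases that merely share a first word may land in the same group. A quick check, assuming the same #SampleData table ('blue truck' is a made-up value, not part of the question);

    ```sql
    -- Show the SOUNDEX code next to every value to see what is actually being grouped
    SELECT Column1, Column2, SOUNDEX(Column2) AS GroupingField
    FROM #SampleData
    ORDER BY GroupingField, Column1

    -- Hypothetical comparison: do two different phrases with the same first word collide?
    SELECT SOUNDEX('blue car') AS Code1, SOUNDEX('blue truck') AS Code2
    ```

    If the two codes match, SOUNDEX on its own is probably too coarse for multi-word values.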

    Edit: as per your request, to get back to the original data you can push the result into a temp table. Change the query I've given you by adding an INTO clause before the FROM;

    SELECT
    a.GroupingField
    ,a.Title
    ,b.SimilarFields
    INTO #Duplicates
    FROM (
            SELECT
            SOUNDEX(Column2) GroupingField
            ,MAX(Column2) Title --Just return a unique title for this soundex group
            FROM #SampleData
            GROUP BY SOUNDEX(Column2)
          ) a
    LEFT JOIN   (
                    SELECT
                    SOUNDEX(Column2) GroupingField
                    ,COUNT(Column2) SimilarFields --How many fields are in the soundex group?
                    FROM #SampleData
                    GROUP BY SOUNDEX(Column2)
                ) b
    ON a.GroupingField = b.GroupingField
    WHERE b.SimilarFields > 1
    

    Then use the following query to link back to your original data;

    SELECT
    a.GroupingField
    ,a.Title
    ,a.SimilarFields
    ,b.Column1
    ,b.Column2
    FROM #Duplicates a
    JOIN #SampleData b
    ON a.GroupingField = SOUNDEX(b.Column2)
    ORDER BY a.GroupingField
    

    This gives the following result;

    GroupingField   Title       SimilarFields   Column1     Column2
    B400            blue cars   2               1           blue car
    B400            blue cars   2               3           blue cars
    R300            red dolly   2               5           red dolly
    R300            red dolly   2               2           red doll
    

    Remember to drop the temp table when you are done;

    DROP TABLE #Duplicates
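
    If the temp table is only there to get back to the original rows, a window function can probably do the same in one pass. A minimal sketch, assuming SQL Server 2005 or later and that #SampleData still exists when you run it;

    ```sql
    -- Count the rows per SOUNDEX group next to every original row,
    -- then keep only the rows whose group has more than one member
    SELECT s.GroupingField, s.SimilarFields, s.Column1, s.Column2
    FROM (
            SELECT
            Column1
            ,Column2
            ,SOUNDEX(Column2) AS GroupingField
            ,COUNT(*) OVER (PARTITION BY SOUNDEX(Column2)) AS SimilarFields
            FROM #SampleData
         ) s
    WHERE s.SimilarFields > 1
    ORDER BY s.GroupingField, s.Column1
    ```

    This leaves out the Title column from the earlier query; add MAX(Column2) OVER (PARTITION BY SOUNDEX(Column2)) if you still want it.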
    
  • 2021-01-20 16:06

    This approach uses a very basic notion of similarity (one phrase being a prefix of another) but can be extended to a better definition. It's not very efficient, mind you. The count(1) + 1 adds the base phrase itself to the count of phrases that extend it.

    create table phrases ( phrase varchar(max) )
    insert phrases values( 'blue car' ), ( 'blue cars' ), ( 'green tree' ), ( 'red doll' ), ( 'red dolly' )
    -- create function has to be the first statement in its batch
    go

    create function dbo.fnSimilar( @s1 varchar(max), @s2 varchar(max) )
    returns int
    as
    begin
        if @s1 = @s2 return 0 -- a phrase is not similar to itself
        if @s1 like @s2 + '%' return 1 -- @s1 starts with @s2
        if @s2 like @s1 + '%' return 2 -- @s2 starts with @s1
        return 0
    end
    go
    
    select x.phrase, similar = count(1) + 1 from 
    (
        select p1.phrase from phrases p1
        inner join phrases p2 on dbo.fnSimilar( p2.phrase, p1.phrase ) = 1
    ) x
    group by x.phrase
    

    Result:

    phrase      similar
    --------    -------
    blue car    2
    red doll    2
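
    Since the scalar function is called for every pair of rows, the same prefix test can also be written directly in the join, which usually gives the optimizer more to work with. A sketch under the same assumptions (the phrases table above; any %, _ or [ inside a phrase would be treated as a LIKE wildcard);

    ```sql
    -- p1 is the base phrase, p2 is any other phrase that starts with it
    select p1.phrase, similar = count(*) + 1 -- +1 counts the base phrase itself
    from phrases p1
    inner join phrases p2
        on  p2.phrase <> p1.phrase
        and p2.phrase like p1.phrase + '%'
    group by p1.phrase
    ```

    This is only meant to mirror the case where fnSimilar returns 1; it is still a pairwise comparison, so it will not scale well on large tables.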
    
  • 2021-01-20 16:19

    As Gar rightly commented, you have to define what you mean by "similarity". But if all you need is a fixed number of matching leading characters (8 in your example), you can do the following:

    create table myTest
    (
        id int,
        name varchar(20)
    );
    
    insert into myTest values(1, 'blue car');
    insert into myTest values(2, 'red doll');
    insert into myTest values(3, 'blue cars');
    insert into myTest values(4, 'green tree');
    insert into myTest values(5, 'red dolly');
    
    select left(name,8), count(*) 
    from myTest 
    group by left(name,8) 
    having count(*) > 1;
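
    The query above returns one row per shared prefix. To get the original rows back (everything except 'green tree' for this sample data), one option is to count per prefix with a window function. A minimal sketch, assuming SQL Server 2005 or later; prefix_count is just an illustrative alias;

    ```sql
    -- Count how many rows share each 8-character prefix,
    -- then keep only the rows whose prefix occurs more than once
    select t.id, t.name
    from (
        select id, name,
               count(*) over (partition by left(name, 8)) as prefix_count
        from myTest
    ) t
    where t.prefix_count > 1
    order by t.name;
    ```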
    