SQL - similar data in column

野的像风

Is there any way to find similar values in a column? Example:

I want the query to return the table data without row 4 (green tree), because there is no similar data to g

3 answers

谎友^

2021-01-20 16:06

You could use SOUNDEX to do this.

Sample data;

CREATE TABLE #SampleData (Column1 int, Column2 varchar(10))
INSERT INTO #SampleData (Column1, Column2)
VALUES
(1,'blue car')
,(2,'red doll')
,(3,'blue cars')
,(4,'green tree')
,(5,'red dolly')

The following code will use soundex to create a list of similar entries in column2. It then uses a different sub query to see how many occurrences of that soundex field appear;

SELECT
a.GroupingField
,a.Title
,b.SimilarFields
FROM (
        SELECT
        SOUNDEX(Column2) GroupingField
        ,MAX(Column2) Title --Just return a unique title for this soundex group
        FROM #SampleData
        GROUP BY SOUNDEX(Column2)
      ) a
LEFT JOIN   (
                SELECT
                SOUNDEX(Column2) GroupingField
                ,COUNT(Column2) SimilarFields --How many fields are in the soundex group?
                FROM #SampleData
                GROUP BY SOUNDEX(Column2)
            ) b
ON a.GroupingField = b.GroupingField
WHERE b.SimilarFields > 1

The results look like this (I've left the soundex field in to show you what it looks like);

GroupingField   Title       SimilarFields
B400            blue cars   2
R300            red dolly   2
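
Since both subqueries group by the same SOUNDEX value, the join can probably be avoided. A minimal sketch, assuming the #SampleData table above, that computes the title and the count in one pass and filters the groups with HAVING:

```sql
SELECT
    SOUNDEX(Column2) AS GroupingField
    ,MAX(Column2)    AS Title          --One representative title per soundex group
    ,COUNT(Column2)  AS SimilarFields  --How many rows share this soundex code
FROM #SampleData
GROUP BY SOUNDEX(Column2)
HAVING COUNT(Column2) > 1              --Keep only groups with at least two rows
```

One caveat worth checking on your own data: the codes B400 and R300 are what Soundex gives for "blue" and "red" alone, so the grouping here appears to depend on the first word only. If that holds, a value such as 'blue truck' would fall into the same group as 'blue car'.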

Some further reading on soundex https://msdn.microsoft.com/en-gb/library/ms187384.aspx

Edit: as per your request, to get back to the original data you may as well push the results into a temp table. Change the query I've given you to put an INTO before the FROM clause;

SELECT
a.GroupingField
,a.Title
,b.SimilarFields
INTO #Duplicates
FROM (
        SELECT
        SOUNDEX(Column2) GroupingField
        ,MAX(Column2) Title --Just return a unique title for this soundex group
        FROM #SampleData
        GROUP BY SOUNDEX(Column2)
      ) a
LEFT JOIN   (
                SELECT
                SOUNDEX(Column2) GroupingField
                ,COUNT(Column2) SimilarFields --How many fields are in the soundex group?
                FROM #SampleData
                GROUP BY SOUNDEX(Column2)
            ) b
ON a.GroupingField = b.GroupingField
WHERE b.SimilarFields > 1

Then use the following query to link back to your original data;

SELECT
a.GroupingField
,a.Title
,a.SimilarFields
,b.Column1
,b.Column2
FROM #Duplicates a
JOIN #SampleData b
ON a.GroupingField = SOUNDEX(b.Column2)
ORDER BY a.GroupingField

Would give the following result;

GroupingField   Title       SimilarFields   Column1     Column2
B400            blue cars   2               1           blue car
B400            blue cars   2               3           blue cars
R300            red dolly   2               5           red dolly
R300            red dolly   2               2           red doll
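
If the temp table is not needed for anything else, a window function is one possible alternative. This is only a sketch, assuming SQL Server 2005 or later (for COUNT(*) OVER) and the same #SampleData table; the Title column is left out because each original row is returned anyway:

```sql
SELECT
    s.GroupingField
    ,s.SimilarFields
    ,s.Column1
    ,s.Column2
FROM (
        SELECT
        Column1
        ,Column2
        ,SOUNDEX(Column2) AS GroupingField
        --Count the rows sharing this soundex code without collapsing them
        ,COUNT(*) OVER (PARTITION BY SOUNDEX(Column2)) AS SimilarFields
        FROM #SampleData
     ) s
WHERE s.SimilarFields > 1
ORDER BY s.GroupingField, s.Column1
```

The window count has to be computed in a subquery because it cannot be referenced directly in a WHERE clause at the same level.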

Remember to drop the temp table when you are done;

DROP TABLE #Duplicates

鱼传尺愫

2021-01-20 16:06

This approach uses a very basic notion of similarity (one phrase being a prefix of another) but can be extended to a better definition. It's not very efficient, mind you. The + 1 in count(1) + 1 counts the base phrase itself.

create table phrases ( phrase varchar(max) )
insert phrases values( 'blue car' ), ( 'blue cars' ), ( 'green tree' ), ( 'red doll' ), ( 'red dolly' )
-- create function must be the first statement in its batch
go

create function dbo.fnSimilar( @s1 varchar(max), @s2 varchar(max) )
returns int
begin
    if @s1 = @s2 return 0 -- a phrase is not similar to itself
    if @s1 like @s2 + '%' return 1 -- @s1 starts with @s2
    if @s2 like @s1 + '%' return 2 -- @s2 starts with @s1
    return 0
end
go

select x.phrase, similar = count(1) + 1 from 
(
    select p1.phrase from phrases p1
    inner join phrases p2 on dbo.fnSimilar( p2.phrase, p1.phrase ) = 1
) x
group by x.phrase

Result:

phrase      similar
--------    -------
blue car    2
red doll    2

一个人的身影

2021-01-20 16:19

As Gar rightfully commented, you have to define what you mean by "similarity". But if all you need is some fixed number of equal leading characters (8 in your example), you can do the following:

create table myTest
(
    id int,
    name varchar(20)
);

insert into myTest values(1, 'blue car');
insert into myTest values(2, 'red doll');
insert into myTest values(3, 'blue cars');
insert into myTest values(4, 'green tree');
insert into myTest values(5, 'red dolly');

select left(name,8), count(*) 
from myTest 
group by left(name,8) 
having count(*) > 1;
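
The query above returns the shared prefixes rather than the original rows. To get the rows themselves, one option is to join the grouped prefixes back to the table. A sketch, assuming the same myTest table; the alias names t, g and prefix are illustrative:

```sql
select t.id, t.name
from myTest t
join (
    -- prefixes shared by at least two rows
    select left(name, 8) as prefix
    from myTest
    group by left(name, 8)
    having count(*) > 1
) g on left(t.name, 8) = g.prefix
order by t.id;
```

With the sample data this should list rows 1, 2, 3 and 5 and leave out row 4 (green tree), whose 8-character prefix is not shared with any other row.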
