Find a series of data using non-exact measurements (fuzzy logic)

不羁的心 提交于 2019-12-05 05:47:50

If you take literally your song example, one approach is to boil down your input to a bit-vector fingerprint, and then look up that fingerprint in a database as an exact match. You can increase the chances of finding a good match by extracting several fingerprints from your input and/or trying e.g. all bit-vectors that are only 1 or bit-errors away from your fingerprint.

If you have access to the ACM digital library, you can read a description of this sort of approach in "The Shazam Music Recognition service" at acm=1321038137_73cd62cf2b16cd73ca9070e7d5ea0744">http://delivery.acm.org/10.1145/1150000/1145312/p44-wang.pdf?ip=94.195.253.182&acc=ACTIVE%20SERVICE&CFID=53180383&CFTOKEN=41480065&acm=1321038137_73cd62cf2b16cd73ca9070e7d5ea0744. There is also some information at http://www.music.mcgill.ca/~alastair/621/porter11fingerprint-summary.pdf.

The input format you describe suggests that you might be able to do something with the random projection method described in http://en.wikipedia.org/wiki/Locality_sensitive_hashing.

To answer your second question, depending on exactly what a position corresponds to, you might consider boiling down the numbers to hash fingerprints made up of bits or characters, and storing these in a text search database, such as Apache Lucene.

From the information you posted this can be solved with the edmond's blossom v perfect match algorithm. Either you can minimize or maximize the function and it will always find the best match. Maybe you can use a brute force solution with 2 loops. The wikipedia about edmond's matching algorithm: http://en.wikipedia.org/wiki/Edmonds%27s_matching_algorithm

You need to come up with a definition for "most closely match". I don't know how anyone here can help you with that since no one here is going to know the business requirements or intricacies of the data. Your two methods both sound reasonable, but I have no idea if they actually are or not.

As for whether or not a database is the correct medium for this kind of data, I'd say that a database is probably the perfect medium for the data, but it is very like not the correct medium for processing the data. Whether it's possible or not will depend on your final solution on what constitutes "most closely match".

As a quick note, SSIS has some fuzzy match capabilities built into it for processing data. I've only played around with it though and that was a couple of years ago, so I don't know if it would work for what you're doing or not.

Tom Hunter

Could you take the approach of matching the measurements against each segment position by position and calculating the difference for each position. Then slide the measurements along one position and calculate the difference. Then find which slide position scored the lowest difference. Do this for every product and then you know which product the measurements match to closest.

Test tables and data:

CREATE TABLE [dbo].[Segment]
(
    [ProductId] INT,
    [Position] INT,
    [Value] INT
)

INSERT  [dbo].[Segment]
VALUES  (1, 1, 300),
        (1, 2, 5000),
        (1, 3, 900),
        (1, 4, 400),
        (1, 5, 800),

        (2, 1, 400),
        (2, 2, 6000),
        (2, 3, 1000),
        (2, 4, 500),
        (2, 5, 900),

        (3, 1, 400),
        (3, 2, 5400),
        (3, 3, 900),
        (3, 4, 400),
        (3, 5, 900)

CREATE TABLE #Measurement
(
    [Position] INT,
    [Value] INT
)

INSERT  #Measurement
VALUES  (1, 5400),
        (2, 900),
        (3, 400)

As you can see, the measurements match (a subset of) the third product exactly.

Some helpers:

CREATE TABLE #ProductSegmentCount
(
    [ProductId] INT,
    [SegmentCount] INT
)

INSERT #ProductSegmentCount
SELECT [ProductId], MAX([Position])
FROM [dbo].[Segment]
GROUP BY [ProductId]

DECLARE @MeasurementSegmentCount INT = (SELECT MAX([Position]) FROM #Measurement)

A recursive common table expression to show the products ordered by closest match:

;WITH [cteRecursive] AS
(
    SELECT  s.[ProductId],
            0 AS [RecursionId],
            m.[Position] AS [MeasurementPosition],
            s.[Position] AS [SegmentPosition],
            ABS(m.[Value] - s.[Value]) AS [Difference]
    FROM #Measurement m
    INNER JOIN [dbo].[Segment] s 
        ON m.[Position] = s.[Position]
    UNION ALL
    SELECT s.[ProductId],
            [RecursionId] + 1 AS [RecursionId],
            m.[Position],
            s.[Position],
            ABS(m.[Value] - s.[Value]) AS [Difference]
    FROM [cteRecursive] r
    INNER JOIN #Measurement m
        ON m.[Position] = r.[MeasurementPosition]
    INNER JOIN [dbo].[Segment] s 
        ON r.[ProductId] = s.[ProductId]
        AND m.[Position] + (r.[RecursionId]) = s.[Position]
    INNER JOIN #ProductSegmentCount psc
        ON s.[ProductId] = psc.[ProductId]
    WHERE [RecursionId] <= ABS(@MeasurementSegmentCount - psc.[SegmentCount])
)-- select * from [cteRecursive] where [ProductId] = 3 order by RecursionId, SegmentPosition
, [cteDifferences] AS
(
    SELECT [ProductId], [RecursionId], SUM([Difference]) AS [Difference]
    FROM [cteRecursive]
    GROUP BY [ProductId], [RecursionId]
)-- select * from [cteDifferences]
SELECT [ProductId], MIN([Difference]) AS [Difference]
FROM [cteDifferences] 
GROUP BY [ProductId]
ORDER BY MIN([Difference])
OPTION (MAXRECURSION 0)
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!