Find a series of data using non-exact measurements (fuzzy logic)

This is a more complex follow-up question to: Efficient way to look up sequential values

Each Product can have many Segment rows (thousands). Each segment has a position column that starts at 1 for each product (1, 2, 3, 4, 5, etc.) and a value column that can contain any value, such as 323.113, 5423.231, 873.42, 422.64, 763.1, etc. The data is read-only.

It may help to think of the product as a song and the segments as a set of musical notes in the song.

Given a subset of contiguous segments, like a snippet of a song, I would like to identify potential matches for products. However, due to potential errors in measurements, the segments in the subset may not match the segments in the database exactly.

How can I identify product candidates by finding the segments of products which most closely match the subset of segments I have measured? Also, is a database the best medium for this type of data?

-

Here are just some thoughts on how I was planning to approach this problem. Please don't take these as exact requirements. I am open to any kind of algorithm that makes this work as well as possible. I was thinking there need to be multiple threshold variables for determining closeness. One possibility might be to implement a proximity threshold and a match threshold.

For example, given these values:

Product A contains these segments: 11,21,13,13,15.
Measurement 1 has captured: 20,14,14,15.
Measurement 2 has captured: 11,21,78,13.
Measurement 3 has captured: 15,13,21,13,11.

If a proximity threshold allowed the measured segment to be 1 above or below the actual segment, then Measurement 1 may match Product A because, although many segments do not match exactly, they are within the proximity threshold relative to the actual values.

If a match threshold allowed for measurements with matches of 3 or more, Measurement 2 may return Product A because, although one of the segments (78) far exceeds the proximity threshold, it still matches 3 segments in the correct order and so is within the match threshold.

Measurement 3 would not match Product A because, although all measured segments exist in the actual segments, they are not within the proximity or match thresholds.
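
To make the two thresholds concrete, here is a minimal Python sketch of one way they could be combined. The function names, the default values (`proximity=1`, `min_hits=3`), and the idea of sliding the measurement across every contiguous window of the product are assumptions for illustration, not a fixed requirement.

```python
from typing import Sequence


def best_window_hits(segments: Sequence[float], measured: Sequence[float],
                     proximity: float) -> int:
    """Return the highest number of in-tolerance positions over all alignments."""
    best = 0
    # Slide the measurement across every contiguous window of the product.
    for offset in range(len(segments) - len(measured) + 1):
        hits = sum(
            1
            for seg, meas in zip(segments[offset:], measured)
            if abs(seg - meas) <= proximity  # proximity threshold
        )
        best = max(best, hits)
    return best


def is_candidate(segments: Sequence[float], measured: Sequence[float],
                 proximity: float = 1, min_hits: int = 3) -> bool:
    """A product is a candidate when enough positions fall within tolerance."""
    return best_window_hits(segments, measured, proximity) >= min_hits  # match threshold


product_a = [11, 21, 13, 13, 15]
print(is_candidate(product_a, [20, 14, 14, 15]))      # Measurement 1 -> True
print(is_candidate(product_a, [11, 21, 78, 13]))      # Measurement 2 -> True
print(is_candidate(product_a, [15, 13, 21, 13, 11]))  # Measurement 3 -> False
```

Measurement 1 scores 4 hits when aligned with positions 2 to 5, Measurement 2 scores 3 hits when aligned with positions 1 to 4, and Measurement 3 scores only 1 hit.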

Update: One of the answers asked me to define what I mean by most closely match. I'm not exactly sure how to answer that, but I'll try to explain by continuing with the song analogy. Let's say the segments represent maximum frequencies of a recorded song. If I record that same song again it will be similar, but due to background noise and other limitations of recording equipment, some of the frequencies will match, some will be close, and a few will be way off. In this scenario, how would you define when one recording "matches" another? That's the same kind of matching logic I'm looking for to use in this problem.

If you take your song example literally, one approach is to boil your input down to a bit-vector fingerprint and then look that fingerprint up in a database as an exact match. You can increase the chances of finding a good match by extracting several fingerprints from your input and/or by also trying, for example, all bit-vectors that are only 1 or so bit errors away from your fingerprint.
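
As a rough illustration of the fingerprint idea, the Python sketch below uses a deliberately simple fingerprint that I am assuming for the example: one bit per adjacent pair of values, set when the value rises. A real fingerprint would need to be designed around your data. The function names are hypothetical, and the lookup assumes the measurement has the same length as the indexed windows.

```python
from typing import Dict, Iterator, List, Sequence, Set, Tuple


def fingerprint(values: Sequence[float]) -> int:
    """Pack one bit per adjacent pair: 1 if the value rises, 0 otherwise."""
    bits = 0
    for prev, cur in zip(values, values[1:]):
        bits = (bits << 1) | (1 if cur > prev else 0)
    return bits


def neighbours(bits: int, width: int) -> Iterator[int]:
    """Yield the fingerprint itself and every value exactly one bit flip away."""
    yield bits
    for i in range(width):
        yield bits ^ (1 << i)


def build_fingerprint_index(products: Dict[str, Sequence[float]],
                            window: int) -> Dict[int, List[Tuple[str, int]]]:
    """Index the fingerprint of every contiguous window of every product."""
    index: Dict[int, List[Tuple[str, int]]] = {}
    for name, segments in products.items():
        for start in range(len(segments) - window + 1):
            key = fingerprint(segments[start:start + window])
            index.setdefault(key, []).append((name, start))
    return index


def lookup(index: Dict[int, List[Tuple[str, int]]],
           measured: Sequence[float]) -> Set[Tuple[str, int]]:
    """Exact-match lookups for the measured fingerprint and its 1-bit neighbours."""
    width = len(measured) - 1  # number of bits in the fingerprint
    found: Set[Tuple[str, int]] = set()
    for key in neighbours(fingerprint(measured), width):
        found.update(index.get(key, []))
    return found


index = build_fingerprint_index({"A": [11, 21, 13, 13, 15]}, window=4)
print(lookup(index, [20, 14, 14, 15]))  # {('A', 1)}: exact fingerprint match
print(lookup(index, [11, 21, 78, 13]))  # {('A', 0)}: found via a one-bit flip
```

Each lookup is an exact key match, so the index could equally be an indexed database column; the tolerance comes entirely from probing the neighbouring fingerprints.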

If you have access to the ACM digital library, you can read a description of this sort of approach in "The Shazam Music Recognition service" at http://delivery.acm.org/10.1145/1150000/1145312/p44-wang.pdf?ip=94.195.253.182&acc=ACTIVE%20SERVICE&CFID=53180383&CFTOKEN=41480065&acm=1321038137_73cd62cf2b16cd73ca9070e7d5ea0744. There is also some information at http://www.music.mcgill.ca/~alastair/621/porter11fingerprint-summary.pdf.

The input format you describe suggests that you might be able to do something with the random projection method described in http://en.wikipedia.org/wiki/Locality_sensitive_hashing.
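
A minimal sketch of the random projection idea in plain Python follows. The function names, the 16-bit signature length, and the fixed seed are assumptions for illustration. Each bit records which side of a random hyperplane a window of segment values falls on, so windows that point in a similar direction tend to share most bits.

```python
import random
from typing import List, Sequence


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw n_bits random Gaussian vectors of length dim (seeded for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector: Sequence[float], planes: List[List[float]]) -> int:
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    bits = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


planes = make_hyperplanes(dim=4, n_bits=16)
stored = signature([21, 13, 13, 15], planes)    # a window of Product A
measured = signature([20, 14, 14, 15], planes)  # a noisy measurement of it
print(hamming(stored, measured))  # a small distance suggests a candidate match
```

This variant approximates angular (cosine) similarity, which ignores overall magnitude, so it may help to subtract a typical value from each segment before hashing. Whether that suits your data is something you would have to test.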

To answer your second question, depending on exactly what a position corresponds to, you might consider boiling down the numbers to hash fingerprints made up of bits or characters, and storing these in a text search database, such as Apache Lucene.
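
To show what such a text-search style index might look like, here is a small in-memory Python sketch. It stands in for a real search engine and does not use Lucene's API. The bucket width, the n-gram length, and the function names are all assumptions: each value is quantised into a bucket id, overlapping runs of bucket ids act as the "words", and products are ranked by how many words they share with the measurement.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Set, Tuple


def tokens(values: Sequence[float], bucket: float, n: int) -> List[Tuple[int, ...]]:
    """Quantise values into buckets, then emit overlapping n-grams of bucket ids."""
    ids = [int(v // bucket) for v in values]
    return [tuple(ids[i:i + n]) for i in range(len(ids) - n + 1)]


def build_ngram_index(products: Dict[int, Sequence[float]], bucket: float,
                      n: int) -> Dict[Tuple[int, ...], Set[int]]:
    """Inverted index: n-gram -> set of product ids containing it."""
    index: Dict[Tuple[int, ...], Set[int]] = defaultdict(set)
    for product_id, segments in products.items():
        for gram in tokens(segments, bucket, n):
            index[gram].add(product_id)
    return index


def rank(index: Dict[Tuple[int, ...], Set[int]], measured: Sequence[float],
         bucket: float, n: int) -> List[Tuple[int, int]]:
    """Score each product by how many of the measured n-grams it contains."""
    votes: Counter = Counter()
    for gram in tokens(measured, bucket, n):
        for product_id in index.get(gram, ()):
            votes[product_id] += 1
    return votes.most_common()
```

One known weakness of fixed buckets is that a value close to a bucket boundary can land in a different bucket after a small measurement error, so you may want to index each value under its neighbouring buckets as well.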

From the information you posted, this can be solved with Edmonds's Blossom V perfect matching algorithm. You can either minimize or maximize the function, and it will always find the best match. A brute-force solution with two loops may also work. Wikipedia on Edmonds's matching algorithm: http://en.wikipedia.org/wiki/Edmonds%27s_matching_algorithm

You need to come up with a definition for "most closely match". I don't know how anyone here can help you with that since no one here is going to know the business requirements or intricacies of the data. Your two methods both sound reasonable, but I have no idea if they actually are or not.

As for whether or not a database is the correct medium for this kind of data, I'd say that a database is probably the perfect medium for storing the data, but it is very likely not the correct medium for processing the data. Whether it's possible or not will depend on your final definition of what constitutes "most closely match".

As a quick note, SSIS has some fuzzy match capabilities built into it for processing data. I've only played around with it though and that was a couple of years ago, so I don't know if it would work for what you're doing or not.

Tom Hunter

Could you match the measurements against each product's segments position by position and calculate the difference at each position? Then slide the measurements along by one position and calculate the differences again. Find which slide position gives the lowest total difference. Do this for every product, and you will know which product the measurements match most closely.

Test tables and data:

CREATE TABLE [dbo].[Segment]
(
    [ProductId] INT,
    [Position] INT,
    [Value] INT
)

INSERT  [dbo].[Segment]
VALUES  (1, 1, 300),
        (1, 2, 5000),
        (1, 3, 900),
        (1, 4, 400),
        (1, 5, 800),

        (2, 1, 400),
        (2, 2, 6000),
        (2, 3, 1000),
        (2, 4, 500),
        (2, 5, 900),

        (3, 1, 400),
        (3, 2, 5400),
        (3, 3, 900),
        (3, 4, 400),
        (3, 5, 900)

CREATE TABLE #Measurement
(
    [Position] INT,
    [Value] INT
)

INSERT  #Measurement
VALUES  (1, 5400),
        (2, 900),
        (3, 400)

As you can see, the measurements match (a subset of) the third product exactly.

Some helpers:

CREATE TABLE #ProductSegmentCount
(
    [ProductId] INT,
    [SegmentCount] INT
)

INSERT #ProductSegmentCount
SELECT [ProductId], MAX([Position])
FROM [dbo].[Segment]
GROUP BY [ProductId]

DECLARE @MeasurementSegmentCount INT = (SELECT MAX([Position]) FROM #Measurement)

A recursive common table expression to show the products ordered by closest match:

;WITH [cteRecursive] AS
(
    -- Anchor member: compare the measurement with every product at offset 0.
    SELECT  s.[ProductId],
            0 AS [RecursionId],
            m.[Position] AS [MeasurementPosition],
            s.[Position] AS [SegmentPosition],
            ABS(m.[Value] - s.[Value]) AS [Difference]
    FROM #Measurement m
    INNER JOIN [dbo].[Segment] s
        ON m.[Position] = s.[Position]
    UNION ALL
    -- Recursive member: compare again at the offset given by the previous
    -- RecursionId (RecursionId 1 repeats offset 0, RecursionId 2 is offset 1, ...).
    SELECT s.[ProductId],
            [RecursionId] + 1 AS [RecursionId],
            m.[Position],
            s.[Position],
            ABS(m.[Value] - s.[Value]) AS [Difference]
    FROM [cteRecursive] r
    INNER JOIN #Measurement m
        ON m.[Position] = r.[MeasurementPosition]
    INNER JOIN [dbo].[Segment] s
        ON r.[ProductId] = s.[ProductId]
        AND m.[Position] + (r.[RecursionId]) = s.[Position]
    INNER JOIN #ProductSegmentCount psc
        ON s.[ProductId] = psc.[ProductId]
    -- The largest offset is the product length minus the measurement length.
    WHERE [RecursionId] <= ABS(@MeasurementSegmentCount - psc.[SegmentCount])
)-- select * from [cteRecursive] where [ProductId] = 3 order by RecursionId, SegmentPosition
, [cteDifferences] AS
(
    -- Total difference per product and alignment.
    SELECT [ProductId], [RecursionId], SUM([Difference]) AS [Difference]
    FROM [cteRecursive]
    GROUP BY [ProductId], [RecursionId]
)-- select * from [cteDifferences]
-- Lowest total per product, closest match first.
SELECT [ProductId], MIN([Difference]) AS [Difference]
FROM [cteDifferences]
GROUP BY [ProductId]
ORDER BY MIN([Difference])
OPTION (MAXRECURSION 0)
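
For comparison, the same sliding comparison can be sketched outside the database in a few lines of Python, using the test data above. The function names are hypothetical, and the sketch assumes every product has at least as many segments as the measurement.

```python
from typing import Dict, List, Sequence, Tuple


def min_window_difference(segments: Sequence[float],
                          measured: Sequence[float]) -> float:
    """Smallest sum of absolute differences over every alignment of the measurement."""
    return min(
        sum(abs(s - m) for s, m in zip(segments[offset:], measured))
        for offset in range(len(segments) - len(measured) + 1)
    )


def rank_products(products: Dict[int, Sequence[float]],
                  measured: Sequence[float]) -> List[Tuple[int, float]]:
    """Products ordered from closest match to furthest."""
    scores = [(pid, min_window_difference(segs, measured))
              for pid, segs in products.items()]
    return sorted(scores, key=lambda item: item[1])


products = {
    1: [300, 5000, 900, 400, 800],
    2: [400, 6000, 1000, 500, 900],
    3: [400, 5400, 900, 400, 900],
}
print(rank_products(products, [5400, 900, 400]))
# [(3, 0), (1, 400), (2, 800)]
```

Product 3 scores 0 because the measurement lines up exactly with its positions 2 to 4.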

Source: https://stackoverflow.com/questions/8042327/find-a-series-of-data-using-non-exact-measurements-fuzzy-logic

Tags: sql, sql-server, algorithm, sql-server-2008, fuzzy-search