What are ways to match street addresses in SQL Server?

时光说笑 2021-01-01 03:44

We have a column for street addresses:

123 Maple Rd.
321 1st Ave.
etc...

Is there any way to match these addresses t

8 Answers
  • 2021-01-01 04:15

    I think the first step for you is to better define how generous or not you're going to be regarding differing addresses. For example, which of these match and which don't:

    123 Maple Street
    123 Maple St
    123 maple street
    123 mpale street
    123 maple
    123. maple st
    123 N maple street
    123 maple ave
    123 maple blvd
    

    Are there both a Maple Street and a Maple Blvd in the same area? What about Oak Street vs. Oak Blvd?

    For example, where I live there are many streets, roads, boulevards, and avenues that are all named Owasso. I live on Owasso Street, which connects to North Owasso Blvd, which connects to South Owasso Blvd. However, there is only one Victoria Ave.

    Given that reality, you must either have a database of all road names and look for the closest road (dealing with the house number separately)

    OR

    make a decision ahead of time about what you'll insist on and what you won't.
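
    As a rough sketch of the second option, suppose the decision is: ignore case and periods, treat "Street" and "St" as the same word, but keep "Ave" and "Blvd" distinct. The rule set, the table variable @Samples, and the column names are assumptions made up for illustration, not something from the question. The string is padded with spaces so that only the whole word "street" is replaced.

    ```sql
    -- Hypothetical rule set: case and periods are ignored, "street" = "st",
    -- while "ave" and "blvd" remain different streets.
    DECLARE @Samples TABLE (raw nvarchar(100));

    INSERT INTO @Samples (raw) VALUES
        (N'123 Maple Street'), (N'123 Maple St'), (N'123 maple street'),
        (N'123. maple st'), (N'123 maple ave'), (N'123 maple blvd');

    SELECT raw,
           LTRIM(RTRIM(
               -- Pad with spaces so only the whole word "street" is rewritten.
               REPLACE(N' ' + LOWER(REPLACE(raw, N'.', N'')) + N' ',
                       N' street ', N' st ')
           )) AS canonical
    FROM @Samples;
    ```

    With these rules the first four rows all reduce to "123 maple st", while the "ave" and "blvd" rows stay distinct. A misspelling such as "123 mpale street" is not handled here and would need a fuzzy comparison on top.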

  • 2021-01-01 04:16

    You may want to consider using the Levenshtein Distance algorithm.

    You can create it as a user-defined function in SQL Server, where it will return the number of operations that need to be performed on String_A so that it becomes String_B. You can then compare the result of the Levenshtein Distance function against some fixed threshold, or against some value derived from the length of the strings.

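    You would simply use it as follows (in SQL Server, a scalar user-defined function must be called with its schema name, assumed here to be dbo):

    ... WHERE dbo.LEVENSHTEIN(address_in_db, address_to_search) < 5;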
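
    For reference, here is a minimal sketch of what such a function could look like. It is not the implementation linked further down, and it does not need MIN3. It assumes SQL Server 2008 or later, inputs of at most 255 characters, and a hypothetical table dbo.Addresses(street_address) for the usage query. Each row of the distance table is packed into a varbinary as 2-byte integers.

    ```sql
    CREATE FUNCTION dbo.LEVENSHTEIN (@s nvarchar(255), @t nvarchar(255))
    RETURNS int
    AS
    BEGIN
        IF @s IS NULL OR @t IS NULL RETURN NULL;

        -- Note: LEN ignores trailing spaces, so they do not count as edits.
        DECLARE @sLen int = LEN(@s), @tLen int = LEN(@t);
        DECLARE @i int = 1, @j int = 1;
        DECLARE @best int, @candidate int;
        DECLARE @sChar nchar(1);
        -- @prev holds row i-1, @cur holds row i; entry k sits at bytes 2k+1..2k+2.
        DECLARE @prev varbinary(600) = 0x0000, @cur varbinary(600);

        -- Row 0: turning an empty string into the first j characters of @t.
        WHILE @j <= @tLen
        BEGIN
            SET @prev = @prev + CAST(@j AS binary(2));
            SET @j = @j + 1;
        END;

        SET @best = @tLen;  -- the answer when @s is empty

        WHILE @i <= @sLen
        BEGIN
            SET @sChar = SUBSTRING(@s, @i, 1);
            SET @cur = CAST(@i AS binary(2));  -- column 0: delete i characters
            SET @best = @i;                    -- tracks cur[j-1]
            SET @j = 1;

            WHILE @j <= @tLen
            BEGIN
                -- Insertion: cur[j-1] + 1
                SET @best = @best + 1;

                -- Substitution: prev[j-1] + (0 if the characters match, else 1).
                -- The comparison follows the database collation.
                SET @candidate = CAST(SUBSTRING(@prev, 2 * @j - 1, 2) AS int)
                               + CASE WHEN @sChar = SUBSTRING(@t, @j, 1) THEN 0 ELSE 1 END;
                IF @candidate < @best SET @best = @candidate;

                -- Deletion: prev[j] + 1
                SET @candidate = CAST(SUBSTRING(@prev, 2 * @j + 1, 2) AS int) + 1;
                IF @candidate < @best SET @best = @candidate;

                SET @cur = @cur + CAST(@best AS binary(2));
                SET @j = @j + 1;
            END;

            SET @prev = @cur;
            SET @i = @i + 1;
        END;

        RETURN @best;
    END;
    GO

    -- Usage with a threshold derived from the length of the search string
    -- (dbo.Addresses and street_address are hypothetical names).
    DECLARE @search nvarchar(255) = N'123 Maple Street';

    SELECT a.street_address
    FROM dbo.Addresses AS a
    WHERE dbo.LEVENSHTEIN(a.street_address, @search) <= LEN(@search) / 5;
    ```

    GO is the batch separator used by SSMS and sqlcmd; it is needed here because CREATE FUNCTION has to be the first statement in its batch. Expect this to be slow on large tables, since the function runs once per row.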
    

    As Mark Byers suggested, converting variable terms into canonical form will help if you use Levenshtein Distance.

    Using Full-Text Search may be another option, especially since Levenshtein would normally require a full table scan. This decision may depend on how frequently you intend to do these queries.
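
    One way the two could be combined, as a sketch only: let a full-text predicate select the candidate rows and apply the Levenshtein comparison as a further filter. The table dbo.Addresses, its columns, the index name PK_Addresses, and the catalog name AddressCatalog are all assumptions for illustration, and dbo.LEVENSHTEIN is assumed to already exist.

    ```sql
    -- Assumes a hypothetical table dbo.Addresses(address_id, street_address)
    -- with a unique, non-nullable, single-column index named PK_Addresses.
    CREATE FULLTEXT CATALOG AddressCatalog AS DEFAULT;

    CREATE FULLTEXT INDEX ON dbo.Addresses (street_address)
        KEY INDEX PK_Addresses ON AddressCatalog;
    GO

    DECLARE @search nvarchar(255) = N'123 Maple St';

    -- CONTAINS keeps rows that have a word starting with "maple";
    -- only rows that also pass the distance check are returned.
    SELECT a.street_address
    FROM dbo.Addresses AS a
    WHERE CONTAINS(a.street_address, N'"maple*"')
      AND dbo.LEVENSHTEIN(a.street_address, @search) < 5;
    ```

    Keep in mind that full-text search matches words and prefixes, not misspellings, so a row containing "mpale" would be missed by the CONTAINS predicate. The index is also populated in the background, so a query run immediately after creating it may return fewer rows than expected.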

    You may want to check out the following Levenshtein Distance implementation for SQL Server:

    • Levenshtein Distance Algorithm: TSQL Implementation

    Note: The linked implementation calls a MIN3 function, which you would need to create yourself. You can use the following:

    CREATE FUNCTION dbo.MIN3(@a int, @b int, @c int)
    RETURNS int
    AS
    BEGIN
        -- Start with the first value and lower it whenever a smaller one is found.
        DECLARE @m int;
        SET @m = @a;

        IF @b < @m SET @m = @b;
        IF @c < @m SET @m = @c;

        RETURN @m;
    END
    

    You may also be interested in checking out the following articles:

    • Address Geocoding with Fuzzy String Matching [Uses Levenshtein Distance]
    • Stack Overflow - Strategies for finding duplicate mailing addresses
    • Merge/Purge and Duplicate Detection