Addresses stored in SQL server have many small variations(errors)

问题

I have a table in my database which stores packing slips and their information. I'm trying to query that table and get each unique address. I've come close, but I still have many near misses and I'm looking for a way to exclude these near duplicates from my select.

Sample Data

CompanyCode   CompanyName                     Addr1                City         State   Zip
10033         UNITED DIE  CUTTING & FINISHIN  3610 HAMILTON AVE    CLEVELAND    Ohio    44114
10033         UNITED DIE CUTTING & FINISHING  3610 HAMILTON AVE    CLEVELAND    Ohio    44114
10033         UNITED DIE CUTTING & FINISHING  3610 HAMILTON AVE.   CLEVELAND    Ohio    44114
10033         UNITED DIE CUTTING & FINISHING  3610 HAMILTON AVENUE CLEVELAND    Ohio    44114
10033         UNITED DIECUTTING & FINISHING   3610 HAMILTON AVE    CLEVELAND    Ohio    44144
10033         UNITED FINISHING                3610 HAMILTON AVE    CLEVLAND     Ohio    44114
10033         UNITED FINISHING & DIE CUTTING  3610 HAMILTON AVE    CLEVELAND    Ohio    44114

And all I want is 1 record. Is there some way I can get the "Average" record? Meaning, if most of the records say CLEVELAND instead of CLEVLAND, I want my 1 record to say CLEVELAND. Is there any way to par this data down to what I'm looking for?

Desired Output

 CompanyCode   CompanyName                     Addr1                City         State   Zip
 10033         UNITED DIE CUTTING & FINISHING  3610 HAMILTON AVE    CLEVELAND    Ohio    44114

回答1:

You need to use subqueries in the select statement Try this query:

 select CompanyCode,
    (select top 1 CompanyName from Table1 where CompanyCode=X.CompanyCode 
     group by CompanyName order by count(*) desc) CompanyName,
    (select top 1 Addr1 from Table1 where CompanyCode=X.CompanyCode 
     group by Addr1 order by count(*) desc) Addr1,
    (select top 1 City from Table1 where CompanyCode=X.CompanyCode 
     group by City order by count(*) desc) City,
    (select top 1 State from Table1 where CompanyCode=X.CompanyCode 
     group by State order by count(*) desc) State,
    (select top 1 Zip from Table1 where CompanyCode=X.CompanyCode 
     group by Zip order by count(*) desc) Zip
from    Table1 X
group by CompanyCode

回答2:

You are going to struggle. Personally I think I'd consider having a process that tries to update the data in the database and correct it.

You could change the system that inputs the data (or if that's not possible, have an external process that runs over the data once it's in the db) that does something like the following:

Against known lists of things like towns/states/countries etc to catch typos.
For known regular mistakes and abbreviations. E.g. "Avenue"/"Ave." or "Street"/"St." and normalises the values.
Change the input system to do this kind of validation and/or provide the users with an address search/validation UI that allows then to search for an address given some known values (zip/postal code etc). You can buy data like this from various suppliers depending on where you are in the world.

If this all works (I doubt you'll get 100% unless you provide a mechanism whereby those things that can't be auto-corrected are flagged for human intervention), then your reporting is as simple as SELECT DISTINCT...

回答3:

Is it one-time job, I hope? It's impossible unless you can explain (in SQL terms) why first record is what you need. As an temporary solution I'd suggest following query

select C1.* from Company C1, 
(select CompanyCode, min(CompanyName) as CompanyNameSelected 
   from Company
   group by CompanyCode) C2
where 
   C1.CompanyCode = C2.CompanyCode and 
   C1.CompanyName = C2.CompanyNameSelected;

You could use any of aggregation functions instead of min (returning CompanyName of course), or even write your own stored function, but the only thing is needed - you have to explain in query language why record #1 is better than #2.

来源：https://stackoverflow.com/questions/4620396/addresses-stored-in-sql-server-have-many-small-variationserrors

标签

sql

tsql

sql-server-2000