Question
Background
Over 5300 duplicate rows:
"id","latitude","longitude","country","region","city"
"2143220","41.3513889","68.9444444","KZ","10","Abay"
"2143218","40.8991667","68.5433333","KZ","10","Abay"
"1919381","33.8166667","49.6333333","IR","34","Ab Barik"
"1919377","35.6833333","50.1833333","IR","19","Ab Barik"
"1919432","29.55","55.5122222","IR","29","`Abbasabad"
"1919430","27.4263889","57.5725","IR","29","`Abbasabad"
"1919413","28.0011111","58.9005556","IR","12","`Abbasabad"
"1919435","36.5641667","61.14","IR","30","`Abbasabad"
"1919433","31.8988889","58.9211111","IR","30","`Abbasabad"
"1919422","33.8666667","48.3","IR","23","`Abbasabad"
"1919420","33.4658333","49.6219444","IR","23","`Abbasabad"
"1919438","33.5333333","49.9833333","IR","34","`Abbasabad"
"1919423","33.7619444","49.0747222","IR","24","`Abbasabad"
"1919419","34.2833333","49.2333333","IR","19","`Abbasabad"
"1919439","35.8833333","52.15","IR","35","`Abbasabad"
"1919417","35.9333333","52.95","IR","17","`Abbasabad"
"1919427","35.7341667","51.4377778","IR","26","`Abbasabad"
"1919425","35.1386111","51.6283333","IR","26","`Abbasabad"
"1919713","30.3705556","56.07","IR","29","`Abdolabad"
"1919711","27.9833333","57.7244444","IR","29","`Abdolabad"
"1919716","35.6025","59.2322222","IR","30","`Abdolabad"
"1919714","34.2197222","56.5447222","IR","30","`Abdolabad"
Additional details:
- PostgreSQL 8.4 Database
- Linux
Problem
Some rows are obvious duplicates ("Abay", because the regions match; "Ab Barik", because the two locations are in close proximity); others are less obvious (and might not be duplicates at all):
"1919430","27.4263889","57.5725","IR","29","`Abbasabad"
"1919435","36.5641667","61.14","IR","30","`Abbasabad"
The goal is to eliminate all duplicates.
Questions
Given a table of values such as the above CSV data:
- How would you eliminate duplicates?
- What geo-centric PostgreSQL functions would you use?
- What other criteria would you use to whittle down the duplicates?
Update
Semi-working example code to select duplicate city names within the same country that are in close proximity (within 10 km):
select
c1.country, c1.name, c1.region_id, c2.region_id, c1.latitude_decimal, c1.longitude_decimal, c2.latitude_decimal, c2.longitude_decimal
from
climate.maxmind_city c1,
climate.maxmind_city c2
where
c1.country = 'BE' and
c1.id <> c2.id and
c1.country = c2.country and
c1.name = c2.name and
(c1.latitude_decimal <> c2.latitude_decimal or c1.longitude_decimal <> c2.longitude_decimal) and
earth_distance(
ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 10000 -- earth_distance() returns meters; 10 km = 10000
order by
country, name
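The ll_to_earth and earth_distance functions come from PostgreSQL's cube and earthdistance contrib modules; earth_distance returns meters, which is why a 10 km cutoff needs 10000. The same distance check can be reproduced outside the database with a haversine formula — a minimal Python sketch (spherical-Earth approximation; earthdistance's assumed radius differs slightly):

```python
import math

def earth_distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a spherical Earth of mean
    radius 6371 km -- roughly the model the earthdistance module uses."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))

# One degree of latitude is about 111.2 km:
print(round(earth_distance_m(0.0, 0.0, 1.0, 0.0)))  # -> 111195
```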
Ideas
Two phase approach:
- Eliminate the obvious duplicates (same country, region, and city name) by removing the min(id).
- Eliminate those within close proximity of each other, having the same name and country. This could remove some legitimate cities, but hardly any of consequence.
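The two phases above can be sketched against an in-memory list of rows shaped like the CSV data, (id, lat, lon, country, region, city). This is only an illustration of the logic, not the production SQL; the keep-max(id) rule from phase 1 is applied in both phases, and the 10 km threshold is assumed:

```python
import math

def km_between(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in km (spherical Earth, r = 6371 km).
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

def dedupe(rows, km_threshold=10):
    # rows: (id, lat, lon, country, region, city), mirroring the CSV columns.
    # Phase 1: exact duplicates -- keep max(id) per (country, region, city).
    best = {}
    for r in rows:
        key = (r[3], r[4], r[5])
        if key not in best or r[0] > best[key][0]:
            best[key] = r
    # Phase 2: same country + city name within km_threshold -- visiting rows
    # in descending id order keeps the max(id) of each proximity cluster.
    kept = []
    for r in sorted(best.values(), key=lambda r: -r[0]):
        near_dup = any(
            k[3] == r[3] and k[5] == r[5]
            and km_between(k[1], k[2], r[1], r[2]) <= km_threshold
            for k in kept
        )
        if not near_dup:
            kept.append(r)
    return kept

rows = [
    (1, 41.35, 68.94, "KZ", "10", "Abay"),
    (2, 41.35, 68.94, "KZ", "10", "Abay"),          # exact duplicate of id 1
    (3, 35.734, 51.438, "IR", "26", "Abbasabad"),   # ~0.3 km from id 4
    (4, 35.736, 51.440, "IR", "27", "Abbasabad"),
    (5, 27.426, 57.573, "IR", "29", "Abbasabad"),   # same name, far away
]
print(sorted(r[0] for r in dedupe(rows)))  # -> [2, 4, 5]
```

Phase 1 drops id 1 (exact key match with id 2); phase 2 drops id 3 (same country and name as id 4, within 10 km), while the distant id 5 survives.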
Thank you!
Answer 1:
Finding duplicates is simple:
select
max(id) as this_should_stay,
latitude,
longitude,
country,
region,
city
FROM
your_table
group by
latitude,
longitude,
country,
region,
city
having count(*) > 1;
Adding code to remove duplicates based on this is simple:
delete from your_table where id not in (
select
max(id) as this_should_stay
FROM
your_table
group by
latitude,
longitude,
country,
region,
city
)
Note the lack of HAVING in the delete query: rows whose key is unique are their own max(id), so they are kept automatically.
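The keep-max(id) delete pattern can be tried quickly against SQLite from Python (illustrative only; the question targets PostgreSQL 8.4, but this subquery form is portable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table city (
    id integer primary key, latitude real, longitude real,
    country text, region text, city text)""")
conn.executemany(
    "insert into city values (?,?,?,?,?,?)",
    [
        (1, 41.35, 68.94, "KZ", "10", "Abay"),
        (2, 41.35, 68.94, "KZ", "10", "Abay"),   # exact duplicate of id 1
        (3, 33.82, 49.63, "IR", "34", "Ab Barik"),
    ],
)
# Keep max(id) per full duplicate key; delete everything else.
conn.execute("""delete from city where id not in (
    select max(id) from city
    group by latitude, longitude, country, region, city)""")
remaining = [r[0] for r in conn.execute("select id from city order by id")]
print(remaining)  # -> [2, 3]
```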
Answer 2:
This deletes the second city within close proximity to a city of the same name in the same country:
delete from climate.maxmind_city mc where id in (
select
max(c1.id)
from
climate.maxmind_city c1,
climate.maxmind_city c2
where
c1.id <> c2.id and
c1.country = c2.country and
c1.name = c2.name and
earth_distance(
ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 35000 -- 35 km; earth_distance() returns meters
group by
c1.country, c1.name
order by
c1.country, c1.name
)
Answer 3:
If your data was imported from CSV files via PHP code, you can prevent duplicate entries by adding a check in the PHP code: if the city being inserted already exists, skip the current record and continue with the next one.
Try this if that is how you are importing the data into the database.
Thanks.
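The skip-duplicates-on-import idea can be sketched in Python with SQLite instead of PHP (table layout assumed to match the CSV; the duplicate key here is country + region + city):

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table city (
    id integer primary key, latitude real, longitude real,
    country text, region text, city text)""")

# First two rows share (country, region, city), so the second is skipped.
csv_data = '''"id","latitude","longitude","country","region","city"
"2143220","41.3513889","68.9444444","KZ","10","Abay"
"2143218","40.8991667","68.5433333","KZ","10","Abay"
"1919381","33.8166667","49.6333333","IR","34","Ab Barik"
'''

for row in csv.DictReader(io.StringIO(csv_data)):
    # Skip the record if this (country, region, city) already exists.
    exists = conn.execute(
        "select 1 from city where country=? and region=? and city=?",
        (row["country"], row["region"], row["city"])).fetchone()
    if exists:
        continue
    conn.execute("insert into city values (?,?,?,?,?,?)",
                 (row["id"], row["latitude"], row["longitude"],
                  row["country"], row["region"], row["city"]))

count = conn.execute("select count(*) from city").fetchone()[0]
print(count)  # -> 2
```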
Source: https://stackoverflow.com/questions/5816945/eliminate-duplicate-cities-from-database