Question
I've got a List of Persons and I want to find duplicate entries, considering all fields except id. So using the equals() method (and in consequence List.contains()) is not an option, because they take id into consideration.
public class Person {
private String firstname, lastname;
private int age;
private long id;
}
Modifying the equals() and hashCode() methods to ignore the id field is not an option because other parts of the code rely on them.
What's the most efficient way in Java to sort out the duplicates if I want to ignore the id field?
Answer 1:
Build a Comparator<Person> to implement your natural-key ordering and then use a binary-search based deduplication. TreeSet will give you this ability out of the box.
Note that Comparator<T>.compare(a, b) must fulfil the usual antisymmetry, transitivity, consistency and reflexivity requirements or the binary search ordering will fail. You should also make it null-aware (e.g. if the firstname field of one, other or both are null).
A simple natural-key comparator for your Person class is as follows (it is a static member class as you haven't shown if you have accessors for each field).
public class Person {
public static class NkComparator implements Comparator<Person>
{
public int compare(Person p1, Person p2)
{
if (p1 == null || p2 == null) throw new NullPointerException();
if (p1 == p2) return 0;
int i = nullSafeCompareTo(p1.firstname, p2.firstname);
if (i != 0) return i;
i = nullSafeCompareTo(p1.lastname, p2.lastname);
if (i != 0) return i;
return Integer.compare(p1.age, p2.age); // subtraction could overflow for extreme values
}
private static int nullSafeCompareTo(String s1, String s2)
{
return (s1 == null)
? (s2 == null) ? 0 : -1
: (s2 == null) ? 1 : s1.compareTo(s2);
}
}
private String firstname, lastname;
private int age;
private long id;
}
You can then use it to generate a unique list. Use the add method, which returns true if and only if the element didn't already exist in the set:
List<Person> newList = new ArrayList<Person>();
TreeSet<Person> nkIndex = new TreeSet<Person>(new Person.NkComparator());
for (Person p : originalList)
if (nkIndex.add(p)) newList.add(p); // to generate a unique list
or swap the final line for this line to output the duplicates instead:
if (!nkIndex.add(p)) newList.add(p); // add returned false, so p is a duplicate
Whatever you do, don't use remove on your original list while you are enumerating it; that's why these methods add your unique elements to a new list.
If you are just interested in a unique list, and want to use as few lines as possible:
TreeSet<Person> set = new TreeSet<Person>(new Person.NkComparator());
set.addAll(originalList);
List<Person> newList = new ArrayList<Person>(set);
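On Java 8 or later, the same natural-key ordering can probably be written more compactly with the Comparator factory methods. The following is only a sketch under assumptions: the getters on Person and the NaturalKeyDedup helper class are hypothetical names that do not appear in the question, and the list is assumed to contain no null elements.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Minimal stand-in for the question's Person; the getters are an assumption.
class Person {
    private final String firstname, lastname;
    private final int age;
    private final long id;

    Person(String firstname, String lastname, int age, long id) {
        this.firstname = firstname;
        this.lastname = lastname;
        this.age = age;
        this.id = id;
    }

    String getFirstname() { return firstname; }
    String getLastname() { return lastname; }
    int getAge() { return age; }
    long getId() { return id; }
}

// Hypothetical helper class, named here for illustration only.
class NaturalKeyDedup {
    // Null-safe String ordering: null sorts before any non-null value.
    static final Comparator<String> NULL_SAFE =
        Comparator.nullsFirst(Comparator.<String>naturalOrder());

    // Orders by firstname, then lastname, then age; id is ignored.
    static final Comparator<Person> NATURAL_KEY =
        Comparator.comparing(Person::getFirstname, NULL_SAFE)
                  .thenComparing(Person::getLastname, NULL_SAFE)
                  .thenComparingInt(Person::getAge);

    // Keeps the first occurrence of each natural key, in original list order.
    static List<Person> unique(List<Person> people) {
        TreeSet<Person> seen = new TreeSet<>(NATURAL_KEY);
        List<Person> result = new ArrayList<>();
        for (Person p : people) {
            if (seen.add(p)) result.add(p);
        }
        return result;
    }

    // Collects every element whose natural key was already seen earlier.
    static List<Person> duplicates(List<Person> people) {
        TreeSet<Person> seen = new TreeSet<>(NATURAL_KEY);
        List<Person> result = new ArrayList<>();
        for (Person p : people) {
            if (!seen.add(p)) result.add(p);
        }
        return result;
    }
}
```

Calling NaturalKeyDedup.unique(originalList) returns the de-duplicated list, while NaturalKeyDedup.duplicates(originalList) returns only the repeated entries.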
Answer 2:
I would advise against using a Comparator to do this. It is quite difficult to write a legal compare() method based on the other fields.
I think a better solution would be to create a class PersonWithoutId like so:
public class PersonWithoutId {
private String firstname, lastname;
private int age;
// no id field
public PersonWithoutId(Person original) { /* copy fields from Person */ }
@Override public boolean equals(Object o) { /* compare these 3 fields */ }
@Override public int hashCode() { /* hash these 3 fields */ }
}
Then, given a List<Person> called people, you can do this:
Set<PersonWithoutId> set = new HashSet<>();
for (Iterator<Person> i = people.iterator(); i.hasNext();)
if (!set.add(new PersonWithoutId(i.next())))
i.remove();
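A fuller version of the wrapper idea might look like the sketch below. It assumes Person exposes getters (the question does not show any), and the WrapperDedup class is a hypothetical name used only for illustration. java.util.Objects handles the null checks for the two String fields.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Minimal stand-in for the question's Person; the getters are an assumption.
class Person {
    private final String firstname, lastname;
    private final int age;
    private final long id;

    Person(String firstname, String lastname, int age, long id) {
        this.firstname = firstname;
        this.lastname = lastname;
        this.age = age;
        this.id = id;
    }

    String getFirstname() { return firstname; }
    String getLastname() { return lastname; }
    int getAge() { return age; }
    long getId() { return id; }
}

// Value object holding every Person field except id.
final class PersonWithoutId {
    private final String firstname, lastname;
    private final int age;

    PersonWithoutId(Person original) {
        this.firstname = original.getFirstname();
        this.lastname = original.getLastname();
        this.age = original.getAge();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PersonWithoutId)) return false;
        PersonWithoutId other = (PersonWithoutId) o;
        return age == other.age
            && Objects.equals(firstname, other.firstname)
            && Objects.equals(lastname, other.lastname);
    }

    @Override
    public int hashCode() {
        return Objects.hash(firstname, lastname, age);
    }
}

// Hypothetical helper class, named here for illustration only.
class WrapperDedup {
    // Removes later duplicates in place; the list must support Iterator.remove().
    static void removeDuplicates(List<Person> people) {
        Set<PersonWithoutId> seen = new HashSet<>();
        for (Iterator<Person> it = people.iterator(); it.hasNext();) {
            if (!seen.add(new PersonWithoutId(it.next()))) it.remove();
        }
    }
}
```

The list passed in has to be modifiable, for example an ArrayList or LinkedList; a fixed-size list such as the one returned by Arrays.asList would throw on it.remove().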
Edit
As others have pointed out in the comments, this solution is not ideal as it creates a load of objects for the garbage collector to deal with. But this solution is much faster than a solution using a Comparator and a TreeSet. Keeping a Set in order takes time and it has nothing to do with the original problem. I tested this on Lists of 1,000,000 instances of Person constructed using
new Person(
"" + rand.nextInt(500), // firstname
"" + rand.nextInt(500), // lastname
rand.nextInt(100), // age
rand.nextLong()) // id
and found this solution to be roughly twice as fast as a solution using a TreeSet. (Admittedly I used System.nanoTime() rather than proper benchmarking.)
So how can you do this efficiently without creating loads of unnecessary objects? Java doesn't make it easy. One way would be to write two new methods in Person
boolean equalsIgnoringId(Person other) { ... }
int hashCodeIgnoringId() { ... }
and then to write a custom implementation of Set<Person> where you basically cut and paste the code for HashSet, except you replace equals() and hashCode() by equalsIgnoringId() and hashCodeIgnoringId().
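The two methods themselves could be filled in roughly as follows. This is only a sketch: the constructor is an assumption, and the custom Set implementation that would call these methods in place of equals() and hashCode() is not shown.

```java
import java.util.Objects;

// Minimal stand-in for the question's Person; the constructor is an assumption.
class Person {
    private final String firstname, lastname;
    private final int age;
    private final long id;

    Person(String firstname, String lastname, int age, long id) {
        this.firstname = firstname;
        this.lastname = lastname;
        this.age = age;
        this.id = id;
    }

    // True when every field except id matches; null-safe for the String fields.
    boolean equalsIgnoringId(Person other) {
        return other != null
            && age == other.age
            && Objects.equals(firstname, other.firstname)
            && Objects.equals(lastname, other.lastname);
    }

    // Hash over the same three fields, so it stays consistent with equalsIgnoringId.
    int hashCodeIgnoringId() {
        return Objects.hash(firstname, lastname, age);
    }
}
```

As with equals() and hashCode(), the important property is that two persons for which equalsIgnoringId returns true also produce the same hashCodeIgnoringId value.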
In my humble opinion, the fact that you can create a TreeSet that uses a Comparator, but not a HashSet that uses custom versions of equals/hashCode, is a serious flaw in the language.
Answer 3:
As @LuiggiMendoza suggested in the comments:
You could create a custom Comparator class that compares two Person objects for equality, ignoring their ids.
class PersonComparator implements Comparator<Person> {
// wraps the compareTo method to compare two Strings but also accounts for NPE
int compareStrings(String a, String b) {
if(a == b) { // both strings are the same string or are null
return 0;
} else if(a == null) { // first string is null, result is negative
return -1;
} else if(b == null){ // second string is null, result is positive
return 1;
} else { // no strings are null, return the result of compareTo
return a.compareTo(b);
}
}
@Override
public int compare(Person p1, Person p2) {
// comparisons on Person objects themselves
if(p1 == p2) { // Person 1 and Person 2 are the same Person object
return 0;
}
if(p1 == null && p2 != null) { // Person 1 is null and Person 2 is not, result is negative
return -1;
}
if(p1 != null && p2 == null) { // Person 1 is not null and Person 2 is, result is positive
return 1;
}
int result = 0;
// comparisons on the attributes of the Persons objects
result = compareStrings(p1.firstname, p2.firstname);
if(result != 0) { // Persons differ in first names, we can return the result
return result;
}
result = compareStrings(p1.lastname, p2.lastname);
if(result != 0) { // Persons differ in last names, we can return the result
return result;
}
return Integer.compare(p1.age, p2.age); // if both first name and last names are equal, the comparison difference is in their age
}
}
Now you can use the TreeSet structure with this custom Comparator and, for example, make a simple method that eliminates the duplicate values.
List<Person> getListWithoutDups(List<Person> list) {
List<Person> newList = new ArrayList<Person>();
TreeSet<Person> set = new TreeSet<Person>(new PersonComparator()); // use custom Comparator here
// foreach Person in the list
for(Person person : list) {
// if the person isn't already in the set (meaning it's not a duplicate)
// add it to the set and the new list
if(!set.contains(person)) {
set.add(person);
newList.add(person);
}
// otherwise it's a duplicate so we don't do anything
}
return newList;
}
The contains operation in the TreeSet, as the documentation says, "provides guaranteed log(n) time cost".
The method I suggested above takes O(n*log(n)) time since we are performing the contains operation on each list element, but it also uses O(n) space for creating a new list and the TreeSet.
If your list is quite large (so space is important) but processing speed isn't an issue, then instead of adding each non-duplicate to a new list, you could remove each duplicate that is found from the original list:
List<Person> getListWithoutDups(List<Person> list) {
TreeSet<Person> set = new TreeSet<Person>(new PersonComparator()); // use custom Comparator here
Person person;
// for every Person in the list
for(int i = 0; i < list.size(); i++) {
person = list.get(i);
// if the person is already in the set (meaning it is a duplicate)
// remove it from the list
if(set.contains(person)) {
list.remove(i);
i--; // make sure to accommodate for the list shifting after removal
}
// otherwise add it to the set of non-duplicates
else {
set.add(person);
}
}
return list;
}
Since each remove operation on a list takes O(n) time (because the list gets shifted each time an element is deleted), and each contains operation takes log(n) time, each iteration costs at most O(n + log(n)), so this approach would be O(n^2) in time in the worst case.
However, the space used would be roughly halved since we only create the TreeSet and not the second list.
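If Java 8 is available, one possible way to keep the in-place behaviour without the repeated O(n) shifting is Collection.removeIf. The sketch below is an assumption-laden illustration rather than part of the original answer: it uses hypothetical getters on Person and a hypothetical InPlaceDedup class, assumes the list has no null elements, and relies on the predicate being evaluated once per element in list order. For an ArrayList the surviving elements are compacted in a single pass, so the total cost should be dominated by the O(n log(n)) TreeSet insertions.

```java
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Minimal stand-in for the question's Person; the getters are an assumption.
class Person {
    private final String firstname, lastname;
    private final int age;
    private final long id;

    Person(String firstname, String lastname, int age, long id) {
        this.firstname = firstname;
        this.lastname = lastname;
        this.age = age;
        this.id = id;
    }

    String getFirstname() { return firstname; }
    String getLastname() { return lastname; }
    int getAge() { return age; }
    long getId() { return id; }
}

// Hypothetical helper class, named here for illustration only.
class InPlaceDedup {
    // Null-safe String ordering: null sorts before any non-null value.
    static final Comparator<String> NULL_SAFE =
        Comparator.nullsFirst(Comparator.<String>naturalOrder());

    // Compares every field except id.
    static final Comparator<Person> IGNORING_ID =
        Comparator.comparing(Person::getFirstname, NULL_SAFE)
                  .thenComparing(Person::getLastname, NULL_SAFE)
                  .thenComparingInt(Person::getAge);

    // Removes later duplicates from the given list; the first occurrence stays.
    static void removeDuplicates(List<Person> list) {
        TreeSet<Person> seen = new TreeSet<>(IGNORING_ID);
        // add returns false when an equal element is already in the set
        list.removeIf(p -> !seen.add(p));
    }
}
```

The list passed to removeDuplicates must be modifiable, for example an ArrayList.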
Answer 4:
You can use a Java HashMap of <K,V> pairs: Map<K,V> map = new HashMap<K,V>(). You will also need some form of Comparator implementation to go with it. If you check with the containsKey or containsValue methods and find out you already have something (i.e. you are trying to add a duplicate), keep it in your original list. Otherwise, pop it out. In this way, you will end up with a list of the elements that were duplicates in your original list. TreeSet would be another option, but I haven't used it yet so cannot offer advice.
Source: https://stackoverflow.com/questions/27926111/finding-duplicates-in-a-list-ignoring-a-field