Finding duplicates in a List ignoring a field

Submitted by 房东的猫 on 2019-12-20 03:37:08

Question


I've got a List of Persons and I want to find duplicate entries, considering all fields except id. So I can't use the equals() method (and, in consequence, List.contains()), because they take id into consideration.

public class Person {
    private String firstname, lastname;
    private int age;
    private long id;
}

Modifying the equals() and hashCode() methods to ignore the id field is not an option because other parts of the code rely on them.

What's the most efficient way in Java to sort out the duplicates if I want to ignore the id field?


Answer 1:


Build a Comparator<Person> to implement your natural-key ordering and then use a binary-search based deduplication. TreeSet will give you this ability out of the box.

Note that Comparator<T>.compare(a, b) must fulfil the usual antisymmetry, transitivity, consistency and reflexivity requirements or the binary-search ordering will fail. You should also make it null-aware (e.g. if the firstname field of one, the other, or both Persons is null).

A simple natural-key comparator for your Person class is as follows (it is a static member class as you haven't shown if you have accessors for each field).

import java.util.Comparator;

public class Person {
    public static class NkComparator implements Comparator<Person>
    {
        public int compare(Person p1, Person p2)
        {
            if (p1 == null || p2 == null) throw new NullPointerException();
            if (p1 == p2) return 0;
            int i = nullSafeCompareTo(p1.firstname, p2.firstname);
            if (i != 0) return i;
            i = nullSafeCompareTo(p1.lastname, p2.lastname);
            if (i != 0) return i;
            return Integer.compare(p1.age, p2.age); // safer than subtraction, which can overflow
        }
        private static int nullSafeCompareTo(String s1, String s2)
        {
            return (s1 == null)
                    ? (s2 == null) ? 0 : -1
                    : (s2 == null) ? 1 : s1.compareTo(s2);
        }
    }
    private String firstname, lastname;
    private int age;
    private long id;
}

You can then use it to generate a unique list. Use the add method which returns true if and only if the element didn't already exist in the set:

List<Person> newList = new ArrayList<Person>();
TreeSet<Person> nkIndex = new TreeSet<Person>(new Person.NkComparator());
for (Person p : originalList)
    if (nkIndex.add(p)) newList.add(p); // to generate a unique list

or swap the final line for this line to output duplicates instead

    if (!nkIndex.add(p)) newList.add(p); // to collect the duplicates instead

Whatever you do, don't call remove on your original list while you are enumerating it; that's why these snippets add the unique elements to a new list.

If you are just interested in a unique list, and want to use as few lines as possible:

TreeSet<Person> set = new TreeSet<Person>(new Person.NkComparator());
set.addAll(originalList);
List<Person> newList = new ArrayList<Person>(set);
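On Java 8 and later, the same TreeSet-based deduplication can also be expressed with a stream collector (a minimal sketch reusing the NkComparator above and java.util.stream.Collectors; as with the addAll version, the result comes back in comparator order rather than the original order):

List<Person> newList = new ArrayList<>(
        originalList.stream()
                .collect(Collectors.toCollection(
                        () -> new TreeSet<>(new Person.NkComparator()))));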



Answer 2:


I would advise against using a Comparator to do this. It is quite difficult to write a legal compare() method based on the other fields.

I think a better solution would be to create a class PersonWithoutId like so:

public class PersonWithoutId {
  private String firstname, lastname;
  private int age;
  // no id field
  public PersonWithoutId(Person original) { /* copy fields from Person */ }
  @Override public boolean equals(Object o) { /* compare these 3 fields */ }
  @Override public int hashCode() { /* hash these 3 fields */ }
}
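For completeness, here is one way the placeholders might be filled in, using java.util.Objects for null-safe field comparisons. This is only a sketch: the getter names on Person are assumptions, since the original class does not show its accessors.

import java.util.Objects;

public class PersonWithoutId {
    private String firstname, lastname;
    private int age;

    public PersonWithoutId(Person original) {
        this.firstname = original.getFirstname(); // hypothetical getters
        this.lastname = original.getLastname();
        this.age = original.getAge();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PersonWithoutId)) return false;
        PersonWithoutId other = (PersonWithoutId) o;
        return age == other.age
                && Objects.equals(firstname, other.firstname)
                && Objects.equals(lastname, other.lastname);
    }

    @Override
    public int hashCode() {
        return Objects.hash(firstname, lastname, age);
    }
}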

Then, given a List<Person> called people you can do this:

Set<PersonWithoutId> set = new HashSet<>();
for (Iterator<Person> i = people.iterator(); i.hasNext();)
    if (!set.add(new PersonWithoutId(i.next())))  // add() returns false for a duplicate...
        i.remove();                               // ...so drop it from the original list

Edit

As others have pointed out in the comments, this solution is not ideal as it creates a load of objects for the garbage collector to deal with. But it is much faster than a solution using a Comparator and a TreeSet: keeping a Set ordered takes time, and ordering has nothing to do with the original problem. I tested this on Lists of 1,000,000 instances of Person constructed using

new Person(
    "" + rand.nextInt(500),  // firstname 
    "" + rand.nextInt(500),  // lastname
    rand.nextInt(100),       // age
    rand.nextLong())         // id

and found this solution to be roughly twice as fast as a solution using a TreeSet. (Admittedly I used System.nanoTime() rather than proper benchmarking).

So how can you do this efficiently without creating loads of unnecessary objects? Java doesn't make it easy. One way would be to write two new methods in Person

boolean equalsIgnoringId(Person other) { ... }

int hashCodeIgnoringId() { ... }
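A minimal sketch of what those two methods could look like inside Person, assuming the three fields shown earlier (Objects here is java.util.Objects):

boolean equalsIgnoringId(Person other) {
    return other != null
            && age == other.age
            && Objects.equals(firstname, other.firstname)
            && Objects.equals(lastname, other.lastname);
}

int hashCodeIgnoringId() {
    return Objects.hash(firstname, lastname, age);
}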

and then to write a custom implementation of Set<Person> where you basically cut and paste the code of HashSet, except that you replace equals() and hashCode() with equalsIgnoringId() and hashCodeIgnoringId().

In my humble opinion, the fact that you can create a TreeSet that uses a Comparator, but not a HashSet that uses custom versions of equals/hashCode is a serious flaw in the language.




Answer 3:


As @LuiggiMendoza suggested in the comments:

You could create a custom Comparator class that compares two Person objects for equality, ignoring their ids.

class PersonComparator implements Comparator<Person> {

    // wraps the compareTo method to compare two Strings but also accounts for NPE
    int compareStrings(String a, String b) {
        if(a == b) {           // both strings are the same string or are null
          return 0;
        } else if(a == null) { // first string is null, result is negative
            return -1;
        } else if(b == null){  // second string is null, result is positive
            return 1;
        } else {               // neither string is null, return the result of compareTo
            return a.compareTo(b);
        }
    }

    @Override
    public int compare(Person p1, Person p2) {

        // comparisons on Person objects themselves
        if(p1 == p2) {                 // Person 1 and Person 2 are the same Person object
            return 0;
        }
        if(p1 == null && p2 != null) { // Person 1 is null and Person 2 is not, result is negative
            return -1;
        }
        if(p1 != null && p2 == null) { // Person 1 is not null and Person 2 is, result is positive
            return 1;
        }

        int result = 0;

        // comparisons on the attributes of the Persons objects
        result = compareStrings(p1.firstname, p2.firstname);
        if(result != 0) {   // Persons differ in first names, we can return the result
            return result;
        }
        result = compareStrings(p1.lastname, p2.lastname);
        if(result != 0) {  // Persons differ in last names, we can return the result
            return result;
        }

        return Integer.compare(p1.age, p2.age); // if both first name and last names are equal, the comparison difference is in their age
    }
}
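On Java 8+, a roughly equivalent null-safe ordering can also be assembled from the built-in Comparator combinators. This is only a sketch: it assumes getter methods such as getFirstname(), which the original class may not expose, and unlike the hand-written version it does not guard against null Person references themselves.

Comparator<Person> personComparator =
        Comparator.comparing(Person::getFirstname,
                        Comparator.nullsFirst(Comparator.<String>naturalOrder()))
                .thenComparing(Person::getLastname,
                        Comparator.nullsFirst(Comparator.<String>naturalOrder()))
                .thenComparingInt(Person::getAge);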

Now you can use the TreeSet structure with this custom Comparator and, for example, make a simple method that eliminates the duplicate values.

List<Person> getListWithoutDups(List<Person> list) {
    List<Person> newList = new ArrayList<Person>();
    TreeSet<Person> set = new TreeSet<Person>(new PersonComparator()); // use custom Comparator here

    // foreach Person in the list
    for(Person person : list) {
        // if the person isn't already in the set (meaning it's not a duplicate)
        // add it to the set and the new list
        if(!set.contains(person)) {
            set.add(person);
            newList.add(person);
        }
        // otherwise it's a duplicate so we don't do anything
    }

    return newList;
}
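Since TreeSet.add returns false when an equal element is already present, the contains/add pair inside the loop can also be collapsed into a single call:

        if(set.add(person)) {
            newList.add(person);
        }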

The contains operation in the TreeSet, as the documentation says, "provides guaranteed log(n) time cost".

The method suggested above takes O(n log n) time, since we perform a contains operation for each list element, but it also uses O(n) extra space for the new list and the TreeSet.

If your list is quite large (so space matters) but processing speed isn't an issue, then instead of adding each non-duplicate to a new list you could remove each duplicate that is found:

 List<Person> getListWithoutDups(List<Person> list) {
    TreeSet<Person> set = new TreeSet<Person>(new PersonComparator()); // use custom Comparator here
    Person person;
    // for every Person in the list
    for(int i = 0; i < list.size(); i++) {
        person = list.get(i);
        // if the person is already in the set (meaning it is a duplicate)
        // remove it from the list
        if(set.contains(person)) {
            list.remove(i);
            i--; // make sure to accommodate for the list shifting after removal
        } 
        // otherwise add it to the set of non-duplicates
        else {
            set.add(person);
        }
    }
    return list;
}

Since each remove operation on a list takes O(n) time (the elements after the removed index are shifted down) and each contains operation takes O(log n) time, this approach is O(n^2) in time overall: the removals dominate the O(n log n) cost of the lookups.

However, the extra space needed is roughly halved, since only the TreeSet is created and not a second list.




Answer 4:


You can use a Java HashMap with <K,V> pairs, i.e. Map<K,V> map = new HashMap<K,V>(), together with some form of key or Comparator implementation to go with it. If a check with the containsKey or containsValue methods shows you already have something (i.e. you are trying to add a duplicate), keep that element with the others from your original list; otherwise pop it out. In this way you will end up with a list of the elements that were duplicates in your original list. TreeSet<> would be another option, but I haven't used it yet so cannot offer advice.
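A rough sketch of that map idea, keying each Person on everything except its id. The string key and the getter names are assumptions for illustration; a dedicated key class (as in Answer 2) is more robust than string concatenation.

Map<String, Person> seen = new LinkedHashMap<>();
List<Person> duplicates = new ArrayList<>();
for (Person p : originalList) {
    // hypothetical getters; the key deliberately ignores id
    String key = p.getFirstname() + "|" + p.getLastname() + "|" + p.getAge();
    if (seen.containsKey(key)) {
        duplicates.add(p);    // already seen: this element is a duplicate
    } else {
        seen.put(key, p);     // first occurrence
    }
}
List<Person> uniques = new ArrayList<>(seen.values());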



Source: https://stackoverflow.com/questions/27926111/finding-duplicates-in-a-list-ignoring-a-field
