Finding duplicates in a List ignoring a field

我寻月下人不归 2021-01-22 12:38

I've got a List of Persons and I want to find duplicate entries, considering all fields except id. So using the equals()-method (and in

4 Answers
  • 2021-01-22 13:07

    I would advise against using a Comparator to do this. It is quite difficult to write a legal compare() method based on the other fields.

    I think a better solution would be to create a class PersonWithoutId like so:

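    import java.util.Objects;

    public class PersonWithoutId {
      private final String firstname, lastname;
      private final int age;
      // no id field
      public PersonWithoutId(Person original) { // copy all but id; assumes Person's fields are accessible
        firstname = original.firstname; lastname = original.lastname; age = original.age;
      }
      @Override public boolean equals(Object o) { // compare these 3 fields
        if (!(o instanceof PersonWithoutId)) return false;
        PersonWithoutId p = (PersonWithoutId) o;
        return age == p.age && Objects.equals(firstname, p.firstname) && Objects.equals(lastname, p.lastname);
      }
      @Override public int hashCode() { return Objects.hash(firstname, lastname, age); } // hash these 3 fields
    }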
    

    Then, given a List<Person> called people you can do this:

    Set<PersonWithoutId> set = new HashSet<>();
    for (Iterator<Person> i = people.iterator(); i.hasNext();) 
        if (!set.add(new PersonWithoutId(i.next())))
            i.remove();
    

    Edit

    As others have pointed out in the comments, this solution is not ideal as it creates a load of objects for the garbage collector to deal with. But this solution is much faster than a solution using a Comparator and a TreeSet. Keeping a Set in order takes time and it has nothing to do with the original problem. I tested this on Lists of 1,000,000 instances of Person constructed using

    new Person(
        "" + rand.nextInt(500),  // firstname 
        "" + rand.nextInt(500),  // lastname
        rand.nextInt(100),       // age
        rand.nextLong())         // id
    

    and found this solution to be roughly twice as fast as a solution using a TreeSet. (Admittedly I used System.nanoTime() rather than proper benchmarking).

    So how can you do this efficiently without creating loads of unnecessary objects? Java doesn't make it easy. One way would be to write two new methods in Person

    boolean equalsIgnoringId(Person other) { ... }
    
    int hashCodeIgnoringId() { ... }
    

    and then to write a custom implementation of Set<Person> where you basically cut and paste the code for HashSet, except that you replace equals() and hashCode() with equalsIgnoringId() and hashCodeIgnoringId().
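
    A minimal sketch of the two methods, assuming a Person with the four fields used in this thread. The IgnoringIdDedup helper is hypothetical and only stands in for a full custom Set: it buckets people by hashCodeIgnoringId() and compares within a bucket using equalsIgnoringId(). It avoids one wrapper object per Person, but it still allocates a small list per bucket, so it is not allocation-free.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical Person with the four fields mentioned in this thread.
class Person {
    final String firstname, lastname;
    final int age;
    final long id;

    Person(String firstname, String lastname, int age, long id) {
        this.firstname = firstname;
        this.lastname = lastname;
        this.age = age;
        this.id = id;
    }

    // Equality on every field except id.
    boolean equalsIgnoringId(Person other) {
        return other != null
                && age == other.age
                && Objects.equals(firstname, other.firstname)
                && Objects.equals(lastname, other.lastname);
    }

    // Hash that is consistent with equalsIgnoringId: id is left out.
    int hashCodeIgnoringId() {
        return Objects.hash(firstname, lastname, age);
    }
}

// Hypothetical helper, not a real Set implementation.
class IgnoringIdDedup {
    // Keeps the first occurrence of each person, comparing without id.
    static List<Person> withoutDuplicates(List<Person> people) {
        Map<Integer, List<Person>> buckets = new HashMap<>();
        List<Person> unique = new ArrayList<>();
        for (Person p : people) {
            List<Person> bucket =
                    buckets.computeIfAbsent(p.hashCodeIgnoringId(), k -> new ArrayList<>());
            boolean seen = false;
            for (Person q : bucket) {
                if (q.equalsIgnoringId(p)) {
                    seen = true;
                    break;
                }
            }
            if (!seen) {
                bucket.add(p);
                unique.add(p);
            }
        }
        return unique;
    }
}
```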

    In my humble opinion, the fact that you can create a TreeSet that uses a Comparator, but not a HashSet that uses custom versions of equals/hashCode is a serious flaw in the language.

  • You can use a Java HashMap of <K,V> pairs: Map<K,V> map = new HashMap<K,V>(), together with some form of Comparator implementation. If a check with containsKey or containsValue shows that you already have an element (i.e. you are trying to add a duplicate), keep it in your original list; otherwise, pop it out. This way you end up with a list containing only the elements that were duplicates in your original list. TreeSet<> would be another option, but I haven't used it yet, so I cannot offer advice.
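
    One way this idea could look in code, as a sketch rather than a definitive implementation. The Person class and the DuplicateFinder helper are assumptions for illustration. Instead of a Comparator, the map key is a List of the non-id fields, since List.equals and List.hashCode compare element by element. The sketch collects duplicates into a new list rather than popping elements out of the original.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Person with the four fields mentioned in this thread.
class Person {
    final String firstname, lastname;
    final int age;
    final long id;

    Person(String firstname, String lastname, int age, long id) {
        this.firstname = firstname;
        this.lastname = lastname;
        this.age = age;
        this.id = id;
    }
}

// Hypothetical helper class.
class DuplicateFinder {
    // Returns every person whose non-id fields were already seen earlier in the list.
    static List<Person> findDuplicates(List<Person> people) {
        Map<List<Object>, Person> firstSeen = new HashMap<>();
        List<Person> duplicates = new ArrayList<>();
        for (Person p : people) {
            // The key holds every field except id (null names are allowed).
            List<Object> key = Arrays.<Object>asList(p.firstname, p.lastname, p.age);
            if (firstSeen.containsKey(key)) {
                duplicates.add(p);      // an equal entry exists already: p is a duplicate
            } else {
                firstSeen.put(key, p);  // first occurrence of this combination
            }
        }
        return duplicates;
    }
}
```

    Note that only the later occurrences are returned; the first Person of each group stays in the map as firstSeen.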

  • 2021-01-22 13:20

    As @LuiggiMendoza suggested in the comments:

    You could create a custom Comparator class that compares two Person objects for equality, ignoring their ids.

    class PersonComparator implements Comparator<Person> {
    
        // wraps the compareTo method to compare two Strings but also accounts for NPE
        int compareStrings(String a, String b) {
            if(a == b) {           // both strings are the same string or are null
              return 0;
            } else if(a == null) { // first string is null, result is negative
                return -1;
            } else if(b == null){  // second string is null, result is positive
                return 1;
            } else {               // no strings are null, return the result of compareTo
                return a.compareTo(b);
            }
        }
    
        @Override
        public int compare(Person p1, Person p2) {
    
            // comparisons on Person objects themselves
            if(p1 == p2) {                 // Person 1 and Person 2 are the same Person object
                return 0;
            }
            if(p1 == null && p2 != null) { // Person 1 is null and Person 2 is not, result is negative
                return -1;
            }
            if(p1 != null && p2 == null) { // Person 1 is not null and Person 2 is, result is positive
                return 1;
            }
    
            int result = 0;
    
            // comparisons on the attributes of the Persons objects
            result = compareStrings(p1.firstname, p2.firstname);
            if(result != 0) {   // Persons differ in first names, we can return the result
                return result;
            }
            result = compareStrings(p1.lastname, p2.lastname);
            if(result != 0) {  // Persons differ in last names, we can return the result
                return result;
            }
    
            return Integer.compare(p1.age, p2.age); // if both first name and last names are equal, the comparison difference is in their age
        }
    }
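
    On Java 8 or later, roughly the same ordering can be sketched with the Comparator factory methods. This is an alternative to the hand-written class above, not a replacement the original answer proposed; the Person class and the PersonComparators holder are assumptions for illustration. Null names sort first, as in compareStrings, but a null Person is not handled here.

```java
import java.util.Comparator;

// Hypothetical Person with the four fields mentioned in this thread.
class Person {
    final String firstname, lastname;
    final int age;
    final long id;

    Person(String firstname, String lastname, int age, long id) {
        this.firstname = firstname;
        this.lastname = lastname;
        this.age = age;
        this.id = id;
    }
}

// Hypothetical holder class for the comparator.
class PersonComparators {
    // Natural String order, with null treated as smaller than any non-null value.
    private static final Comparator<String> NULL_SAFE =
            Comparator.nullsFirst(Comparator.naturalOrder());

    // Orders by firstname, then lastname, then age; id is ignored.
    static final Comparator<Person> IGNORING_ID =
            Comparator.comparing((Person p) -> p.firstname, NULL_SAFE)
                    .thenComparing(p -> p.lastname, NULL_SAFE)
                    .thenComparingInt(p -> p.age);
}
```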
    

    Now you can use the TreeSet structure with this custom Comparator and, for example, make a simple method that eliminates the duplicate values.

    List<Person> getListWithoutDups(List<Person> list) {
        List<Person> newList = new ArrayList<Person>();
        TreeSet<Person> set = new TreeSet<Person>(new PersonComparator()); // use custom Comparator here
    
        // foreach Person in the list
        for(Person person : list) {
            // if the person isn't already in the set (meaning it's not a duplicate)
            // add it to the set and the new list
            if(!set.contains(person)) {
                set.add(person);
                newList.add(person);
            }
            // otherwise it's a duplicate so we don't do anything
        }
    
        return newList;
    }
    

    The contains operation in the TreeSet, as the documentation says, "provides guaranteed log(n) time cost".

    The method I suggested above takes O(n*log(n)) time, since we perform the contains operation on each list element, but it also uses O(n) space for the new list and the TreeSet.

    If your list is quite large (so space matters) but processing speed isn't an issue, then instead of adding each non-duplicate to a new list, you could remove each duplicate that is found:

     List<Person> getListWithoutDups(List<Person> list) {
        TreeSet<Person> set = new TreeSet<Person>(new PersonComparator()); // use custom Comparator here
        Person person;
        // for every Person in the list
        for(int i = 0; i < list.size(); i++) {
            person = list.get(i);
            // if the person is already in the set (meaning it is a duplicate)
            // remove it from the list
            if(set.contains(person)) {
                list.remove(i);
                i--; // make sure to accommodate for the list shifting after removal
            } 
            // otherwise add it to the set of non-duplicates
            else {
                set.add(person);
            }
        }
        return list;
    }
    

    Since each remove operation on a list takes O(n) time (because the list gets shifted each time an element is deleted) and each contains operation takes O(log(n)) time, a single iteration costs at most O(n + log(n)) = O(n), so this approach is O(n^2) in time in the worst case.

    However, the extra space used would be roughly halved (it is still O(n)), since we only create the TreeSet and not the second list.
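
    If the shifting cost of repeated remove(i) calls is a concern, one alternative (not part of the approach above) is Collection.removeIf, available since Java 8. The sketch below assumes a hypothetical InPlaceDedup helper and takes the comparator as a parameter. It relies on the predicate being evaluated once per element in list order, which is how ArrayList behaves in practice; treat that as an assumption for other List implementations.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical Person with the four fields mentioned in this thread.
class Person {
    final String firstname, lastname;
    final int age;
    final long id;

    Person(String firstname, String lastname, int age, long id) {
        this.firstname = firstname;
        this.lastname = lastname;
        this.age = age;
        this.id = id;
    }
}

// Hypothetical helper class.
class InPlaceDedup {
    // Removes later duplicates from the list itself and returns that same list.
    static List<Person> removeDuplicatesInPlace(List<Person> list, Comparator<Person> ignoringId) {
        Set<Person> seen = new TreeSet<>(ignoringId);
        // Set.add returns false when an element equal under the comparator is already present,
        // so the predicate is true exactly for duplicates.
        list.removeIf(p -> !seen.add(p));
        return list;
    }
}
```

    The list must support removal (for example an ArrayList); a fixed-size list such as the one returned by Arrays.asList would throw UnsupportedOperationException.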

  • 2021-01-22 13:31

    Build a Comparator<Person> to implement your natural-key ordering and then use a binary-search based deduplication. TreeSet will give you this ability out of the box.

    Note that Comparator<T>.compare(a, b) must fulfil the usual antisymmetry, transitivity, consistency and reflexivity requirements or the binary search ordering will fail. You should also make it null-aware (e.g. if the firstname field of one, other or both are null).
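
    To make the antisymmetry requirement concrete, here is a small sketch with a hypothetical ComparatorContract helper. It checks sgn(compare(a, b)) == -sgn(compare(b, a)) for a single pair and shows that comparing ints by subtraction can violate it when the subtraction overflows. For realistic ages this cannot happen, so it is an illustration of the rule rather than a bug report.

```java
import java.util.Comparator;

// Hypothetical helper for illustrating the Comparator contract.
class ComparatorContract {
    // Subtraction-based comparison: compact, but overflows when the operands are far apart.
    static final Comparator<Integer> BY_SUBTRACTION = (a, b) -> a - b;

    // Overflow-free comparison.
    static final Comparator<Integer> BY_COMPARE = Integer::compare;

    // Checks sgn(compare(a, b)) == -sgn(compare(b, a)) for one pair of values.
    static <T> boolean isAntisymmetric(Comparator<T> c, T a, T b) {
        return Integer.signum(c.compare(a, b)) == -Integer.signum(c.compare(b, a));
    }
}
```

    With these definitions, isAntisymmetric(BY_SUBTRACTION, Integer.MIN_VALUE, 0) is false, because 0 - Integer.MIN_VALUE overflows back to Integer.MIN_VALUE, while the same check with BY_COMPARE is true.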

    A simple natural-key comparator for your Person class is as follows (it is a static member class as you haven't shown if you have accessors for each field).

    public class Person {
        public static class NkComparator implements Comparator<Person>
        {
            public int compare(Person p1, Person p2)
            {
                if (p1 == null || p2 == null) throw new NullPointerException();
                if (p1 == p2) return 0;
                int i = nullSafeCompareTo(p1.firstname, p2.firstname);
                if (i != 0) return i;
                i = nullSafeCompareTo(p1.lastname, p2.lastname);
                if (i != 0) return i;
                return Integer.compare(p1.age, p2.age); // avoids the overflow that subtraction can cause
            }
            private static int nullSafeCompareTo(String s1, String s2)
            {
                return (s1 == null)
                        ? (s2 == null) ? 0 : -1
                        : (s2 == null) ? 1 : s1.compareTo(s2);
            }
        }
        private String firstname, lastname;
        private int age;
        private long id;
    }
    

    You can then use it to generate a unique list. Use the add method which returns true if and only if the element didn't already exist in the set:

    List<Person> newList = new ArrayList<Person>();
    TreeSet<Person> nkIndex = new TreeSet<Person>(new Person.NkComparator());
    for (Person p : originalList)
        if (nkIndex.add(p)) newList.add(p); // to generate a unique list
    

    or swap the final line for this line to output duplicates instead

        if (!nkIndex.add(p)) newList.add(p); // to collect the duplicates
    

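    Whatever you do, don't call remove on your original list while you are enumerating it with a for-each loop; on the standard collections this typically throws a ConcurrentModificationException. That's why these methods add your unique elements to a new list.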

    If you are just interested in a unique list, and want to use as few lines as possible (note that the result comes out in comparator order, not in the original list order):

    TreeSet<Person> set = new TreeSet<Person>(new Person.NkComparator());
    set.addAll(originalList);
    List<Person> newList = new ArrayList<Person>(set);
    