How to efficiently (performance) remove many items from List in Java?

迷失自我 2021-01-31 09:00

I have quite a large List named items (>= 1,000,000 items) and some condition, denoted by <cond>, that selects the items to be deleted and is true for many (maybe hal

12 Answers
  • 2021-01-31 09:28

    Since speed is the most important metric, there's the possibility of using more memory and doing less recreation of lists (as mentioned in my comment). Actual performance impact would be fully dependent on how the functionality is used, though.

    The algorithm assumes that at least one of the following is true:

    • all elements of the original list do not need to be tested. This could happen if we're really looking for the first N elements that match our condition, rather than all elements that match our condition.
    • it's more expensive to copy the list into new memory. This could happen if the original list uses more than 50% of allocated memory, so working in-place could be better or if memory operations turn out to be slower (that would be an unexpected result).
    • the speed penalty of removing elements from the list is too large to accept all at once, but spreading that penalty across multiple operations is acceptable, even if the overall penalty is larger than taking it all at once. This is like taking out a $200K mortgage: paying $1000 per month for 30 years is affordable on a monthly basis and has the benefits of owning a home and equity, even though the overall payment is $360K over the life of the loan.

    Disclaimer: There's prolly syntax errors - I didn't try compiling anything.

    First, subclass the ArrayList

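    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.ListIterator;

    public class ConditionalArrayList extends ArrayList {

      public Iterator iterator(Condition condition)
      {
        return listIterator(condition);
      }

      public ListIterator listIterator(Condition condition)
      {
        // wrap the real, unfiltered iterator of the underlying ArrayList
        return new ConditionalArrayListIterator(super.listIterator(), condition);
      }

      // Unconditional iteration is deliberately disabled. Beware: anything that
      // iterates internally (for-each loops, toString()) will throw as well.
      public ListIterator listIterator(){
        throw new IllegalArgumentException("You must specify a condition for the iterator");
      }
      public Iterator iterator(){
        throw new IllegalArgumentException("You must specify a condition for the iterator");
      }
    }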
    

    Then we need the helper classes:

    import java.util.ListIterator;
    import java.util.NoSuchElementException;

    public class ConditionalArrayListIterator implements ListIterator
    {
      private final ListIterator listIterator;
      private final Condition condition;

      // the two following flags are used as a quick optimization so that
      // we don't repeat tests on known-good elements unnecessarily.
      private boolean nextKnownGood = false;
      private boolean prevKnownGood = false;

      public ConditionalArrayListIterator(ListIterator listIterator, Condition condition)
      {
        this.listIterator = listIterator;
        this.condition = condition;
      }

      // the new element lands just before the cursor and has not been tested
      public void add(Object o){ listIterator.add(o); prevKnownGood = false; }

      /**
       * Note that it is extremely inefficient to
       * call hasNext() and hasPrevious() alternately when
       * there's a bunch of non-matching elements between
       * two matching elements.
       */
      public boolean hasNext()
      {
         if( nextKnownGood ) return true;

         /* find the next object in the list that
          * matches our condition, if any.
          */
         while( listIterator.hasNext() )
         {
           Object next = listIterator.next();
           if( condition.matches(next) ) {
             // step back so that next() returns this element
             listIterator.previous();
             nextKnownGood = true;
             return true;
           }
           // the cursor now sits after a non-matching element
           prevKnownGood = false;
         }

         // no matching element was found.
         return false;
      }

      /**
       *  See hasNext() for efficiency notes.
       *  Mirror image of hasNext().
       */
      public boolean hasPrevious()
      {
         if( prevKnownGood ) return true;

         /* find the previous object in the list that
          * matches our condition, if any.
          */
         while( listIterator.hasPrevious() )
         {
           Object prev = listIterator.previous();
           if( condition.matches(prev) ) {
             // step forward so that previous() returns this element
             listIterator.next();
             prevKnownGood = true;
             return true;
           }
           // the cursor now sits before a non-matching element
           nextKnownGood = false;
         }

         // no matching element was found.
         return false;
      }

      /** see hasNext() for efficiency note **/
      public Object next()
      {
         if( hasNext() )
         {
           // the element we are about to return becomes the "previous" one
           prevKnownGood = true;
           nextKnownGood = false;
           return listIterator.next();
         }

         throw new NoSuchElementException("No more matching elements");
      }

      /** see hasNext() for efficiency note; mirror image of next() **/
      public Object previous()
      {
         if( hasPrevious() )
         {
           nextKnownGood = true;
           prevKnownGood = false;
           return listIterator.previous();
         }
         throw new NoSuchElementException("No more matching elements");
      }

      /**
       * Note that nextIndex() and previousIndex() return the array index
       * of the cursor, not the number of results that this class has returned.
       * If this isn't good for you, just maintain your own current index and
       * increment or decrement it in next() and previous().
       */
      public int nextIndex(){ return listIterator.nextIndex(); }
      public int previousIndex(){ return listIterator.previousIndex(); }

      // remove() and set() act on the element last returned by the wrapped
      // iterator, so call them right after next() or previous(), before
      // hasNext() or hasPrevious() moves the cursor again.
      public void remove(){ listIterator.remove(); nextKnownGood = prevKnownGood = false; }
      public void set(Object o) { listIterator.set(o); nextKnownGood = prevKnownGood = false; }
    }
    

    and, of course, we need the condition interface:

    /** much like a comparator... **/
    public interface Condition
    {
      public boolean matches(Object obj);
    }
    

    And a condition with which to test

    public class IsEvenCondition implements Condition
    {
      public boolean matches(Object obj){ return ((Number) obj).intValue() % 2 == 0; }
    }
    

    and we're finally ready for some test code

    
        Condition condition = new IsEvenCondition();
    
        System.out.println("preparing items");
        long startMillis = System.currentTimeMillis();
        ConditionalArrayList items = new ConditionalArrayList(); // holds Integers for the demo
        for (int i = 0; i < 1000000; i++) {
            items.add(i * 3); // just for demo
        }
        long endMillis = System.currentTimeMillis();
        System.out.println("It took " + (endMillis - startMillis) + " milli(s) to prepare the list.");
    
        System.out.println("deleting items");
        startMillis = System.currentTimeMillis();
        // we don't actually ever remove from this list, so 
        // removeMany is effectively "instantaneous"
        // items = removeMany(items);
        endMillis = System.currentTimeMillis();
        System.out.println("after remove: items.size=" + items.size() + 
                " and it took " + (endMillis - startMillis) + " milli(s)");
        System.out.println("--> NOTE: Nothing is actually removed.  This algorithm uses extra"
                           + " memory to avoid modifying or duplicating the original list.");
    
        System.out.println("About to iterate through the list");
        startMillis = System.currentTimeMillis();
        int count = iterate(items, condition);
        endMillis = System.currentTimeMillis();
        System.out.println("after iteration: items.size=" + items.size() + 
                " count=" + count + " and it took " + (endMillis - startMillis) + " milli(s)");
        System.out.println("--> NOTE: this should be somewhat inefficient,"
                           + " mostly due to overhead of multiple classes."
                           + " This algorithm is designed (hoped) to be faster than"
                           + " an algorithm where all elements of the list are used.");
    
        System.out.println("About to total the first 30 matching elements");
        startMillis = System.currentTimeMillis();
        int total = addFirst(30, items, condition);
        endMillis = System.currentTimeMillis();
        System.out.println("after totalling first 30 elements: total=" + total + 
                " and it took " + (endMillis - startMillis) + " milli(s)");
    
    ...
    
    private int iterate(ConditionalArrayList items, Condition condition)
    {
      // the counter and return value are really to prevent JVM optimization
      // - just to be safe.
      Iterator iter = items.listIterator(condition);
      int i = 0;
      while( iter.hasNext() ){ iter.next(); i++; }
      return i;
    }
    
    private int addFirst(int n, ConditionalArrayList items, Condition condition)
    {
      int total = 0;
      Iterator iter = items.listIterator(condition);
      // stop early if fewer than n elements match
      for(int i=0; i<n && iter.hasNext(); i++)
      {
        total += ((Integer)iter.next()).intValue();
      }
      return total;
    }
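
    For comparison, on Java 8 or later the "first N matching elements" case can probably be covered with the stream API alone, since filter is lazy and limit stops pulling elements once enough matches were found. This is only a sketch under that assumption; FirstMatches and firstMatching are names made up for illustration and are not part of the classes above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class FirstMatches {
    private FirstMatches() {}

    /** Returns up to n elements of items that satisfy keep, in list order. */
    static <T> List<T> firstMatching(List<T> items, Predicate<? super T> keep, int n) {
        return items.stream()
                .filter(keep)   // evaluated lazily, element by element
                .limit(n)       // short-circuits once n matches were found
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // same demo data as the test code above
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            items.add(i * 3);
        }
        List<Integer> firstEven = firstMatching(items, x -> x % 2 == 0, 30);
        int total = firstEven.stream().mapToInt(Integer::intValue).sum();
        System.out.println("total of the first 30 even values: " + total);
    }
}
```

    The original list is left untouched, and elements after the 30th match are never tested.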
    
    
  • 2021-01-31 09:28

    Rather than muddying my first answer, which is already rather long, here's a second, related option: you can create your own ArrayList, and flag things as "removed". This algorithm makes the following assumptions:

    • it's better to waste time (lower speed) during construction than to do the same during the removal operation. In other words, it moves the speed penalty from one location to another.
    • it's better to waste memory now, and time garbage collecting after the result is computed, rather than spend the time up front (you're always stuck with time garbage collecting...).
    • once removal begins, elements will never be added to the list (otherwise there are issues with re-allocating the flags object)

    Also, this is, again, not tested, so there are probably syntax errors.

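    import java.util.ArrayList;

    public class FlaggedList extends ArrayList {
      private static final Boolean IN = Boolean.TRUE;   // not removed
      private static final Boolean OUT = Boolean.FALSE; // removed
      private final ArrayList<Boolean> flags;           // one flag per element
      private int removed = 0;
    
      public FlaggedList(){ this(1000000); }
      public FlaggedList(int estimate){
        super(estimate);
        flags = new ArrayList<Boolean>(estimate);
      }
    
      // pay the bookkeeping cost at construction time: one flag per element.
      // (add(int, Object) and addAll() would need the same treatment.)
      @Override
      public boolean add(Object o){
        flags.add(IN);
        return super.add(o);
      }
    
      // flags the element as removed instead of shifting the backing array
      @Override
      public Object remove(int idx){
        if( flags.set(idx, OUT) ) removed++; // count each element only once
        return get(idx);
      }
    
      public boolean isRemoved(int idx){ return !flags.get(idx); }
    }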
    

    and the iterator - more work may be needed to keep it synchronized, and many methods are left out, this time:

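    import java.util.ListIterator;
    import java.util.NoSuchElementException;

    public class FlaggedListIterator implements ListIterator
    {
      int idx = 0;
    
      public FlaggedList list;
      public FlaggedListIterator(FlaggedList list)
      {
        this.list = list;
      }
      public boolean hasNext() {
        // skip flagged elements and stop on the first live one
        while(idx<list.size() && list.isRemoved(idx)) idx++;
        return idx < list.size();
      }
      public Object next() {
        if( !hasNext() ) throw new NoSuchElementException();
        return list.get(idx++);
      }
      // the remaining ListIterator methods are left out
    }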
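
    If the flagging idea is adopted, a more compact way to hold the flags might be java.util.BitSet, which stores one bit per element instead of one Boolean reference. The following is a sketch under that assumption, not a drop-in replacement for the class above: TombstoneList, markRemoved, liveSize and snapshot are hypothetical names, and the sketch wraps a list rather than subclassing ArrayList.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class TombstoneList<T> {
    private final List<T> items;                 // never shrinks
    private final BitSet removed = new BitSet(); // bit i set => element i is flagged as removed

    TombstoneList(List<T> items) {
        this.items = items;
    }

    /** Flags the element at idx; no shifting of the backing array takes place. */
    void markRemoved(int idx) {
        if (idx < 0 || idx >= items.size()) {
            throw new IndexOutOfBoundsException("index " + idx);
        }
        removed.set(idx); // flagging twice is harmless
    }

    /** Number of elements that are not flagged. */
    int liveSize() {
        return items.size() - removed.cardinality();
    }

    /** Copies the surviving elements out, e.g. once the result is really needed. */
    List<T> snapshot() {
        List<T> live = new ArrayList<>(liveSize());
        // nextClearBit jumps over runs of flagged elements
        for (int i = removed.nextClearBit(0); i < items.size(); i = removed.nextClearBit(i + 1)) {
            live.add(items.get(i));
        }
        return live;
    }
}
```

    As with the flagged list, this only pays off if elements are not added once removal has begun.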
    
  • 2021-01-31 09:32

    I'm sorry, but all these answers are missing the point, I think: You probably don't have to, and probably shouldn't, use a List.

    If this kind of "query" is common, why not build an ordered data structure that eliminates the need to traverse all the data nodes? You don't tell us enough about the problem, but given the example you provide, a simple tree could do the trick. There's an insertion overhead per item, but you can very quickly find the subtree containing the nodes that match <cond>, and you therefore avoid most of the comparisons you're doing now.

    Furthermore:

    • Depending on the exact problem, and the exact data structure you set up, you can speed up deletion -- if the nodes you want to kill do reduce to a subtree or something of the sort, you just drop that subtree, rather than updating a whole slew of list nodes.

    • Each time you remove a list item, you are updating pointers (e.g. lastNode.next and nextNode.prev or something), but if it turns out you also want to remove the nextNode, then the pointer update you just made is thrown away by a new update.
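
    As a rough illustration of the "drop a whole range" idea, and assuming the condition can be expressed as a range over a sort key (the question does not say so), the JDK's TreeSet can discard everything below a threshold through a live view. RangeDelete and removeBelow are hypothetical names, and "value < threshold" is only a stand-in for the real condition.

```java
import java.util.TreeSet;

final class RangeDelete {
    private RangeDelete() {}

    /**
     * Removes every value strictly below threshold and returns how many were removed.
     * Assumes the deletion condition is "value < threshold" (hypothetical).
     */
    static int removeBelow(TreeSet<Integer> sorted, int threshold) {
        int before = sorted.size();
        // headSet returns a live view; clearing it removes those elements from the set itself
        sorted.headSet(threshold, false).clear();
        return before - sorted.size();
    }
}
```

    The elements that stay are never tested individually. Note that a TreeSet collapses duplicates, and that, as far as I know, clearing the view still visits each removed entry, so the saving is in the comparisons rather than in the removal itself.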

  • 2021-01-31 09:33

    Removing a lot of elements from an ArrayList one at a time is an O(n^2) operation, because each removal shifts every element after it. I would recommend simply using a LinkedList, which is better suited to insertion and removal (but not to random access), at the cost of a bit of memory overhead.

    If you do need to keep ArrayList, then you are better off creating a new list.

    Update: Comparing with creating a new list:

    When reusing the same list, the main cost comes from deleting the node and updating the appropriate pointers in the LinkedList. This is a constant-time operation for any node.

    When constructing a new list, the main cost comes from creating the list and initializing the array entries. Both are cheap operations. You might also incur the cost of resizing the new list's backing array, assuming that the final array is larger than half of the incoming array.

    So if you were to remove only one element, then the LinkedList approach is probably faster. If you were to delete all nodes except for one, the new list approach is probably faster.

    There are more complications when you bring in memory management and GC. I'd like to leave these out.

    The best option is to implement the alternatives yourself and benchmark the results when running your typical load.
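
    A crude harness for such a comparison might look like the sketch below. It assumes Java 8 or later; RemovalBench and its methods are made-up names, and the "delete even values" predicate is just a placeholder for your real condition. Timings taken with System.nanoTime in a single run are only indicative; a proper benchmark tool such as JMH would give more trustworthy numbers.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Predicate;

final class RemovalBench {
    private RemovalBench() {}

    /** In-place removal through the iterator; each unlink is O(1) on a LinkedList. */
    static <T> void removeInPlace(List<T> items, Predicate<? super T> delete) {
        for (Iterator<T> it = items.iterator(); it.hasNext(); ) {
            if (delete.test(it.next())) {
                it.remove();
            }
        }
    }

    /** Copies the survivors into a fresh ArrayList and leaves the input untouched. */
    static <T> List<T> copySurvivors(List<T> items, Predicate<? super T> delete) {
        List<T> kept = new ArrayList<>(items.size());
        for (T item : items) {
            if (!delete.test(item)) {
                kept.add(item);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        List<Integer> source = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            source.add(i);
        }
        Predicate<Integer> delete = x -> x % 2 == 0; // placeholder condition

        List<Integer> linked = new LinkedList<>(source);
        long t0 = System.nanoTime();
        removeInPlace(linked, delete);
        long t1 = System.nanoTime();
        List<Integer> copied = copySurvivors(source, delete);
        long t2 = System.nanoTime();

        System.out.println("LinkedList in place: " + (t1 - t0) / 1_000_000 + " ms, size " + linked.size());
        System.out.println("copy to new list:    " + (t2 - t1) / 1_000_000 + " ms, size " + copied.size());
    }
}
```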

  • 2021-01-31 09:34

    I would imagine that building a new list, rather than modifying the existing list, would be more performant, especially when the number of items is as large as you indicate. This assumes your list is an ArrayList, not a LinkedList. For a non-circular LinkedList, insertion at an arbitrary index is O(n), but removal at an existing iterator position is O(1), in which case your naive algorithm should be sufficiently performant.

    Unless the list is a LinkedList, the cost of shifting the list each time you call remove() is likely one of the most expensive parts of the implementation. For array lists, I would consider using:

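    // needs java.util.ArrayList, java.util.Iterator and java.util.List
    public static <T> List<T> removeMany(List<T> items) {
        List<T> newList = new ArrayList<T>(items.size());
        int i = 0; // position counter, used only by the demo condition
        Iterator<T> iter = items.iterator();
        while (iter.hasNext()) {
            T item = iter.next();
            // <cond> goes here: keep the item only when <cond> is false
            if (/*not <cond>: */i++ % 2 != 0) {
                newList.add(item);
            }
        }
        return newList;
    }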
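
    On Java 8 and later, a similar single-pass filter is available in place through Collection.removeIf, which ArrayList (as far as I know) overrides so that the tail is not shifted on every single removal. A minimal sketch, with a hypothetical class name and the even-value test standing in for <cond>:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class RemoveIfDemo {
    private RemoveIfDemo() {}

    /** Deletes every element matching cond from items, in place; returns the number removed. */
    static <T> int removeMany(List<T> items, Predicate<? super T> cond) {
        int before = items.size();
        items.removeIf(cond); // requires a modifiable list
        return before - items.size();
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            items.add(i * 3); // same demo data as in the question
        }
        int removed = removeMany(items, x -> x % 2 == 0);
        System.out.println("removed " + removed + ", remaining " + items.size());
    }
}
```

    Unlike the copying version above, this modifies the caller's list, so it is only appropriate when nobody else relies on the original contents.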
    
  • 2021-01-31 09:35

    Use Apache Commons Collections. Specifically this function. This is implemented in essentially the same way that people are suggesting that you implement it (i.e. create a new list and then add to it).
