Efficient way to get middle (median) of an std::set?

std::set is a sorted tree. It provides begin and end methods so I can get minimum and maximum and lower_bound and u


                      
5 Answers

盖世英雄少女心 (2021-01-08 00:03)

It's going to be O(size) to get the middle of a binary search tree, because std::set iterators are bidirectional and can only be advanced one element at a time. You can get it with std::advance() as follows:

std::set<int>::iterator it = s.begin();
std::advance(it, s.size() / 2);
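
As a minimal sketch, the same idea can be wrapped in a reusable function. The names `middle` and `middle_value` are hypothetical (they do not come from the answer), and C++11 or later is assumed:

```cpp
#include <cassert>
#include <iterator>
#include <set>

// Hypothetical helper: returns an iterator to the element at index size() / 2,
// or end() when the container is empty. For std::set this is O(size), because
// std::next has to walk the bidirectional iterator one node at a time.
template <typename Set>
typename Set::const_iterator middle(const Set& s)
{
    using diff_t = typename Set::difference_type;
    return std::next(s.begin(), static_cast<diff_t>(s.size() / 2));
}

// Hypothetical helper: returns the middle value; the container must not be empty.
template <typename Set>
typename Set::value_type middle_value(const Set& s)
{
    assert(!s.empty());
    return *middle(s);
}
```

For an even number of elements this returns the upper of the two middle elements, since the index is size() / 2.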

                                                                        
                                                        
            
            
              
                
隐瞒了意图╮ (2021-01-08 00:10)

This suggestion (quoted below) sounds like pure magic, and it will not give the expected median if there are duplicate items, because std::set discards duplicates:


  Depending on how often you insert/remove items versus look up the middle/median, a possibly more efficient solution than the obvious one is to keep a persistent iterator to the middle element and update it whenever you insert/delete items from the set. There are a bunch of edge cases which will need handling (odd vs even number of items, removing the middle item, empty set, etc.), but the basic idea would be that when you insert an item that's smaller than the current middle item, your middle iterator may need decrementing, whereas if you insert a larger one, you need to increment. It's the other way around for removals.


Suggestions


First, use a std::multiset instead of a std::set, so that it works correctly when items can be duplicated.
Second, use two multisets to track the smaller portion and the bigger portion, and keep their sizes balanced.


Algorithm

1. Keep the sets balanced, so that size_of_small == size_of_big or size_of_small + 1 == size_of_big

#include <set>

// Move items between the two sets until the size invariant holds again.
void balance(std::multiset<int> &small, std::multiset<int> &big)
{
    while (true)
    {
        const auto ssmall = small.size();
        const auto sbig = big.size();

        if (ssmall == sbig || ssmall + 1 == sbig) break; // balanced

        if (ssmall < sbig)
        {
            // big to small: move the smallest item of big
            auto v = big.begin();
            small.emplace(*v);
            big.erase(v);
        }
        else
        {
            // small to big: move the largest item of small
            auto v = small.end();
            --v;
            big.emplace(*v);
            small.erase(v);
        }
    }
}


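2. If the sets are balanced and not empty, the median item is always the first item in the big set (the upper median when the total count is even)

auto median = big.begin();
if (median != big.end()) // big is empty only when no items have been added
    std::cout << *median << std::endl;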


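3. Take care when adding a new item

auto v = big.begin();
if (v != big.end() && new_item > *v)
    big.emplace(new_item);   // larger than the current median: goes to the big set
else
    small.emplace(new_item); // otherwise it goes to the small set

balance(small, big);         // restore the size invariant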


Complexity explained


Finding the median value is O(1).
Adding a new item takes O(log n).
You can still search for an item in O(log n), but you need to search both sets.
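
The three steps above can be combined into one small class. This is only a sketch under stated assumptions: the class name `RunningMedian` and its member names are hypothetical, only insertion is supported (as in the answer), and the median is the element at sorted index size / 2.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>

// Hypothetical wrapper around the two-multiset idea.
// Invariant: small_.size() == big_.size() or small_.size() + 1 == big_.size(),
// and every item in small_ is <= every item in big_.
class RunningMedian {
public:
    // O(log n): insert into the matching half, then rebalance.
    void add(int x)
    {
        if (!big_.empty() && x > *big_.begin())
            big_.insert(x);
        else
            small_.insert(x);
        rebalance();
    }

    // O(1): the median is the first item of the big set. Requires size() > 0.
    int median() const
    {
        assert(!big_.empty());
        return *big_.begin();
    }

    // O(log n), but both sets have to be searched.
    bool contains(int x) const
    {
        return small_.count(x) > 0 || big_.count(x) > 0;
    }

    std::size_t size() const { return small_.size() + big_.size(); }

private:
    void rebalance()
    {
        while (small_.size() > big_.size()) {
            auto last = std::prev(small_.end()); // largest item of small_
            big_.insert(*last);
            small_.erase(last);
        }
        while (big_.size() > small_.size() + 1) {
            small_.insert(*big_.begin());        // smallest item of big_
            big_.erase(big_.begin());
        }
    }

    std::multiset<int> small_;
    std::multiset<int> big_;
};
```

After adding 1, 2, 3, 3, 3, 3, 3, 3, 3 in that order, median() returns 3, because duplicates are kept.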

                                                                        
                                                        
            
            
              
                
梦毁少年i (2021-01-08 00:12)

Depending on how often you insert/remove items versus look up the middle/median, a possibly more efficient solution than the obvious one is to keep a persistent iterator to the middle element and update it whenever you insert/delete items from the set. There are a bunch of edge cases which will need handling (odd vs even number of items, removing the middle item, empty set, etc.), but the basic idea would be that when you insert an item that's smaller than the current middle item, your middle iterator may need decrementing, whereas if you insert a larger one, you need to increment. It's the other way around for removals.

At lookup time, this is of course O(1), but it also has an essentially O(1) cost at each insertion/deletion, i.e. O(N) after N insertions, which needs to be amortised across a sufficient number of lookups to make it more efficient than brute forcing.
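
A possible sketch of the persistent-iterator idea follows. The class `MedianSet` and its members are hypothetical names, and the sketch assumes the "middle" is the element at index size / 2, matching what std::advance(it, s.size() / 2) would return. It relies on the fact that std::set iterators stay valid when other elements are inserted or erased.

```cpp
#include <cassert>
#include <iterator>
#include <set>

// Hypothetical wrapper that keeps mid_ pointing at index size() / 2.
class MedianSet {
public:
    MedianSet() = default;
    // Copying is disabled: a copied iterator would point into the original set.
    MedianSet(const MedianSet&) = delete;
    MedianSet& operator=(const MedianSet&) = delete;

    void insert(int x)
    {
        if (!data_.insert(x).second) return;       // already present: nothing changes
        const auto n = data_.size();               // size after the insertion
        if (n == 1) { mid_ = data_.begin(); return; }
        if (x < *mid_ && n % 2 == 1) --mid_;       // middle shifted right by one too many
        else if (x > *mid_ && n % 2 == 0) ++mid_;  // target index grew by one
    }

    void erase(int x)
    {
        auto it = data_.find(x);
        if (it == data_.end()) return;             // not present: nothing changes
        const bool was_even = data_.size() % 2 == 0;
        if (it == mid_) {
            // Step off the middle element before it is erased.
            mid_ = was_even ? std::prev(mid_) : std::next(mid_);
        } else if (x < *mid_) {
            if (!was_even) ++mid_;
        } else {
            if (was_even) --mid_;
        }
        data_.erase(it);
    }

    bool empty() const { return data_.empty(); }

    // O(1) lookup. Requires a non-empty set.
    int median() const
    {
        assert(!data_.empty());
        return *mid_;
    }

private:
    std::set<int> data_;
    std::set<int>::iterator mid_ = data_.end();
};
```

The iterator adjustment is a constant number of steps per update, on top of the O(log n) that the set itself needs for each insert or erase.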
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  不思量自难忘°        
                
              
                            
                2021-01-08 00:16
              
            
            
                                                                       
If your data is static, so that you can precompute it and never insert new elements, it's simpler to use a vector, sort it once, and access the median by index in O(1).

std::vector<int> data;
// fill data (it must not be empty)
std::sort(data.begin(), data.end());
auto median = data[data.size() / 2];
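
As a small self-contained sketch of this approach, the sort can be done once up front so that every later lookup is O(1). The class name `StaticMedian` is hypothetical and not part of the answer:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical wrapper for static data: sorts once in the constructor,
// then answers median queries by index.
class StaticMedian {
public:
    explicit StaticMedian(std::vector<int> data) : data_(std::move(data))
    {
        std::sort(data_.begin(), data_.end()); // O(n log n), paid once
    }

    bool empty() const { return data_.empty(); }

    // O(1): element at index size() / 2. Requires non-empty data.
    int median() const
    {
        assert(!data_.empty());
        return data_[data_.size() / 2];
    }

private:
    std::vector<int> data_;
};
```

Unlike std::set, the vector keeps duplicate values, so they count toward the median.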

                                                                        
                                                        
            
            
              
                
無奈伤痛 (2021-01-08 00:28)

Be aware that the std::set does NOT store duplicate values. If you insert the following values {1, 2, 3, 3, 3, 3, 3, 3, 3}, the median you will retrieve is 2.

std::set<int>::iterator it = s.begin();
std::advance(it, s.size() / 2);
int median = *it;


If you want to include duplicates when computing the median, you can use std::multiset (for {1, 2, 3, 3, 3, 3, 3, 3, 3} the median would be 3):

std::multiset<int>::iterator it = s.begin();
std::advance(it, s.size() / 2);
int median = *it;
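
To see the two results side by side, here is a minimal sketch that builds both containers from the same input. The function name `medians` is hypothetical; it returns the std::set result first and the std::multiset result second:

```cpp
#include <cassert>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

// Hypothetical helper: returns {median over distinct values, median over all values}.
// The input must not be empty.
std::pair<int, int> medians(const std::vector<int>& values)
{
    assert(!values.empty());
    const std::set<int> distinct(values.begin(), values.end());  // duplicates dropped
    const std::multiset<int> all(values.begin(), values.end());  // duplicates kept
    const int from_set = *std::next(distinct.begin(), distinct.size() / 2);
    const int from_multiset = *std::next(all.begin(), all.size() / 2);
    return {from_set, from_multiset};
}
```

For the values {1, 2, 3, 3, 3, 3, 3, 3, 3} this gives 2 from the set and 3 from the multiset, matching the numbers quoted in this answer.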


If the only reason you want the data sorted is to get the median, you are better off with a plain old std::vector + std::sort in my opinion.

With a large test sample and multiple iterations, the test completed in 5s with std::vector and std::sort, and in 13 to 15s with either std::set or std::multiset. Your mileage may vary depending on the size of your data and the number of duplicate values you have.
                                                                        
                                                        
            
            
              
                