Creating random numbers with no duplicates

忘了有多久 2020-11-21 12:00

In this case, the MAX is only 5, so I could check the duplicates one by one, but how could I do this in a simpler way? For example, what if the MAX has a value of 20? Thanks

18 answers
  • 2020-11-21 12:16

    I created a snippet that generates random integers with no duplicates. The advantage of this snippet is that you can also pass it the contents of an array and have it pick random items from that.

    No duplication random generator class

  • 2020-11-21 12:17

    Here's how I'd do it

    import java.util.ArrayList;
    import java.util.Random;
    
    public class Test {
        public static void main(String[] args) {
            int size = 20;
    
            ArrayList<Integer> list = new ArrayList<Integer>(size);
            for(int i = 1; i <= size; i++) {
                list.add(i);
            }
    
            Random rand = new Random();
            while(list.size() > 0) {
                int index = rand.nextInt(list.size());
                System.out.println("Selected: "+list.remove(index));
            }
        }
    }
    

    As the esteemed Mr Skeet has pointed out:
    If n is the number of randomly selected numbers you wish to choose and N is the total sample space of numbers available for selection:

    1. If n << N, you should just store the numbers that you have picked and check a list to see if the number selected is in it.
    2. If n ~= N, you should probably use my method, by populating a list containing the entire sample space and then removing numbers from it as you select them.
  • 2020-11-21 12:20

    You could use one of the classes implementing the Set interface (API), and then each number you generate, use Set.add() to insert it.

    If the return value is false, you know the number has already been generated before.
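
A minimal sketch of this idea, assuming the numbers should come from 1..max inclusive. The class name `UniqueWithSet` and the method `draw` are made up for illustration; only `Set.add()`, `LinkedHashSet` and `Random.nextInt(int)` come from the standard library.

```java
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

public class UniqueWithSet {
    // Hypothetical helper: draws `count` distinct integers from 1..max (inclusive).
    static Set<Integer> draw(int count, int max, Random rng) {
        if (count > max) {
            throw new IllegalArgumentException("Can't draw more numbers than are available");
        }
        // LinkedHashSet keeps the numbers in the order they were drawn.
        Set<Integer> seen = new LinkedHashSet<>();
        while (seen.size() < count) {
            int candidate = rng.nextInt(max) + 1;
            boolean isNew = seen.add(candidate);
            if (!isNew) {
                // add() returned false: this number was generated before, so draw again.
                continue;
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(draw(5, 20, new Random()));
    }
}
```

This stays cheap as long as `count` is small compared with `max`; as the two get closer, more and more draws are rejected as duplicates.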

  • 2020-11-21 12:20

    There is a more efficient and less cumbersome solution for integers than a Collections.shuffle.

    The problem is the same as successively picking items from only the un-picked items in a set and setting them in order somewhere else. This is exactly like randomly dealing cards or drawing winning raffle tickets from a hat or bin.

    This algorithm works for loading any array and achieving a random order at the end of the load. It also works for adding into a List collection (or any other indexed collection) and achieving a random sequence in the collection at the end of the adds.

    It can be done in place with a single array, created once, or with a numerically indexed collection such as a List. For an array, the initial size must be exactly the number of intended values. If you don't know in advance how many values will occur, a numerically indexed collection whose size is not fixed, such as an ArrayList, will also work. The approach works for an array of any size up to Integer.MAX_VALUE, which is just over 2,000,000,000; List objects have the same index limit, although your machine may run out of memory before you reach an array of that size. It may be more efficient to load an array of the target element type and convert it to a collection afterwards, especially if the target collection is not numerically indexed.

    This algorithm, exactly as written, produces a very even distribution with no duplicates. One VERY IMPORTANT aspect is that the next item must be able to land in any location from 0 up to and including the next new location (the current size). Thus the second item can be stored in location 0 or location 1, and the 20th item in any location from 0 through 19. The first item is just as likely to stay in location 0 as to end up in any other location, and each new item is just as likely to go anywhere, including the next new location.

    The randomness of the sequence will be as random as the randomness of the random number generator.

    This algorithm can also be used to load reference types into random locations in an array. Since it works with an array, it also works with collections. That means you don't have to create the collection and then shuffle it, or leave it in whatever order the objects happened to be inserted. The collection need only support inserting an item at any position or appending it.

    // RandomSequence.java
    import java.util.Random;
    public class RandomSequence {
    
        public static void main(String[] args) {
            // create an array of the size and type for which
            // you want a random sequence
            int[] randomSequence = new int[20];
            Random randomNumbers = new Random();
    
            for (int i = 0; i < randomSequence.length; i++ ) {
                if (i == 0) { // seed first entry in array with item 0
                    randomSequence[i] = 0; 
                } else { // for all other items...
                    // choose a random pointer to the segment of the
                    // array already containing items
                    int pointer = randomNumbers.nextInt(i + 1);
                    randomSequence[i] = randomSequence[pointer]; 
                    randomSequence[pointer] = i;
                    // note that if pointer & i are equal
                    // the new value will just go into location i and possibly stay there
                    // this is VERY IMPORTANT to ensure the sequence is really random
                    // and not biased
                } // end if...else
            } // end for
            for (int number: randomSequence) {
                System.out.printf("%2d ", number);
            } // end for
        } // end main
    } // end class RandomSequence
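
The same loading scheme can be sketched for reference types and a List, as the prose above suggests. This is only an illustration under assumptions: the class `RandomLoad` and the method `loadInRandomOrder` are hypothetical names, and the items are assumed to arrive through any `Iterable`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class RandomLoad {
    // Hypothetical helper: places each incoming item at a random position among
    // the items loaded so far, so the list is in random order once loading ends.
    static <T> List<T> loadInRandomOrder(Iterable<T> source, Random rng) {
        List<T> result = new ArrayList<>();
        for (T item : source) {
            // Any position from 0 up to and including the next new location.
            int pointer = rng.nextInt(result.size() + 1);
            if (pointer == result.size()) {
                result.add(item); // the new item stays in the new location
            } else {
                result.add(result.get(pointer)); // displaced item moves to the end
                result.set(pointer, item);       // new item takes its place
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> cards = Arrays.asList("A", "K", "Q", "J", "10");
        System.out.println(loadInRandomOrder(cards, new Random()));
    }
}
```

Each item is touched once, so no separate shuffle pass is needed after loading.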
    
  • 2020-11-21 12:26

    The simplest way would be to create a list of the possible numbers (1..20 or whatever) and then shuffle them with Collections.shuffle. Then just take however many elements you want. This is great if your range is equal to the number of elements you need in the end (e.g. for shuffling a deck of cards).
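
As a rough sketch of the shuffle approach, assuming the values are 1..max and that `count` of them are wanted (the class `ShufflePick` and the method `pick` are illustrative names, not part of any library):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShufflePick {
    // Hypothetical helper: returns `count` distinct numbers from 1..max in random order.
    static List<Integer> pick(int count, int max) {
        if (count > max) {
            throw new IllegalArgumentException("Can't ask for more numbers than are available");
        }
        List<Integer> numbers = new ArrayList<>(max);
        for (int i = 1; i <= max; i++) {
            numbers.add(i);
        }
        Collections.shuffle(numbers);
        // Copy the first `count` elements so the result does not depend on the full list.
        return new ArrayList<>(numbers.subList(0, count));
    }

    public static void main(String[] args) {
        System.out.println(pick(5, 20));
    }
}
```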

    That doesn't work so well if you want (say) 10 random elements in the range 1..10,000 - you'd end up doing a lot of work unnecessarily. At that point, it's probably better to keep a set of values you've generated so far, and just keep generating numbers in a loop until the next one isn't already present:

    if (max < numbersNeeded)
    {
        throw new IllegalArgumentException("Can't ask for more numbers than are available");
    }
    Random rng = new Random(); // Ideally just create one instance globally
    // Note: use LinkedHashSet to maintain insertion order
    Set<Integer> generated = new LinkedHashSet<Integer>();
    while (generated.size() < numbersNeeded)
    {
        Integer next = rng.nextInt(max) + 1;
        // As we're adding to a set, this will automatically do a containment check
        generated.add(next);
    }
    

    Be careful with the set choice though - I've very deliberately used LinkedHashSet as it maintains insertion order, which we care about here.

    Yet another option is to always make progress, by reducing the range each time and compensating for existing values. So for example, suppose you wanted 3 values in the range 0..9. On the first iteration you'd generate any number in the range 0..9 - let's say you generate a 4.

    On the second iteration you'd then generate a number in the range 0..8. If the generated number is less than 4, you'd keep it as is... otherwise you add one to it. That gets you a result range of 0..9 without 4. Suppose we get 7 that way.

    On the third iteration you'd generate a number in the range 0..7. If the generated number is less than 4, you'd keep it as is. If it's 4 or 5, you'd add one. If it's 6 or 7, you'd add two. That way the result range is 0..9 without 4 or 7.
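
One way the shrinking-range walkthrough above could be written is sketched below. It assumes values are drawn from 0..rangeSize-1 and keeps the already chosen values in a sorted list so each draw can be shifted past them; the class `ShrinkingRange` and the method `pick` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ShrinkingRange {
    // Hypothetical helper: picks `count` distinct values from 0..rangeSize-1,
    // returned in the order they were drawn. No draw is ever rejected.
    static List<Integer> pick(int count, int rangeSize, Random rng) {
        if (count > rangeSize) {
            throw new IllegalArgumentException("Can't ask for more numbers than are available");
        }
        List<Integer> sortedChosen = new ArrayList<>(); // chosen values, ascending
        List<Integer> inDrawOrder = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // The range shrinks by one for every value already chosen.
            int value = rng.nextInt(rangeSize - i);
            int insertAt = 0;
            for (int chosen : sortedChosen) {
                if (value >= chosen) {
                    value++;    // skip over a value that is already taken
                    insertAt++;
                } else {
                    break;
                }
            }
            sortedChosen.add(insertAt, value); // keep the list sorted
            inDrawOrder.add(value);
        }
        return inDrawOrder;
    }

    public static void main(String[] args) {
        System.out.println(pick(3, 10, new Random()));
    }
}
```

With 4 and 7 already chosen, a draw of 6 from 0..7 is bumped to 7 and then to 8, which matches the "add two" case in the walkthrough.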

  • 2020-11-21 12:27

    Generating all the indices of a sequence is generally a bad idea, as it might take a lot of time, especially if the ratio of the numbers to be chosen to MAX is low (the complexity becomes dominated by O(MAX)). This gets worse if the ratio of the numbers to be chosen to MAX approaches one, as then removing the chosen indices from the sequence of all also becomes expensive (we approach O(MAX^2/2)). But for small numbers, this generally works well and is not particularly error-prone.

    Filtering the generated indices by using a collection is also a bad idea, as some time is spent inserting the indices into the collection, and progress is not guaranteed because the same random number can be drawn several times (although for large enough MAX this is unlikely). The complexity could be close to O(k n log^2(n)/2), ignoring the duplicates and assuming the collection uses a tree for efficient lookup (but with a significant constant cost k of allocating the tree nodes and possibly having to rebalance).

    Another option is to generate the random values uniquely from the beginning, guaranteeing progress is being made. That means in the first round, a random index in [0, MAX] is generated:

    items i0 i1 i2 i3 i4 i5 i6 (total 7 items)
    idx 0       ^^             (index 2)
    

    In the second round, only [0, MAX - 1] is generated (as one item was already selected):

    items i0 i1    i3 i4 i5 i6 (total 6 items)
    idx 1          ^^          (index 2 out of these 6, but 3 out of the original 7)
    

    The values of the indices then need to be adjusted: if the second index falls at or after the first index in the sequence, it needs to be incremented to account for the gap. We can implement this as a loop, allowing us to select an arbitrary number of unique items.

    For short sequences, this is a quite fast O(n^2/2) algorithm:

    void RandomUniqueSequence(std::vector<int> &rand_num,
        const size_t n_select_num, const size_t n_item_num)
    {
        assert(n_select_num <= n_item_num);
    
        rand_num.clear(); // !!
    
        // b1: 3187.000 msec (the fastest)
        // b2: 3734.000 msec
        for(size_t i = 0; i < n_select_num; ++ i) {
            int n = n_Rand(n_item_num - i - 1);
            // get a random number
    
            size_t n_where = i;
            for(size_t j = 0; j < i; ++ j) {
                if(n + j < rand_num[j]) {
                    n_where = j;
                    break;
                }
            }
            // see where it should be inserted
    
            rand_num.insert(rand_num.begin() + n_where, 1, n + n_where);
            // insert it in the list, maintain a sorted sequence
        }
        // tier 1 - use comparison with offset instead of increment
    }
    

    Here n_select_num is your 5 and n_item_num is your MAX. n_Rand(x) returns a random integer in [0, x] (inclusive). This can be made a bit faster when selecting a lot of items (e.g. not 5 but 500) by using binary search to find the insertion point. To do that, we need to make sure that the requirements of binary search are met.

    We will do binary search with the comparison n + j < rand_num[j], which is the same as n < rand_num[j] - j. We need to show that rand_num[j] - j is still a sorted sequence for a sorted sequence rand_num[j]. Fortunately this is easily shown: the smallest difference between two adjacent elements of the original rand_num is one (the generated numbers are unique, so there is always a difference of at least 1), while the indices j subtracted from adjacent elements rand_num[j] differ by exactly 1. So in the "worst" case we get a constant sequence, but never a decreasing one. Binary search can therefore be used, yielding an O(n log(n)) algorithm:

    struct TNeedle { // in the comparison operator we need to make clear which argument is the needle and which is already in the list; we do that using the type system.
        int n;
    
        TNeedle(int _n)
            :n(_n)
        {}
    };
    
    class CCompareWithOffset { // custom comparison "n < rand_num[j] - j"
    protected:
        std::vector<int>::iterator m_p_begin_it;
    
    public:
        CCompareWithOffset(std::vector<int>::iterator p_begin_it)
            :m_p_begin_it(p_begin_it)
        {}
    
        bool operator ()(const int &r_value, TNeedle n) const
        {
            size_t n_index = &r_value - &*m_p_begin_it;
            // calculate index in the array
    
            return r_value < n.n + n_index; // or r_value - n_index < n.n
        }
    
        bool operator ()(TNeedle n, const int &r_value) const
        {
            size_t n_index = &r_value - &*m_p_begin_it;
            // calculate index in the array
    
            return n.n + n_index < r_value; // or n.n < r_value - n_index
        }
    };
    

    And finally:

    void RandomUniqueSequence(std::vector<int> &rand_num,
        const size_t n_select_num, const size_t n_item_num)
    {
        assert(n_select_num <= n_item_num);
    
        rand_num.clear(); // !!
    
        // b1: 3578.000 msec
        // b2: 1703.000 msec (the fastest)
        for(size_t i = 0; i < n_select_num; ++ i) {
            int n = n_Rand(n_item_num - i - 1);
            // get a random number
    
            std::vector<int>::iterator p_where_it = std::upper_bound(rand_num.begin(), rand_num.end(),
                TNeedle(n), CCompareWithOffset(rand_num.begin()));
            // see where it should be inserted
    
            rand_num.insert(p_where_it, 1, n + p_where_it - rand_num.begin());
            // insert it in the list, maintain a sorted sequence
        }
        // tier 4 - use binary search
    }
    

    I have tested this on three benchmarks. First, 3 numbers were chosen out of 7 items, and a histogram of the items chosen was accumulated over 10,000 runs:

    4265 4229 4351 4267 4267 4364 4257
    

    This shows that each of the 7 items was chosen approximately the same number of times, and there is no apparent bias caused by the algorithm. All the sequences were also checked for correctness (uniqueness of contents).

    The second benchmark involved choosing 7 numbers out of 5000 items. The time of several versions of the algorithm was accumulated over 10,000,000 runs. The results are denoted in comments in the code as b1. The simple version of the algorithm is slightly faster.

    The third benchmark involved choosing 700 numbers out of 5000 items. The time of several versions of the algorithm was again accumulated, this time over 10,000 runs. The results are denoted in comments in the code as b2. The binary search version of the algorithm is now more than two times faster than the simple one.

    The second method starts being faster when choosing more than approximately 75 items on my machine (note that the complexity of either algorithm does not depend on the number of items, MAX).

    It is worth mentioning that the above algorithms generate the random numbers in ascending order. It would be simple, though, to add another array in which the numbers are saved in the order in which they were generated, and to return that instead (at a negligible additional cost of O(n)). It is not necessary to shuffle the output: that would be much slower.

    Note that the sources are in C++, I don't have Java on my machine, but the concept should be clear.
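
For readers following along in Java, here is an unofficial sketch of the binary-search variant, not the author's code. It also keeps the separate draw-order list mentioned above. The class and method names are made up, `Random.nextInt(bound)` stands in for n_Rand, and the binary search is written by hand because the comparison n + j < sorted[j] depends on the index j.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomUniqueSequence {
    // Hypothetical port: returns selectCount distinct values from 0..itemCount-1
    // in the order they were drawn, while keeping a sorted copy internally.
    static List<Integer> generate(int selectCount, int itemCount, Random rng) {
        if (selectCount > itemCount) {
            throw new IllegalArgumentException("selectCount must not exceed itemCount");
        }
        List<Integer> sorted = new ArrayList<>(selectCount);
        List<Integer> drawOrder = new ArrayList<>(selectCount);
        for (int i = 0; i < selectCount; i++) {
            // Same range as n_Rand(n_item_num - i - 1): 0..itemCount-i-1 inclusive.
            int n = rng.nextInt(itemCount - i);
            // Find the first index j with n + j < sorted[j]; this is valid
            // because sorted[j] - j never decreases.
            int lo = 0;
            int hi = sorted.size();
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (n + mid < sorted.get(mid)) {
                    hi = mid;
                } else {
                    lo = mid + 1;
                }
            }
            int value = n + lo;    // shift past the lo smaller values already chosen
            sorted.add(lo, value); // maintain the sorted sequence
            drawOrder.add(value);
        }
        return drawOrder;
    }

    public static void main(String[] args) {
        System.out.println(generate(5, 20, new Random()));
    }
}
```

Inserting into the middle of an ArrayList still shifts elements, just as inserting into a std::vector does, so only the search step is logarithmic here.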

    EDIT:

    For amusement, I have also implemented the approach that generates a list with all the indices 0 .. MAX, chooses them randomly and removes them from the list to guarantee uniqueness. Since I've chosen quite a high MAX (5000), the performance is catastrophic:

    // b1: 519515.000 msec
    // b2: 20312.000 msec
    std::vector<int> all_numbers(n_item_num);
    std::iota(all_numbers.begin(), all_numbers.end(), 0);
    // generate all the numbers
    
    for(size_t i = 0; i < n_select_num; ++ i) {
        assert(all_numbers.size() == n_item_num - i);
        int n = n_Rand(n_item_num - i - 1);
        // get a random number
    
        rand_num.push_back(all_numbers[n]); // put it in the output list
        all_numbers.erase(all_numbers.begin() + n); // erase it from the input
    }
    // generate random numbers
    

    I have also implemented the approach with a set (a C++ collection), which actually comes second on benchmark b2, being only about 35% slower than the approach with the binary search (2296 msec versus 1703 msec). That is understandable, as the set uses a binary tree, where the insertion cost is similar to binary search. The only difference is the chance of getting duplicate items, which slows down the progress.

    // b1: 20250.000 msec
    // b2: 2296.000 msec
    std::set<int> numbers;
    while(numbers.size() < n_select_num)
        numbers.insert(n_Rand(n_item_num - 1)); // might have duplicates here
    // generate unique random numbers
    
    rand_num.resize(numbers.size());
    std::copy(numbers.begin(), numbers.end(), rand_num.begin());
    // copy the numbers from a set to a vector
    

    Full source code is here.
