Find an integer not among four billion given ones

Question:

It is an interview question:

Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB of memory. Follow up with what you would do if you had only 10 MB of memory.

My analysis:

The size of the file is 4 × 10⁹ × 4 bytes = 16 GB.

We can do external sorting, which tells us the range of the integers. My question is: what is the best way to detect the missing integer in the sorted set of integers?
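For the sorted approach, one linear scan is enough to find a gap. The following is a minimal sketch, not part of the original question: the class `SortedGapScan` and method `firstMissing` are hypothetical names, and the `int[]` stands in for the externally sorted file read sequentially. It assumes signed 32-bit values in ascending order, possibly with duplicates.

```java
final class SortedGapScan {
    /**
     * Returns the smallest signed 32-bit value that does not occur in
     * the ascending (possibly duplicated) sequence.
     */
    static int firstMissing(int[] sorted) {
        // Track the next value we expect; long avoids overflow past MAX_VALUE.
        long expected = Integer.MIN_VALUE;
        for (int v : sorted) {
            if (v > expected) {
                return (int) expected;   // gap found before v
            }
            if (v == expected) {
                expected++;              // value present, move on
            }
            // v < expected means a duplicate; skip it.
        }
        if (expected > Integer.MAX_VALUE) {
            throw new IllegalStateException("all 2^32 values are present");
        }
        return (int) expected;
    }
}
```

The scan itself needs constant memory, but the external sort before it costs several passes over the 16 GB file, which is why the answers below avoid sorting.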

My understanding (after reading all answers):

Assume we are talking about 32-bit integers. There are 2^32 ≈ 4.29 × 10⁹ distinct integers, slightly more than the four billion in the file, so at least one value must be missing.

Case 1: we have 1 GB = 1 × 10⁹ × 8 bits = 8 billion bits of memory.

Solution: if we use one bit to represent each distinct integer, 1 GB is enough (2^32 bits = 512 MB), and we don't need to sort. Implementation:

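import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

public class MissingInt {
    static final int RADIX = 8;
    // One bit per 32-bit value: 2^32 bits = 2^29 bytes = 512 MB of heap.
    static byte[] bitfield = new byte[1 << 29];

    static void F() throws FileNotFoundException {
        Scanner in = new Scanner(new FileReader("a.txt"));
        while (in.hasNextInt()) {
            // Read as unsigned so negative values map to a valid index.
            long n = in.nextInt() & 0xFFFFFFFFL;
            bitfield[(int) (n / RADIX)] |= (1 << (n % RADIX));
        }
        in.close();

        for (int i = 0; i < bitfield.length; i++) {
            for (int j = 0; j < RADIX; j++) {
                if ((bitfield[i] & (1 << j)) == 0) {
                    // Print the first missing value as a signed int and stop.
                    System.out.println((int) ((long) i * RADIX + j));
                    return;
                }
            }
        }
    }
}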

Case 2: 10 MB of memory = 10 × 10⁶ × 8 bits = 80 million bits.

Solution: There are 2^16 = 65536 possible 16-bit prefixes, so we build 65536 buckets, one per prefix. Each bucket needs a 4-byte counter, because in the worst case all four billion integers fall into the same bucket. In total that is 2^16 × 4 × 8 ≈ 2 million bits (256 KB).

1. Build the counter for each bucket during the first pass through the file.

2. Scan the buckets and find the first one with fewer than 65536 hits.

3. During a second pass through the file, build new buckets for the values whose high 16 bits match the prefix found in step 2, one bucket per low 16-bit value.

4. Scan the buckets built in step 3 and find the first bucket that doesn't have a hit.
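The four steps above can be sketched as follows. This is an illustrative sketch rather than the original poster's code: the class `TwoPassMissingInt` and method `findMissing` are hypothetical names, and the `int[]` stands in for the input file, with each loop over it corresponding to one sequential pass. The counters are `long` rather than 4-byte values because Java has no unsigned `int`, and a signed `int` counter could overflow if more than 2^31 values shared a prefix; that doubles the counter table to 512 KB, still far below 10 MB.

```java
final class TwoPassMissingInt {
    /** Returns a 32-bit value that does not occur in data. */
    static int findMissing(int[] data) {
        // Step 1 (first pass): count values per 16-bit high prefix.
        long[] counts = new long[1 << 16];
        for (int v : data) {
            counts[v >>> 16]++;
        }

        // Step 2: a bucket with fewer than 65536 hits cannot contain
        // all 65536 values that share its prefix.
        int prefix = -1;
        for (int p = 0; p < counts.length; p++) {
            if (counts[p] < (1 << 16)) {
                prefix = p;
                break;
            }
        }
        if (prefix < 0) {
            throw new IllegalStateException("every bucket has at least 65536 hits");
        }

        // Step 3 (second pass): mark which low 16-bit values occur
        // among the entries that carry the chosen prefix.
        boolean[] seen = new boolean[1 << 16];
        for (int v : data) {
            if ((v >>> 16) == prefix) {
                seen[v & 0xFFFF] = true;
            }
        }

        // Step 4: the first unmarked low value completes the answer.
        for (int low = 0; low < seen.length; low++) {
            if (!seen[low]) {
                return (prefix << 16) | low;
            }
        }
        throw new IllegalStateException("unreachable: bucket had fewer than 65536 hits");
    }
}
```

For example, `findMissing(new int[]{0, 1, 2, 65536, -1})` picks prefix 0 (three hits) and returns 3. With four billion 32-bit inputs step 2 always succeeds, since 65536 buckets with at least 65536 hits each would require 2^32 values.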

The code is very similar to the one above.

Conclusion: we reduce memory usage by making more passes over the file.

A clarification for those arriving late: the question, as asked, does not say that there is exactly one integer that is not contained in the file; at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing. Sorry.

Solution:

Reference 1: https://stackoom.com/question/U0zb/找出不超过-亿个给定整数的整数
Reference 2: https://oldbug.net/q/U0zb/Find-an-integer-not-among-four-billion-given-ones

Source: oschina

Link: https://my.oschina.net/stackoom/blog/4339332

