Algorithm Question on File Search Indexing

前端 未结 2 1386
离开以前
离开以前 2021-02-14 04:41

There is one question and I have the solution to it also. But I couldn\'t understand the solution. Kindly help with some set of examples and shower some experience.

相关标签:
2条回答
  • 2021-02-14 04:51

    To make the explanation simpler, let's say you have a list of two-digit numbers, where each digit is between 0 and 3, but you can't spare the 16 bits to remember for each of the 16 possible numbers, whether you have already encountered it. What you do is to create an array a of 4 3-bit integers and in a[i], you store how many numbers with the first digit i you encountered. (Two-bit integers wouldn't be enough, because you need the values 0, 4 and all numbers between them.)

    If you had the file

    00, 12, 03, 31, 01, 32, 02

    your array would look like this:

    4, 1, 0, 2

    Now you know that all numbers starting with 0 are in the file, but for each of the remaining, there is at least one missing. Let's pick 1. We know there is at least one number starting with 1 that is not in the file. So, create an array of 4 bits, for each number starting with 1 set the appropriate bit and in the end, pick one of the bits that wasn't set, in our example if could be 0. Now we have the solution: 10.

    In this case, using this method is the difference between 12 bits and 16 bits. With your numbers, it's the difference between 32 kB and 119 MB.

    0 讨论(0)
  • 2021-02-14 05:03

    In round terms, you have about 1/3 of the numbers that could exist in the file, assuming no duplicates.

    The idea is to make two passes through the data. Treat each number as a 32-bit (unsigned) number. In the first pass, keep a track of how many numbers have the same number in the most significant 16 bits. In practice, there will be a number of codes where there are zero (all those for 10-digit SSNs, for example; quite likely, all those with a zero for the first digit are missing too). But of the ranges with a non-zero count, most will not have 65536 entries, which would be how many would appear if there were no gaps in the range. So, with a bit of care, you can choose one of the ranges to concentrate on in the second pass.

    If you're lucky, you can find a range in the 100,000,000..999,999,999 with zero entries - you can choose any number from that range as missing.

    Assuming you aren't quite that lucky, choose one with the lowest number of bits (or any of them with less than 65536 entries); call it the target range. Reset the array to all zeroes. Reread the data. If the number you read is not in your target range, ignore it. If it is in the range, record the number by setting the array value to 1 for the low-order 16-bits of the number. When you've read the whole file, any of the numbers with a zero in the array represents a missing SSN.

    0 讨论(0)
提交回复
热议问题